catalpa-cl / inceptalytics

An easy-to-use API for analyzing INCEpTION annotation projects.

Lack of information about uploaded files that weren't opened in INCEpTION UI #19

Open yanirmr opened 2 years ago

yanirmr commented 2 years ago

Hello,

I have run into behavior that appears to be a bug.

The bug: The from_remote method does not return information about files that have not been opened in the INCEpTION user interface.

To Reproduce: Follow these steps to reproduce the behavior:

  1. Using INCEpTION's UI, upload files with annotations (in my case, XMI files)
  2. Use the Project.from_remote method and run the Inceptalytics interface example
  3. View the Inceptalytics interface example and note that you have statistics only for files that have been opened before.
  4. In the INCEpTION user interface, open one more file
  5. Reopen the Inceptalytics interface example and note that you now also have statistics for the file that you opened in step 4.

Expected behavior: Information about all files in the project, regardless of whether a file has been opened.

Please complete the following information:

It is unclear whether the problem is caused by a bug in the INCEpTION API, the Python client, or the from_remote method.

FYI - @reckart

reckart commented 2 years ago

Note that INCEpTION creates the CAS data for annotators lazily. So if an annotator has never opened a document, then no CAS data for that user has been initialized, so it cannot be exported. If you wanted to have data for such users, you'd need to fall back to exporting a CAS XMI from the original source document.

yanirmr commented 2 years ago

I appreciate your prompt response. I suspected the matter was related to something like this. However, I am wondering what the alternative solution may be. I have uploaded 100 documents and their annotations. To obtain statistical information about all of these documents, I am trying to use Inceptalytics (through the INCEpTION API). Do I have to open each document as each user in order to obtain information about the document and its annotations? Can this be forced even though the documents have not been opened?

@reckart - We can move the discussion elsewhere if that's better.

zesch commented 2 years ago

Do we get the complete list of files from the project? Inceptalytics could then (as a non-default option) try to copy the source CAS in case an annotator has never touched a file. Not sure that is feasible though.

reckart commented 2 years ago

You get the project ZIP - there is a full list in there of course :)

INCEpTION internally obtains a copy of the "INITIAL_CAS" of a document whenever trying to access an annotator CAS and there is none. Inceptalytics probably should do the same thing transparently - even as a default option? Depending on the type of analytics, it might skip that though if it is clear that looking at the INITIAL_CAS would just waste time and not yield additional information. And/or it could cache information obtained from the INITIAL_CAS so that it wouldn't have to extract the same details over and over again, e.g. when working on a project with a large number of users (cf. crowdsourcing).
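A minimal sketch of the fallback-plus-cache idea described above, not Inceptalytics' actual implementation. The class name `FallbackCasStore`, the `(document, annotator)` dictionary, and the `load_initial_cas` callable are all hypothetical; how a CAS is really loaded is left to the caller.

```python
from functools import lru_cache


class FallbackCasStore:
    """Return an annotator's CAS, or the document's INITIAL_CAS if none exists.

    Hypothetical helper: `annotator_cas` maps (document, annotator) to an
    already loaded CAS, `load_initial_cas` loads the INITIAL_CAS of a document.
    """

    def __init__(self, annotator_cas, load_initial_cas):
        self._annotator_cas = annotator_cas
        # Cache per document, so that many annotators who never opened a
        # document do not trigger repeated loads of the same INITIAL_CAS.
        self._load_initial_cas = lru_cache(maxsize=None)(load_initial_cas)

    def get(self, document, annotator):
        cas = self._annotator_cas.get((document, annotator))
        if cas is not None:
            return cas
        # No annotator CAS was ever created: fall back to the initial state.
        return self._load_initial_cas(document)
```

Note that all annotators without their own CAS share the same cached object here, so this only suits read-only analytics. Code that modifies a CAS would need to copy it first.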

yanirmr commented 2 years ago

Hi @zesch

What are your thoughts on this? Could we describe this feature in more detail? It may be possible to implement this in the near future if this is the case.

Best

simulacrum6 commented 2 years ago

Hi,

we are not sure how fast we will be able to implement the required changes.

The problem arises from the way in which the internal representation of annotations is built from the XMI export. We are currently reading most information from the XMI files under the annotation/ directory in the exported zip. IIUC, those are only created once an annotator has viewed a particular source file, resulting in the behaviour @yanirmr described. (Another undesirable side effect of this is that annotators who did not view any source files will not be listed under Project.annotators.)
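To make the gap visible, here is a rough sketch that compares the source documents in an export with the documents that have at least one annotator file. It assumes entries of the form `source/<document>` and `annotation/<document>/<annotator>.<ext>`, and that an initial CAS, if exported, uses the name `INITIAL_CAS`. Both the layout and the function names are assumptions for illustration and should be checked against a real export.

```python
import zipfile
from collections import defaultdict

INITIAL_CAS = "INITIAL_CAS"  # assumed name of the initial CAS entry


def summarize_export(names):
    """Return (annotators per document, documents no annotator has opened)."""
    sources = set()
    annotators = defaultdict(set)
    for name in names:
        parts = name.split("/")
        if len(parts) == 2 and parts[0] == "source" and parts[1]:
            sources.add(parts[1])
        elif len(parts) == 3 and parts[0] == "annotation" and parts[2]:
            # Strip the file extension to get the annotator name.
            annotator = parts[2].rsplit(".", 1)[0]
            if annotator != INITIAL_CAS:
                annotators[parts[1]].add(annotator)
    untouched = sorted(sources - set(annotators))
    return dict(annotators), untouched


def summarize_zip(path):
    """Convenience wrapper that reads the entry names from a project ZIP."""
    with zipfile.ZipFile(path) as zf:
        return summarize_export(zf.namelist())
```

Documents listed as untouched are the ones for which statistics would currently be missing.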

I think the behaviour that @reckart described is a reasonable way of handling it. The INITIAL_CAS is not exported, correct? If so, what does the INITIAL_CAS contain? Just the Sofa, Sentences and Tokens?

reckart commented 2 years ago

@simulacrum6 You are right, it looks like the INITIAL_CAS is not exported in the secondary format (it is included in the binary SER format though) - but that should be easy to change on the INCEpTION side.

reckart commented 2 years ago

Ok, SNAPSHOT builds of INCEpTION main and 25.x branches now also include the INITIAL_CAS in the secondary format. So based on a recent build, you could start adding a fallback to the INITIAL_CAS whenever an annotator CAS is not available.