dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0

XMI Serialization breaks Nested Feature Structures #121

Closed Daedo closed 4 years ago

Daedo commented 4 years ago

Describe the bug During serialization of a CAS to XMI, annotations are assigned new IDs. This creates two issues:

  1. If an annotation contains nested feature structures, the inner feature structures no longer point to their original values.
  2. Any reference to a feature structure that is stored outside the CAS breaks, since IDs get reassigned.

To Reproduce Steps to reproduce the behavior:

For 1:

  1. Implement a feature structure mimicking a linked list with a node feature that is able to contain other node features.
  2. Serialize the cas => After serialization the nodes no longer point to each other. (Trying to deserialize is likely to fail)

For 2:

  1. Simply select any annotation and store its ID to a file before serializing the CAS.
  2. Read the created XMI and the stored ID.
  3. Try to get the same annotation via the stored id.
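
Both reproduction paths come down to comparing which element carries which `xmi:id` before and after a round trip. The following is a minimal sketch of such a check using only the Python standard library, not cassis itself. The helper names `ids_to_tags` and `changed_ids`, the `ex:` namespace, and the two tiny XMI snippets are hypothetical and only mimic the situation described in this issue.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xmi:id attribute (XMI namespace as used in UIMA XMI files).
XMI_ID = "{http://www.omg.org/XMI}id"


def ids_to_tags(xmi_text):
    """Map each xmi:id in an XMI document to the tag of the element carrying it."""
    root = ET.fromstring(xmi_text)
    return {int(el.attrib[XMI_ID]): el.tag for el in root.iter() if XMI_ID in el.attrib}


def changed_ids(before_xmi, after_xmi):
    """Return IDs that vanished or now belong to a different element type."""
    before = ids_to_tags(before_xmi)
    after = ids_to_tags(after_xmi)
    return sorted(i for i, tag in before.items() if after.get(i) != tag)


# Hypothetical data: a token and a two-node linked list before the round trip ...
BEFORE = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:ex="http:///example.ecore" xmi:version="2.0">
  <ex:Token xmi:id="1" begin="0" end="3"/>
  <ex:Node xmi:id="236" next="242"/>
  <ex:Node xmi:id="242"/>
</xmi:XMI>"""

# ... and a made-up result in which ID 242 has been handed to the token.
AFTER = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:ex="http:///example.ecore" xmi:version="2.0">
  <ex:Node xmi:id="236" next="242"/>
  <ex:Token xmi:id="242" begin="0" end="3"/>
</xmi:XMI>"""

print(changed_ids(BEFORE, AFTER))
```

An empty list would mean every ID still refers to an element of the same type. In this made-up example the check flags the IDs that disappeared or changed type, which is the symptom reported here.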

Expected behavior The annotation id (xmiid) should not change during serialization.

Additional context I'm trying to write a recommender for inception and noticed that the cas returned by the recommender is malformed.

jcklie commented 4 years ago

Do you have a minimal CAS that triggers this error?

Daedo commented 4 years ago

cas.zip

This might not be the smallest CAS, but it is the smallest where I saw this error occur:

Case 1: ID reassignment breaks nested feature structures

The CAS contains what is basically a linked list: each node points to its next node if there is one. Additionally, the nodes point to the feature structure containing the actual data. Take the list with IDs 236, 242, 248. After reading the XMI and then writing it again, the node with ID 236 still exists and still points to 242; however, 242 is now a token (so the reference now has an invalid data type). The other nodes are missing entirely.

Case 2: ID reassignment breaks references by ID

Now consider the annotation with ID 29. It statically stores a parentid pointing to node 193 (in this case it really just stores the number rather than the feature structure itself). Annotation 193 is a "type" annotation. After reading and writing, 193 also became the ID of a token.

jcklie commented 4 years ago

Thank you for reporting and using cassis! I have hopefully fixed it in master by keeping the IDs when loading XMI; I cannot recall why I regenerated them. You may want to try it out: you can install master via pip using python -m pip install git+https://github.com/dkpro/dkpro-cassis. I will make a release soonish.

reckart commented 4 years ago

How do you ensure that new FSes do not get IDs that have already been used?

jcklie commented 4 years ago

I save the maximum id and generate from there on.
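
The idea of continuing from the maximum loaded ID can be sketched as follows. This is an illustration under assumptions, not the actual cassis implementation: the class name `IdGenerator` and its method `next_id` are hypothetical.

```python
class IdGenerator:
    """Hands out fresh IDs strictly above every ID seen while loading (hypothetical helper)."""

    def __init__(self, existing_ids=()):
        # Start one past the largest loaded ID; start at 1 if nothing was loaded.
        self._next = max(existing_ids, default=0) + 1

    def next_id(self):
        value = self._next
        self._next += 1
        return value


# IDs taken from the linked list discussed in this issue.
generator = IdGenerator([236, 242, 248])
first_new_id = generator.next_id()
second_new_id = generator.next_id()
print(first_new_id, second_new_id)
```

Because new IDs are always larger than any loaded ID, existing feature structures keep their IDs and new ones cannot collide with them, even if the loaded IDs have gaps.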

reckart commented 4 years ago

Is there some kind of controlled failure / error message when the IDs overflow?

jcklie commented 4 years ago

How would they overflow? In Python, integers cannot overflow; they have arbitrary precision (a bigint equivalent).

reckart commented 4 years ago

Ok. It could be that UIMA (Java) uses int for XMI IDs. Having had a cursory look at the UIMA code, I expect that XMI IDs > MAX_INT, and possibly those smaller than 0, might cause the UIMA Java deserialization to fail.
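
A controlled failure on the Python side could look like the sketch below. It assumes, as suspected above but not confirmed in this thread, that the Java side reads XMI IDs as signed 32-bit integers. The function name `check_xmi_id` is hypothetical, and whether 0 itself is acceptable is also an assumption (the sketch allows it).

```python
# Largest value of a signed 32-bit Java int (Integer.MAX_VALUE).
JAVA_INT_MAX = 2**31 - 1


def check_xmi_id(xmi_id):
    """Return xmi_id unchanged, or raise ValueError if it would not fit a Java int.

    Assumption: negative IDs are rejected as well, following the suspicion above.
    """
    if not 0 <= xmi_id <= JAVA_INT_MAX:
        raise ValueError(f"XMI ID {xmi_id} is outside the range 0..{JAVA_INT_MAX}")
    return xmi_id


print(check_xmi_id(248))
try:
    check_xmi_id(JAVA_INT_MAX + 1)
except ValueError as error:
    print("rejected:", error)
```

Calling such a check whenever an ID is generated or written would turn a silent incompatibility into an explicit error message.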

jcklie commented 4 years ago

No further input, closing.