dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0

Invalid Cas is created when loading/saving an .xmi #155

Closed DavidHuebner closed 3 years ago

DavidHuebner commented 3 years ago

Describe the bug: For various CAS XMI files, loading the file with cassis and then saving it again produces an invalid XMI. The newly written XMI is invalid in two respects:

  1. A duplicate xmi:id="1" exists
  2. Some annotations point to a sofa id that does not exist (this throws an error).
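Both symptoms can be checked without cassis. The following is a minimal sketch using only the Python standard library; the helper name `check_xmi` is hypothetical, and it assumes the usual UIMA XMI layout in which all feature structures are direct children of the root element and sofas are serialized as `cas:Sofa` elements in the `http:///uima/cas.ecore` namespace.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespaces as they normally appear in UIMA XMI files (assumption)
XMI_ID = "{http://www.omg.org/XMI}id"
SOFA_TAG = "{http:///uima/cas.ecore}Sofa"


def check_xmi(xml_data):
    """Return (duplicate xmi:id values, sofa references without a matching Sofa).

    Hypothetical helper, not part of cassis. Accepts XMI content as str or bytes.
    """
    root = ET.fromstring(xml_data)
    elements = list(root)  # feature structures are direct children of xmi:XMI

    # Count how often each xmi:id is used
    id_counts = Counter(
        e.get(XMI_ID) for e in elements if e.get(XMI_ID) is not None
    )
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)

    # Collect the ids of all serialized sofas
    sofa_ids = {e.get(XMI_ID) for e in elements if e.tag == SOFA_TAG}

    # Any "sofa" attribute must point at one of those ids
    dangling = sorted(
        {
            e.get("sofa")
            for e in elements
            if e.get("sofa") is not None and e.get("sofa") not in sofa_ids
        }
    )
    return duplicates, dangling
```

Called on the content of the saved file, for example `check_xmi(open("cas_out.xmi", "rb").read())`, a valid XMI should yield two empty lists.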

To reproduce the behavior:

  1. Download and unzip the TypeSystem and example xmi from files.zip
  2. Execute the following code block to load/save the xmi:
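import cassis

# Load the type system that belongs to the CAS
with open("TypeSystem.xml", "rb") as f:
    typesystem = cassis.load_typesystem(f)

# Load the original CAS from XMI
with open("cas_in.xmi", "rb") as f:
    cas = cassis.load_cas_from_xmi(f, typesystem=typesystem)

# Save the CAS again without modifying it
cas.to_xmi("cas_out.xmi", pretty_print=True)

# Reload the saved file; this is the step that fails
with open("cas_out.xmi", "rb") as f:
    cassis.load_cas_from_xmi(f, typesystem=typesystem)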

Expected behavior: Loading and saving the XMI should produce a new XMI that can be read again. Each element id (xmi:id) should occur only once.

Error message: The error is KeyError: 6, which occurs because some annotations are linked to a sofa with xmi:id=6 that does not exist in the newly generated XMI.

Additional context: Please note that in this CAS some annotations are not part of any index and are only referenced from other annotations. This might be the reason why the existing loading logic fails. If I add all annotations to the index, then I can load and save the XMI (although it still has a duplicate xmi:id="1").

jcklie commented 3 years ago

Please check whether it works now; I added a fix in master.

DavidHuebner commented 3 years ago

Hi Jan, thank you so much for your quick response!! All initial tests are positive, so I am hopeful that the issue is fixed. This helps us a lot, so thanks again! David