dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0

Gracefully handle empty CAS xmi files #193

Closed DavidHuebner closed 2 years ago

DavidHuebner commented 2 years ago

Describe the bug

Our pipeline produced empty .xmi files because the import data was not clean and contained empty documents. This is a cleaned example of such a file:

<?xml version="1.0" encoding="UTF-8"?>
<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:tcas="http:///uima/tcas.ecore" xmlns:cas="http:///uima/cas.ecore" xmi:version="2.0">
    <cas:NULL xmi:id="0" />
    <tcas:DocumentAnnotation xmi:id="2" sofa="1" begin="0" end="0" language="en" />
    <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="" />
    <cas:View sofa="1" members="2" />
</xmi:XMI>

Loading this file with load_cas_from_xmi raises a KeyError.

To Reproduce

Steps to reproduce the behavior:

from cassis import load_cas_from_xmi
xmi="""<?xml version="1.0" encoding="UTF-8"?>
<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:tcas="http:///uima/tcas.ecore" xmlns:cas="http:///uima/cas.ecore" xmi:version="2.0">
    <cas:NULL xmi:id="0" />
    <tcas:DocumentAnnotation xmi:id="2" sofa="1" begin="0" end="0" language="en" />
    <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="" />
    <cas:View sofa="1" members="2" />
</xmi:XMI>"""
load_cas_from_xmi(xmi)

Expected behavior

cassis should successfully create an empty CAS from this file. This matters to us because it avoids downstream problems.

Error message

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_32727/2857839666.py in <module>
----> 1 cas = load_cas_from_xmi(xmi)

~/.local/lib/python3.8/site-packages/cassis/xmi.py in load_cas_from_xmi(source, typesystem, lenient, trusted)
     41     deserializer = CasXmiDeserializer()
     42     if isinstance(source, str):
---> 43         return deserializer.deserialize(
     44             BytesIO(source.encode("utf-8")), typesystem=typesystem, lenient=lenient, trusted=trusted
     45         )

~/.local/lib/python3.8/site-packages/cassis/xmi.py in deserialize(self, source, typesystem, lenient, trusted)
    225                 # Map from offsets in UIMA UTF-16 based offsets to Unicode codepoints
    226                 if typesystem.is_instance_of(fs.type, "uima.tcas.Annotation"):
--> 227                     fs.begin = sofa._offset_converter.uima_to_cassis(fs.begin)
    228                     fs.end = sofa._offset_converter.uima_to_cassis(fs.end)
    229 

~/.local/lib/python3.8/site-packages/cassis/cas.py in uima_to_cassis(self, idx)
     66         if idx is None:
     67             return None
---> 68         return self._uima_to_cassis[idx]
     69 
     70     def cassis_to_uima(self, idx: Optional[int]) -> Optional[int]:

KeyError: 0
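The traceback ends in a dictionary lookup for offset 0, which suggests the offset table has no entries when the sofa string is empty. The following is a minimal sketch of that failure mode, not the actual cassis implementation: `build_uima_to_codepoint_map` and its `include_end` flag are hypothetical names used only for illustration. It maps UTF-16 code unit offsets (as used by UIMA) to Python code point offsets.

```python
from typing import Dict


def build_uima_to_codepoint_map(text: str, include_end: bool = True) -> Dict[int, int]:
    """Map UTF-16 code unit offsets (UIMA) to Python code point offsets.

    Hypothetical helper for illustration; not part of cassis.
    """
    mapping: Dict[int, int] = {}
    utf16_offset = 0
    for codepoint_offset, char in enumerate(text):
        mapping[utf16_offset] = codepoint_offset
        # Characters outside the Basic Multilingual Plane take two UTF-16 code units.
        utf16_offset += 2 if ord(char) > 0xFFFF else 1
    if include_end:
        # The offset just past the last character is a valid annotation end.
        # For an empty text it is the only valid offset (0).
        mapping[utf16_offset] = len(text)
    return mapping


# Without the end entry, an empty text yields an empty table,
# so looking up offset 0 raises KeyError, as in the traceback above.
table_without_end = build_uima_to_codepoint_map("", include_end=False)

# With the end entry, begin=0 and end=0 both resolve for an empty document.
table_with_end = build_uima_to_codepoint_map("")
```

Under this assumption, a DocumentAnnotation with begin="0" and end="0" on an empty sofa resolves only if the table includes the position past the last character.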


jcklie commented 2 years ago

Thanks for the report. This is fixed in master and will be part of the next release.
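For anyone who cannot upgrade yet, one possible workaround is to detect empty documents before handing them to cassis. The sketch below uses only the Python standard library and assumes the document text is stored in the sofaString attribute of cas:Sofa elements, as in the example file above. The function name `has_only_empty_sofas` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:cas declaration in the example XMI.
CAS_NS = "http:///uima/cas.ecore"


def has_only_empty_sofas(xmi: str) -> bool:
    """Return True if no cas:Sofa element in the XMI carries any text.

    Hypothetical helper for illustration; not part of cassis.
    """
    # Encode to bytes so the XML declaration's encoding is honored by the parser.
    root = ET.fromstring(xmi.encode("utf-8"))
    sofas = root.findall(f"{{{CAS_NS}}}Sofa")
    # A missing or empty sofaString attribute counts as "no text".
    return all(not sofa.get("sofaString") for sofa in sofas)
```

A pipeline could call this check first and skip or special-case such files instead of passing them to load_cas_from_xmi on affected versions.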