dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0

Loading feature structures referencing NULL throws #116

Closed dmitriydligach closed 4 years ago

dmitriydligach commented 4 years ago

I have several hundred XMI files that I'd like to interact with using CASSIS. I was able to read 50-60 of them successfully (thank you for addressing the issues I recently pointed out!). However, one XMI file causes a problem. Unfortunately, I am not able to give you access to this file, but perhaps you have some idea of what the problem might be from looking at the error?

The code is roughly this:

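from cassis import load_cas_from_xmi, load_typesystem

# Load the type system first, then the XMI file that uses it
with open(<path>, 'rb') as type_system_file:
    type_system = load_typesystem(type_system_file)
with open(xmi_path, 'rb') as xmi_file:
    cas = load_cas_from_xmi(xmi_file, typesystem=type_system)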

Here's the error:

Traceback (most recent call last):
  File "./dtrdata.py", line 124, in <module>
    inputs, labels, masks = dtr_data.read()
  File "./dtrdata.py", line 46, in read
    cas = load_cas_from_xmi(xmi_file, typesystem=type_system)
  File "/usr/local/lib/python3.6/site-packages/dkpro_cassis-0.2.8.dev0-py3.6.egg/cassis/xmi.py", line 40, in load_cas_from_xmi
    return deserializer.deserialize(source, typesystem=typesystem)
  File "/usr/local/lib/python3.6/site-packages/dkpro_cassis-0.2.8.dev0-py3.6.egg/cassis/xmi.py", line 161, in deserialize
    target = feature_structures[target_id]
KeyError: 0

jcklie commented 4 years ago

From the code, it looks like the entry references another entry which is not defined. Can you look in your data and tell me what entry has ID 0? I would guess it is <cas:NULL xmi:id="0"/>.
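
One way to check which element carries a given xmi:id, independently of cassis, is to scan the file with the standard library. This is only a minimal sketch: the helper name find_by_xmi_id and the sample document are made up for illustration, and it assumes the file uses the standard XMI namespace http://www.omg.org/XMI.

```python
import xml.etree.ElementTree as ET

# Assumption: the file declares the standard XMI namespace.
XMI_ID = "{http://www.omg.org/XMI}id"


def find_by_xmi_id(xml_text, xmi_id):
    """Return the tags of all elements whose xmi:id equals xmi_id."""
    root = ET.fromstring(xml_text)
    return [elem.tag for elem in root.iter() if elem.get(XMI_ID) == str(xmi_id)]


# Hypothetical sample, not taken from the reporter's data.
sample = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
    xmlns:cas="http:///uima/cas.ecore" xmi:version="2.0">
  <cas:NULL xmi:id="0"/>
  <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView"/>
</xmi:XMI>"""

print(find_by_xmi_id(sample, 0))
```

For a real file, the text would be read from disk first; an empty result would mean that no element defines that ID at all.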

dmitriydligach commented 4 years ago

Ok, I do see this in my XMI file:

xmi:version="2.0"><cas:Sofa xmi:id="1" sofaNum="2" sofaID="UriView" ...

but also:

<textsem:PersonTitleAnnotation xmi:id="25240" sofa="8" begin="163" end="167" id="0" typeID="0" discoveryTechnique="0" ...

<textsem:DateAnnotation xmi:id="25492" sofa="8" begin="2482" end="2486" id="0" typeID="0" discoveryTechnique="0" ...

and others.

Does this help?

reckart commented 4 years ago

Are all these from the same file??

reckart commented 4 years ago

Ah, some have an "id" attribute with value 0, but the "xmi:id" is always different.

jcklie commented 4 years ago

@dmitriydligach Do you have a href="#0" somewhere in your XMI?

dmitriydligach commented 4 years ago

@reckart Sorry, to clarify, there's only one entry with xmi:id="0". It's this one:

xmlns:type2="http:///org/apache/ctakes/constituency/parser/uima/type.ecore" xmi:version="2.0">

@jcklie I did not find href="#0" in this XMI file.

jcklie commented 4 years ago

It is really difficult to guess what the error could be. Can you put a print statement before the line where it fails and tell me which annotation type it is that breaks, maybe even post this annotation?

dmitriydligach commented 4 years ago

@jcklie Sure, I added a few print statements:

            # Resolve references
            if typesystem.is_collection(fs.type, feature):
                # A collection of references is a list of integers separated
                # by single spaces, e.g. <foo:bar elements="1 2 3 42" />
                targets = []
                for ref in value.split():
                    target_id = int(ref)

                    if target_id == 0:
                        print('target_id:', target_id)
                        print('value:', value)
                        print('fs:', fs)
                        print('feature_name:', feature_name)
                        print('fs.type:', fs.type)
                        print('feature:', feature)
                        # print('feature_structures:', feature_structures)

These print the following right before the crash:

target_id: 0
value: 0 9911
fs: org_apache_ctakes_typesystem_type_relation_CollectionTextRelation(xmiID=12107, members='0 9911', id='0', category=None, discoveryTechnique='0', confidence='0.0', polarity='0', uncertainty='0', conditional='false', type='org.apache.ctakes.typesystem.type.relation.CollectionTextRelation')
feature_name: members
fs.type: org.apache.ctakes.typesystem.type.relation.CollectionTextRelation
feature: Feature(name='members', rangeTypeName='uima.cas.FSList', description='A super-type for relationships between multiple spans of text.', elementType='org.apache.ctakes.typesystem.type.relation.RelationArgument', multipleReferencesAllowed=None, _has_reserved_name=False)

So, I think I found the corresponding annotation from the XMI file:

<relation:CollectionTextRelation xmi:id="12107" id="0" discoveryTechnique="0" confidence="0.0" polarity="0" uncertainty="0" conditional="false" members="0 9911"/>

Does this help?
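
The same kind of reference can be found without patching the library by listing every ID in a reference-valued attribute that no element defines. The sketch below is illustrative only: the helper dangling_refs, the namespace URI for the relation prefix, and the RelationArgument element with xmi:id 9911 are assumptions, not content of the actual file.

```python
import xml.etree.ElementTree as ET

# Assumption: the file declares the standard XMI namespace.
XMI_ID = "{http://www.omg.org/XMI}id"


def dangling_refs(xml_text, feature_name):
    """List (owner xmi:id, referenced id) pairs whose target is not defined."""
    root = ET.fromstring(xml_text)
    defined = {e.get(XMI_ID) for e in root.iter() if e.get(XMI_ID) is not None}
    missing = []
    for elem in root.iter():
        # A multi-valued reference is a space-separated list of IDs.
        for ref in (elem.get(feature_name) or "").split():
            if ref not in defined:
                missing.append((elem.get(XMI_ID), ref))
    return missing


# Hypothetical sample modelled on the annotation above; it defines no xmi:id="0".
sample = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
    xmlns:relation="http:///example/relation.ecore" xmi:version="2.0">
  <relation:RelationArgument xmi:id="9911"/>
  <relation:CollectionTextRelation xmi:id="12107" id="0" members="0 9911"/>
</xmi:XMI>"""

print(dangling_refs(sample, "members"))
```

Only the named attribute is inspected, so plain attributes such as id="0" are not mistaken for references.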

reckart commented 4 years ago

Aside from the bug in cassis: if I see this right, the relation has a list feature members that contains a null value, and that null is what triggers the reference to xmi:id=0 (the null feature structure). So your software may have a bug of its own, in that this null reference shouldn't be there in the first place (i.e. it should maybe be members="9911").
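
A deserializer can tolerate such a reference by mapping ID 0 to a null value instead of looking it up. The following is a minimal sketch of that idea, not the code that cassis actually uses; the function name resolve_members and the stand-in dictionary are hypothetical.

```python
def resolve_members(value, feature_structures):
    """Resolve a space-separated list of xmi:id references.

    Assumption: ID 0 denotes the NULL feature structure and maps to None.
    """
    targets = []
    for ref in value.split():
        target_id = int(ref)
        if target_id == 0:
            # Null reference: nothing to look up.
            targets.append(None)
        else:
            targets.append(feature_structures[target_id])
    return targets


# Hypothetical stand-in for the table of already-parsed feature structures.
feature_structures = {9911: "RelationArgument#9911"}

print(resolve_members("0 9911", feature_structures))
```

Any other ID that is missing from the table still raises KeyError, so genuinely dangling references remain visible.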

dmitriydligach commented 4 years ago

@reckart Thanks for pointing it out. Most likely this is not an issue with our software -- instead it might be an annotation error (we have a reader that populates these things in the CAS).

jcklie commented 4 years ago

@dmitriydligach I pushed a fix. Can you check whether it works for you in master?

dmitriydligach commented 4 years ago

@jcklie It worked! Thank you so much for addressing this issue so quickly. I will close it.

jcklie commented 4 years ago

I released 0.2.8