dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0

Error parsing FSList in CTAKES xmi #238

Closed by aarongiera 2 years ago

aarongiera commented 2 years ago

Describe the bug: I encountered an error while loading a cTAKES XMI file.

To Reproduce: I've attached the cTAKES TypeSystem.xml file here. Loading it produces some warnings due to duplicate features, but I think this is normal. I've also attached the XMI. It's just the default clinical pipeline run on the example note from the cTAKES install guide. I think this is the offending XML element:

<textsem:Predicate xmi:id="5931" sofa="1" begin="296" end="304" relations="5950 5960" frameSet="assess.01"/>

Here's the code:

import cassis
import os

ctakes_home = os.getenv("CTAKES_HOME")

typesystem_path = ctakes_home + "/resources/org/apache/ctakes/typesystem/types/TypeSystem.xml"
with open(typesystem_path, 'rb') as f:
    typesystem = cassis.load_typesystem(f)

fname = "/var/data/example-ctakes/example.txt.xmi"

# load cas
with open(fname, 'rb') as f:
    cas = cassis.load_cas_from_xmi(f, typesystem=typesystem)

for sentence in cas.select('org.apache.ctakes.typesystem.type.textspan.Sentence'):
    print('\nsentence:', sentence.get_covered_text())

    for token in cas.select_covered('org.apache.ctakes.typesystem.type.syntax.BaseToken', sentence):
        # for token in cas.select_covered('org.apache.ctakes.typesystem.type.syntax.WordToken', sentence):
        print('token:', token.get_covered_text())

Error message: I printed some variables at the breakpoint for debugging.

xmi_id: 5931
feature.name: relations
feature.description: None
feature.domainType.name: org.apache.ctakes.typesystem.type.textsem.Predicate
feature.elementType.name: org.apache.ctakes.typesystem.type.textsem.SemanticRoleRelation
fs.type.name: org.apache.ctakes.typesystem.type.textsem.Predicate
feature.rangeType.name: uima.cas.FSList
value: 5950 5960

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_19328/774756364.py in <module>
     18 # load cas
     19 with open(fname, 'rb') as f:
---> 20     cas = cassis.load_cas_from_xmi(f, typesystem=typesystem)
     21 
     22 view = "_InitialView"

/var/data/conda/envs/ml/lib/python3.7/site-packages/cassis/xmi.py in load_cas_from_xmi(source, typesystem, lenient, trusted)
     83             return deserializer.deserialize(src, typesystem=typesystem, lenient=lenient, trusted=trusted)
     84     else:
---> 85         return deserializer.deserialize(source, typesystem=typesystem, lenient=lenient, trusted=trusted)
     86 
     87 

/var/data/conda/envs/dsp001/lib/python3.7/site-packages/cassis/xmi.py in deserialize(self, source, typesystem, lenient, trusted)
    253                             print("feature.rangeType.name: %s" % feature.rangeType.name)
    254                             print("value: %s" % value)
--> 255                         target_id = int(value)
    256                         fs[feature_name] = feature_structures[target_id]
    257                         referenced_fs.add(target_id)

ValueError: invalid literal for int() with base 10: '5950 5960'
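
To make the failure concrete, here is a minimal sketch in plain Python (no cassis calls) of what happens to the `relations` value from the element above. The variable names are illustrative only and are not taken from the cassis code.

```python
# The attribute value from the offending element: two xmi:id references
# separated by a single space.
value = "5950 5960"

try:
    # Treating the whole attribute as a single reference fails.
    target_id = int(value)
except ValueError as error:
    print(error)

# Splitting on whitespace first yields one integer per referenced id.
target_ids = [int(part) for part in value.split()]
print(target_ids)
```

This only shows why the single `int()` conversion cannot work for a multi-valued feature; it says nothing about how the resulting ids should be stored.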

Potential Solution: This gets rid of the error, but I'm not sure if this is the proper way to fix it. I'm not very familiar with UIMA yet.

diff --git a/cassis/typesystem.py b/cassis/typesystem.py
index 73de09c..032f7f7 100644
--- a/cassis/typesystem.py
+++ b/cassis/typesystem.py
@@ -29,6 +29,7 @@ TYPE_NAME_LONG = UIMA_CAS_PREFIX + "Long"
 TYPE_NAME_DOUBLE = UIMA_CAS_PREFIX + "Double"
 TYPE_NAME_ARRAY_BASE = UIMA_CAS_PREFIX + "ArrayBase"
 TYPE_NAME_FS_ARRAY = UIMA_CAS_PREFIX + "FSArray"
+TYPE_NAME_FS_LIST = UIMA_CAS_PREFIX + "FSList"
 TYPE_NAME_INTEGER_ARRAY = UIMA_CAS_PREFIX + "IntegerArray"
 TYPE_NAME_FLOAT_ARRAY = UIMA_CAS_PREFIX + "FloatArray"
 TYPE_NAME_STRING_ARRAY = UIMA_CAS_PREFIX + "StringArray"
diff --git a/cassis/xmi.py b/cassis/xmi.py
index 67657cc..267f40d 100644
--- a/cassis/xmi.py
+++ b/cassis/xmi.py
@@ -24,6 +24,7 @@ from cassis.typesystem import (
     TYPE_NAME_FLOAT,
     TYPE_NAME_FLOAT_ARRAY,
     TYPE_NAME_FS_ARRAY,
+    TYPE_NAME_FS_LIST,
     TYPE_NAME_INTEGER,
     TYPE_NAME_INTEGER_ARRAY,
     TYPE_NAME_LONG,
@@ -227,7 +228,7 @@ class CasXmiDeserializer:
                         continue

                     # Resolve references
-                    if fs.type.name == TYPE_NAME_FS_ARRAY or feature.rangeType.name == TYPE_NAME_FS_ARRAY:
+                    if fs.type.name == TYPE_NAME_FS_ARRAY or feature.rangeType.name == TYPE_NAME_FS_ARRAY or feature.rangeType.name == TYPE_NAME_FS_LIST:
                         # An array of references is a list of integers separated
                         # by single spaces, e.g. <foo:bar elements="1 2 3 42" />
                         targets = []

reckart commented 2 years ago

Thanks, that's indeed a case not yet covered by the UIMA/cassis XMI test suite. Looking into it!

reckart commented 2 years ago

<?xml version="1.0" encoding="UTF-8"?>
<xmi:XMI xmlns:noNamespace="http:///uima/noNamespace.ecore" xmlns:tcas="http:///uima/tcas.ecore" xmlns:xmi="http://www.omg.org/XMI" xmlns:cas="http:///uima/cas.ecore" xmi:version="2.0">
    <cas:NULL xmi:id="0"/>
    <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView"/>
    <tcas:Annotation xmi:id="2" sofa="1" begin="0" end="0"/>
    <tcas:Annotation xmi:id="3" sofa="1" begin="0" end="0"/>
    <cas:View sofa="1" members="6 7"/>

    <!-- We support this case, where a list is added directly to an index or where "multipleReferencesAllowed=true" -->
    <cas:NonEmptyFSList xmi:id="6" tail="5" head="3"/>
    <cas:NonEmptyFSList xmi:id="5" tail="4" head="2"/>
    <cas:EmptyFSList xmi:id="4"/>

    <!-- This case, where "multipleReferencesAllowed=false" and the list is not indexed, is currently not supported -->
    <noNamespace:FsListHolder xmi:id="7" fsList="3 2"/>
</xmi:XMI>

<?xml version="1.0" encoding="UTF-8"?>
<xmi:XMI xmlns:noNamespace="http:///uima/noNamespace.ecore" xmlns:tcas="http:///uima/tcas.ecore" xmlns:xmi="http://www.omg.org/XMI" xmlns:cas="http:///uima/cas.ecore" xmi:version="2.0">
    <cas:NULL xmi:id="0"/>
    <cas:NonEmptyFSList xmi:id="6" tail="5" head="3"/>
    <noNamespace:FsListHolder xmi:id="7" fsList="3 2"/>
    <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView"/>
    <cas:NonEmptyFSList xmi:id="5" tail="4" head="2"/>
    <cas:EmptyFSList xmi:id="4"/>
    <tcas:Annotation xmi:id="2" sofa="1" begin="0" end="0"/>
    <tcas:Annotation xmi:id="3" sofa="1" begin="0" end="0"/>
    <cas:View sofa="1" members="6 7"/>
</xmi:XMI>

I think the suggested fix looks ok.

@jcklie WDYT?

jcklie commented 2 years ago

Looks good to me

reckart commented 2 years ago

OK, so the proposed fix is not really a fix, because it stores the "tail" feature of the list as a Python list. What should be done instead is to decode the array of identifiers into a series of linked NonEmptyFSList and EmptyFSList instances - at least for the time being.
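
As a rough illustration of that decoding step, here is a minimal sketch in plain Python. The two classes and the `decode_fs_list` helper are hypothetical stand-ins written for this example, not the cassis API; `feature_structures` is assumed to be a mapping from xmi:id to an already deserialized feature structure.

```python
from dataclasses import dataclass
from typing import Any, Dict, Union


@dataclass
class EmptyFSList:
    """Hypothetical stand-in for uima.cas.EmptyFSList (list terminator)."""


@dataclass
class NonEmptyFSList:
    """Hypothetical stand-in for uima.cas.NonEmptyFSList (one list cell)."""
    head: Any
    tail: Union["NonEmptyFSList", EmptyFSList]


def decode_fs_list(value: str, feature_structures: Dict[int, Any]):
    """Turn an attribute such as "3 2" into a chain of linked list cells."""
    # Build from the back so that each new cell can point at the rest.
    node: Union[NonEmptyFSList, EmptyFSList] = EmptyFSList()
    for xmi_id in reversed(value.split()):
        node = NonEmptyFSList(head=feature_structures[int(xmi_id)], tail=node)
    return node


# Placeholder objects standing in for the two annotations in the test XMI.
feature_structures = {2: "annotation-2", 3: "annotation-3"}
fs_list = decode_fs_list("3 2", feature_structures)
```

With the value `"3 2"`, the result mirrors the explicit encoding in the test file: a cell whose head is element 3, whose tail is a cell with head 2, whose tail is the empty list.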

It would be far more convenient if an FSList (IntegerList, etc.) could be handled as a Python list/array, but I'm not exactly sure how to implement this (yet).
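
One possible direction, sketched here under assumptions and not as a proposal for the actual implementation: walk the linked cells and collect the heads into a regular Python list. The helper name `fs_list_to_python` is made up for this example, and it relies only on duck typing (a non-empty cell has `head` and `tail` attributes, the empty list has neither), so the objects below are simple placeholders.

```python
from types import SimpleNamespace


def fs_list_to_python(node):
    """Collect the heads of a linked FSList-like structure into a list."""
    items = []
    # A cell without a "head" attribute is treated as the empty list.
    while hasattr(node, "head"):
        items.append(node.head)
        node = node.tail
    return items


# Placeholder cells: "a" -> "b" -> empty list.
empty = SimpleNamespace()
linked = SimpleNamespace(head="a", tail=SimpleNamespace(head="b", tail=empty))
values = fs_list_to_python(linked)
```

This covers reading only; keeping a Python list and the linked cells in sync on modification is the harder part and is left open here.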