dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0

Broken annotation type #38

Closed GregSilverman closed 5 years ago

GregSilverman commented 5 years ago

Removing uima.tcas.DocumentAnnotation from _types property in the TypeSystem class in typesystem.py breaks load_cas_from_xmi.

This code was removed in a previous commit:

# DocumentAnnotation
t = self.create_type(name='uima.tcas.DocumentAnnotation', supertypeName='uima.tcas.Annotation')
self.add_feature(t, name='language', rangeTypeName='uima.cas.String')

jcklie commented 5 years ago

Can you give me a minimal CAS file that breaks?

GregSilverman commented 5 years ago

This is a full MIMIC (de-identified) CAS and the TypeSystem file for BioMedICUS.

Archive.zip

GregSilverman commented 5 years ago

Hi, I also had to add these types:

t = self.create_type(name='org.apache.uima.examples.SourceDocumentInformation', supertypeName='uima.tcas.Annotation')
self.add_feature(t, name='uri', rangeTypeName='uima.cas.String')
self.add_feature(t, name="offsetInSource", rangeTypeName="uima.cas.Integer")
self.add_feature(t, name="documentSize", rangeTypeName="uima.cas.Integer")
self.add_feature(t, name="lastSegment", rangeTypeName="uima.cas.Integer")

t = self.create_type(name='uima.noNamespace.ArtifactID', supertypeName='uima.tcas.Annotation')
self.add_feature(t, name='artifactID', rangeTypeName='uima.cas.Integer')

t = self.create_type(name='uima.noNamespace.ArtifactMetadata', supertypeName='uima.tcas.Annotation')
self.add_feature(t, name='key', rangeTypeName='uima.cas.String')
self.add_feature(t, name='value', rangeTypeName='uima.cas.String')

and now, I am getting the error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-458c411012c6> in <module>()
     10 # 528715 737-v1
     11 with open(dir_test + '528715.txt.xmi', 'rb') as f:
---> 12     cas = load_cas_from_xmi(f, typesystem=typesystem)

/anaconda3/lib/python3.7/site-packages/cassis/xmi.py in load_cas_from_xmi(source, typesystem)
     36         return deserializer.deserialize(BytesIO(source.encode("utf-8")), typesystem=typesystem)
     37     else:
---> 38         return deserializer.deserialize(source, typesystem=typesystem)
     39 
     40 

/anaconda3/lib/python3.7/site-packages/cassis/xmi.py in deserialize(self, source, typesystem)
     95                 annotation = annotations[member_id]
     96 
---> 97                 view.add_annotation(annotation)
     98 
     99         return cas

/anaconda3/lib/python3.7/site-packages/cassis/cas.py in add_annotation(self, annotation)
    167         annotation.xmiID = self._get_next_xmi_id()
    168         if isinstance(annotation, AnnotationBase):
--> 169             annotation.sofa = self.get_sofa().xmiID
    170 
    171         self._current_view.add_annotation_to_index(annotation)

AttributeError: 'uima_cas_FSArray' object has no attribute 'sofa'

Please advise.

reckart commented 5 years ago

org.apache.uima.examples.SourceDocumentInformation

The type system file in your archive does not declare this type - consequently cassis cannot know it. Note that this is also not a built-in UIMA type. So in order to have a complete type system definition, you need to add it to your type system definition XML file.

ArtifactID and ArtifactMetadata

These are "special" because they do not have a namespace/package declaration. This is actually not a good idea because it means that these classes cannot be used with the UIMA JCas interface. I would recommend that you move these into a proper namespace/package. See e.g. https://stackoverflow.com/a/283828/2511197

'uima_cas_FSArray' object has no attribute 'sofa'

FSArray inherits directly from TOP, not from AnnotationBase - it has no sofa feature. If cassis believes that FSArray inherits from AnnotationBase, it would seem to be a bug.
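
The hierarchy described above can be sketched in plain Python. This is a toy model for illustration only and not how cassis represents types: the classes merely mirror the names of the UIMA built-ins, and `assign_sofa` is a hypothetical helper.

```python
class TOP:
    """Root of the UIMA type hierarchy (toy stand-in)."""


class AnnotationBase(TOP):
    """Types anchored to a sofa carry a `sofa` feature."""

    def __init__(self):
        self.sofa = None


class Annotation(AnnotationBase):
    """Adds begin/end offsets on top of the sofa reference."""

    def __init__(self, begin=0, end=0):
        super().__init__()
        self.begin = begin
        self.end = end


class FSArray(TOP):
    """Array of feature structures; it has no sofa, begin or end."""

    def __init__(self, elements=()):
        self.elements = list(elements)


def assign_sofa(fs, sofa_id):
    """Hypothetical helper: set the sofa only on AnnotationBase descendants."""
    if isinstance(fs, AnnotationBase):
        fs.sofa = sofa_id
    return fs


cue = assign_sofa(Annotation(begin=0, end=4), sofa_id=1)
array = assign_sofa(FSArray([cue]), sofa_id=1)
print(cue.sofa, hasattr(array, "sofa"))
```

With a hierarchy like this, the `isinstance` guard skips the array, so no `sofa` attribute is ever touched on it.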

GregSilverman commented 5 years ago

@reckart Thanks for the reply. I just added the missing type as per your recommendation.

Regarding the types with no namespace, I have no control over the source code for these - we're using four different NLP annotators to compare the system annotations for various tasks. We just need to extract specific annotations.

And, regarding the last error, yes, there should be no sofa feature, which is why the error seemed strange. So, it would seem to be a bug.

jcklie commented 5 years ago

@GregSilverman Thank you for the report and using dkpro-cassis! I will look into the errors tomorrow and then report back.

GregSilverman commented 5 years ago

@Rentier, it seems very promising for our use case. And while I have a background in JVM languages, I have become very lazy and would prefer to stay within the Python ecosystem.

GregSilverman commented 5 years ago

Figured out the issue with FSArray and the sofa object: these clinical NLP pipelines define FSArray a bit differently (they also have start and end features for annotated data). I am probably going to fork this project going forward and put all changes there.

reckart commented 5 years ago

@GregSilverman uima.cas.FSArray is a feature and inheritance final type in UIMA. It can neither be subclassed nor can additional features be added to it. It seems that something is very odd in that data you have.

reckart commented 5 years ago

@GregSilverman Can you give us a pointer to the source of the data which uses FSArray with begin/end/sofa features?

reckart commented 5 years ago

@GregSilverman In your XMI file, I also don't see FSArrays having begin/end/sofa features:

<cas:FSArray xmi:id="40465" elements="40456"/>

The elements of the array might be annotations and have begin/end/sofa, but not the FSArray itself.

So I'd say, it is more likely a bug in cassis that FSArrays are not properly interpreted if you see FSArray problems with the file you provided.
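
One way to check this claim is to parse the element with the standard library and list its attributes. This is a minimal sketch: only the `cas:FSArray` line comes from the file discussed here, while the wrapping `xmi:XMI` root is a stand-in added so that the namespace prefixes resolve.

```python
import xml.etree.ElementTree as ET

XMI_NS = "http://www.omg.org/XMI"

# Minimal wrapper document; only the cas:FSArray line comes from the thread.
xmi = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
                 xmlns:cas="http:///uima/cas.ecore">
  <cas:FSArray xmi:id="40465" elements="40456"/>
</xmi:XMI>"""

root = ET.fromstring(xmi)
fs_array = root[0]

# ElementTree expands prefixed attribute names to "{namespace}local",
# so map the XMI namespace back to its usual prefix for readability.
attributes = {
    name.replace("{" + XMI_NS + "}", "xmi:"): value
    for name, value in fs_array.attrib.items()
}
missing = [name for name in ("sofa", "begin", "end") if name not in attributes]

print(attributes)
print(missing)
```

The array element carries only its id and the ids of its elements; `sofa`, `begin` and `end` all end up in `missing`.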

GregSilverman commented 5 years ago

@reckart, unless I'm interpreting this incorrectly, in the TypeSystem file I sent there are several annotation types, such as this one, that have a feature with a rangeType of FSArray:

        <typeDescription>
            <name>biomedicus.v2.Historical</name>
            <description>Automatically generated type from edu.umn.biomedicus.modification.Historical</description>
            <supertypeName>uima.tcas.Annotation</supertypeName>
            <features>
                <featureDescription>
                    <name>cueTerms</name>
                    <description>Automatically generated feature</description>
                    <rangeTypeName>uima.cas.FSArray</rangeTypeName>
                    <elementType>biomedicus.v2.ModificationCue</elementType>
                </featureDescription>
            </features>
        </typeDescription>

that have start and end features in the XMI file. So, I changed the supertype for FSArray to uima.tcas.Annotation.

GregSilverman commented 5 years ago

As for sofa, a simple search of the XMI file shows that the id of the FSArray you mention above (40465) is in the members list of this tag: <cas:View members="8 13 15" sofa="1"/>.

Again, I may be interpreting this incorrectly, but since this has a sofa feature and since the id for FSArray is in the XML tag, I assumed there was some inheritance going on.

Disclaimer: I am fairly new to UIMA, so be kind to me! ;-)

reckart commented 5 years ago

@GregSilverman in your example biomedicus.v2.Historical has a feature with the range type FSArray and the elements of that array are biomedicus.v2.ModificationCue.

In a programming language, one might write that approximately as:

package biomedicus.v2;

class ModificationCue extends Annotation {
}

class Historical extends Annotation {
  FSArray<ModificationCue> cueTerms;
}

So here we have two annotation types: Historical and ModificationCue. The FSArray type itself is a built-in UIMA type which does not inherit from Annotation - it could be likened to types such as List or Array in programming languages. FSArray inherits from TOP, which is the root of the UIMA type hierarchy - that is roughly comparable to Object in some programming languages.

reckart commented 5 years ago

Again, I may be interpreting this incorrectly, but since this has a sofa feature and since the id for FSArray is in the XML tag, I assumed there was some inheritance going on.

What you see there is a reference from one feature structure to another. The type inheritance hierarchy cannot be determined by looking at the XMI file - you need to look at the type system descriptor XML file for the inheritance.
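
As an illustration, the supertype of a declared type can be read straight from the descriptor with the standard library. This is a sketch under assumptions: the embedded XML is a cut-down version of the Historical type quoted earlier in this thread, wrapped in the usual `typeSystemDescription`/`types` elements, and `read_types` is a hypothetical helper rather than part of cassis.

```python
import xml.etree.ElementTree as ET

# Cut-down descriptor: the Historical type is taken from the thread, with
# the description elements dropped for brevity.
descriptor = """<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <types>
    <typeDescription>
      <name>biomedicus.v2.Historical</name>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>cueTerms</name>
          <rangeTypeName>uima.cas.FSArray</rangeTypeName>
          <elementType>biomedicus.v2.ModificationCue</elementType>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>"""

NS = {"u": "http://uima.apache.org/resourceSpecifier"}


def read_types(xml_text):
    """Map each declared type to its supertype and its feature range types."""
    root = ET.fromstring(xml_text)
    types = {}
    for td in root.iterfind(".//u:typeDescription", NS):
        features = {
            fd.findtext("u:name", namespaces=NS): fd.findtext("u:rangeTypeName", namespaces=NS)
            for fd in td.iterfind("u:features/u:featureDescription", NS)
        }
        types[td.findtext("u:name", namespaces=NS)] = {
            "supertype": td.findtext("u:supertypeName", namespaces=NS),
            "features": features,
        }
    return types


types = read_types(descriptor)
print(types["biomedicus.v2.Historical"])
```

Here `uima.cas.FSArray` shows up only as the range type of the `cueTerms` feature; the supertype of Historical is `uima.tcas.Annotation`, and nothing in the descriptor redefines FSArray itself.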

jcklie commented 5 years ago

@GregSilverman I see three bugs in this issue; I will try to address them one by one (I edited your first post to keep track).

jcklie commented 5 years ago

@GregSilverman For me, the CAS you posted here loads with the most recent master. Can this issue then be closed?

GregSilverman commented 5 years ago

Yes, definitely. My local fix to the version installed via pip works and I can grab the latest commit of master later. Thanks!

GregSilverman commented 5 years ago

@reckart, yes, I figured that out later after I posted this last night.

GregSilverman commented 5 years ago

Hi there!

Revisiting this, I grabbed the latest commit and am trying to get this working on the enclosed CAS and type system file.

When I run this:

from cassis import *

typesystem_dir = '<path to typesystem file>'
dir_test = '<path to XMI>'

with open(typesystem_dir + 'TypeSystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

# add missing types
t = typesystem.create_type(name='org.apache.uima.examples.SourceDocumentInformation', supertypeName='uima.tcas.Annotation')
typesystem.add_feature(t, name='uri', rangeTypeName='uima.cas.String')
typesystem.add_feature(t, name="offsetInSource", rangeTypeName="uima.cas.Integer")
typesystem.add_feature(t, name="documentSize", rangeTypeName="uima.cas.Integer")
typesystem.add_feature(t, name="lastSegment", rangeTypeName="uima.cas.Integer")

t = typesystem.create_type(name="uima.tcas.DocumentAnnotation", supertypeName="uima.tcas.Annotation")
typesystem.add_feature(t, name="language", rangeTypeName="uima.cas.String")

t = typesystem.create_type(name='uima.noNamespace.ArtifactID', supertypeName='uima.tcas.Annotation')
typesystem.add_feature(t, name='artifactID', rangeTypeName='uima.cas.Integer')

t = typesystem.create_type(name='uima.noNamespace.ArtifactMetadata', supertypeName='uima.tcas.Annotation')
typesystem.add_feature(t, name='key', rangeTypeName='uima.cas.String')
typesystem.add_feature(t, name='value', rangeTypeName='uima.cas.String')

fname = '0313-v1.txt.xmi'
with open(dir_test + fname, 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

view = cas.get_view('_InitialView')

print([x for x in view.select_all()])

I get the error AttributeError: 'org_apache_ctakes_typesystem_type_structured_DocumentID' object has no attribute 'sofa' similar to the one above.

I thought this had been fixed as per this issue?

Thanks!

ctakes_example_error_sofa.zip

jcklie commented 5 years ago

@GregSilverman I did not get the error you describe in master. I did encounter an edge case caused by your type system redefining a feature called ontologyConceptArr; I think that you should not redefine that. I added code to handle this case, and it is already in master. With the latest master, your file loads for me.

GregSilverman commented 5 years ago

Thanks for looking at this @jcklie. Interesting. Perhaps the fact that I installed this using the pypi version was the issue? Anyway, I will try it later.

jcklie commented 5 years ago

@GregSilverman Yes, it is not on pypi but in cassis master. I am waiting for your go-ahead before I release the next version, as I do not want to release something that is broken for you (again).

GregSilverman commented 5 years ago

Got it... I'll need to deserialize some more CAS objects again very soon. For this last one just now, I just used the changes I had previously made locally, since I have a self-imposed deadline.

jcklie commented 5 years ago

For me this works in master and 0.2.0-rc1. I will close this now. Please open a new issue if this error still persists.