dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0

Implementing support for array features #39

Closed GregSilverman closed 5 years ago

GregSilverman commented 5 years ago

I got the BioMedICUS annotations working with cassis.

Now, on to MetaMap. Its output has an odd way of representing arrays. For example, StringArray is represented as

    <cas:StringArray xmi:id="1373">
        <elements>CSP</elements>
        <elements>LCH</elements>
        <elements>LCH_NW</elements>
        <elements>LNC</elements>
        <elements>MSH</elements>
        <elements>MTH</elements>
        <elements>NCI</elements>
        <elements>NCI_CDISC</elements>
        <elements>NCI_FDA</elements>
        <elements>NCI_NICHD</elements>
        <elements>SNMI</elements>
        <elements>SNOMEDCT_US</elements>
    </cas:StringArray>

Processing the XMI throws this error:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-1-94fafef6801b> in <module>()
     20 # 528715 737-v1
     21 with open(dir_test + '737-v1.txt.xmi', 'rb') as f:
---> 22     cas = load_cas_from_xmi(f, typesystem=typesystem)
     23     #print(cas.sofas)
     24     #print(dir(cas))

/anaconda3/lib/python3.7/site-packages/cassis/xmi.py in load_cas_from_xmi(source, typesystem)
     36         return deserializer.deserialize(BytesIO(source.encode("utf-8")), typesystem=typesystem)
     37     else:
---> 38         return deserializer.deserialize(source, typesystem=typesystem)
     39 
     40 

/anaconda3/lib/python3.7/site-packages/cassis/xmi.py in deserialize(self, source, typesystem)
     71                 views[proto_view.sofa] = proto_view
     72             else:
---> 73                 annotation = self._parse_annotation(typesystem, elem)
     74                 annotations[annotation.xmiID] = annotation
     75 

/anaconda3/lib/python3.7/site-packages/cassis/xmi.py in _parse_annotation(self, typesystem, elem)
    120         typename = elem.tag[9:].replace("/", ".").replace("ecore}", "")
    121 
--> 122         AnnotationType = typesystem.get_type(typename)
    123         attributes = dict(elem.attrib)
    124 

/anaconda3/lib/python3.7/site-packages/cassis/typesystem.py in get_type(self, typename)
    336             return self._types[typename]
    337         else:
--> 338             raise Exception("Type with name [{0}] not found!".format(typename))
    339 
    340     def get_types(self) -> Iterator[Type]:

Exception: Type with name [] not found!

Any suggestions on how to deal with this? Again, the NLP annotator is not in our control, so the data are what they are, for good or bad.
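
A likely reading of the empty type name, sketched under the assumption that the deserializer treats every element it encounters as an annotation: the traceback shows the type name being derived with `elem.tag[9:]`, which strips the 9-character `{http:///` prefix. A child tag such as `elements` carries no namespace and is only 8 characters long, so the slice leaves an empty string. The helper name `type_name_from_tag` is made up for illustration.

```python
def type_name_from_tag(tag: str) -> str:
    """Mirror the string handling shown in the traceback (xmi.py, line 120)."""
    return tag[9:].replace("/", ".").replace("ecore}", "")


# A namespaced feature structure tag, as the XML parser reports it.
namespaced = type_name_from_tag("{http:///uima/cas.ecore}StringArray")

# A nested array element has no namespace, so nothing is left after the slice.
child = type_name_from_tag("elements")

print(repr(namespaced))  # 'uima.cas.StringArray'
print(repr(child))       # ''
```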

Thanks!

GregSilverman commented 5 years ago

Here is the XMI and TypeSystem files: metamap_out.zip

GregSilverman commented 5 years ago

I could remap all the elements, which for now would be fine, but eventually we will be processing about 50 million free text medical record notes, so I'd really like to implement something to handle this.

I'll do the work if you let me know how to handle this exception (the other two annotator engines look similar to BioMedICUS, so they should be fine).

There is one other annotation type that is like this, too, in the MetaMap XMI:

    <ts2:Candidate begin="0" concept="Admission date" cui="C1302393" end="14" head="true" matchMap="65 74" overmatch="false" preferred="Date of admission" score="-1000" sofa="1" spans="99" xmi:id="50">
        <sources>LNC</sources>
        <sources>MTH</sources>
        <sources>SNOMEDCT_US</sources>
        <semanticTypes>tmco</semanticTypes>
        <matchedwords>admission</matchedwords>
        <matchedwords>date</matchedwords>
    </ts2:Candidate>

Thanks!

jcklie commented 5 years ago

I think the problem is that I have not implemented array parsing yet. I will try to look into it today.

GregSilverman commented 5 years ago

Great, I thought it was that simple.

GregSilverman commented 5 years ago

@jcklie, any word on this?

If you have a method in mind for implementing this, please let me know what it is so that I can proceed with it. I saw that you are at a conference next week, so I realize you are very busy. Unfortunately, I need this very soon, since we have a pending publication submission deadline, so I will gladly work on implementing this.

The other option would be for me to manually parse out the XML and deal with it, which I would prefer to not do, although I have done it before.

jcklie commented 5 years ago

You would need to change the XML iterparse loop in CasXmiDeserializer:deserialize to also include start events. If you get a start event while you are already looking at an annotation, it has to be an array element. You accumulate these until you see the end of the current annotation; when you get its end event, you can call self._parse_annotation(typesystem, elem) and also parse the accumulated elements. I think you would have to accumulate them in a dictionary keyed by tag name. I hope this explanation makes sense. You can always go back to Java.
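
The approach described above could be sketched roughly as follows. This is only an illustration: it uses the standard library's `xml.etree.ElementTree` rather than the parser cassis uses, it tracks nesting depth instead of calling `_parse_annotation`, and the `ts2` namespace URI, the sample document and the function name `collect_feature_structures` are all assumptions.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the ts2 namespace URI is invented for this sketch.
XMI = b"""<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:ts2="http:///org/example/ts2.ecore">
  <ts2:Candidate xmi:id="50" cui="C1302393">
    <sources>LNC</sources>
    <sources>MTH</sources>
    <semanticTypes>tmco</semanticTypes>
  </ts2:Candidate>
</xmi:XMI>"""


def collect_feature_structures(source):
    """Return (tag, attributes, array_features) for each child of the root."""
    results = []
    depth = 0        # 1 = root, 2 = feature structure, 3 = array element
    arrays = {}      # tag name -> list of element texts
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        # From here on we are handling an "end" event.
        if depth == 3:
            # A nested element closed: accumulate its text under its tag.
            arrays.setdefault(elem.tag, []).append(elem.text)
        elif depth == 2:
            # The enclosing feature structure closed: emit it with its arrays.
            results.append((elem.tag, dict(elem.attrib), arrays))
            arrays = {}
            elem.clear()  # free memory only after everything was copied
        depth -= 1
    return results


for tag, attrib, arrays in collect_feature_structures(io.BytesIO(XMI)):
    print(tag, attrib, arrays)
```

Reading the text on the child's end event matters here, because the parser only guarantees that `elem.text` is complete once the element has closed.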

GregSilverman commented 5 years ago

Yes it does make sense. I was thinking a list of dictionaries would be the way to do it.

I just glanced at the code and while not completely straightforward, it seems doable. I'll have to read up on the etree xml parse stuff (I've used it before, but it's been a while).

jcklie commented 5 years ago

I think it is a dictionary of lists. I also recommend adding a unit test as early as possible.
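
To make the difference in shape concrete, here is a small sketch of the dictionary-of-lists form together with an early test for it. The helper `group_by_tag` and the test function are hypothetical names, and the input pairs are taken from the Candidate example earlier in the thread.

```python
from collections import defaultdict


def group_by_tag(pairs):
    """Group (tag, text) pairs into one list of texts per tag."""
    grouped = defaultdict(list)
    for tag, text in pairs:
        grouped[tag].append(text)
    return dict(grouped)


pairs = [
    ("sources", "LNC"),
    ("sources", "MTH"),
    ("semanticTypes", "tmco"),
    ("matchedwords", "admission"),
    ("matchedwords", "date"),
]

# Dictionary of lists: one entry per feature, values kept in document order.
features = group_by_tag(pairs)

# A list of dictionaries would instead keep one small dict per element,
# so looking up a whole feature would require scanning the list.
as_list_of_dicts = [{tag: text} for tag, text in pairs]


def test_group_by_tag():
    assert group_by_tag(pairs)["sources"] == ["LNC", "MTH"]
    assert group_by_tag([]) == {}


test_group_by_tag()
```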

GregSilverman commented 5 years ago

Got it. I just looked at the start and end events for all the types. This seems quite doable. I can hopefully get this done in the next few days.

GregSilverman commented 5 years ago

This is what I have so far:

test = etree.iterparse(source, events=("start", "end"))
array_elements = dict()
i = 0
for event, elem in test:
    # print(dir(elem))
    if elem.tag == TAG_XMI:
        # Ignore the closing 'xmi:XMI' tag
        pass
    elif elem.tag == TAG_CAS_NULL:
        pass
    elif elem.tag == TAG_CAS_SOFA:
        sofa = self._parse_sofa(elem)
        sofas.append(sofa)
    elif elem.tag == TAG_CAS_VIEW:
        proto_view = self._parse_view(elem)
        views[proto_view.sofa] = proto_view
    if elem.tag not in [TAG_XMI, TAG_CAS_NULL, TAG_CAS_SOFA, TAG_CAS_VIEW]:
        # assume annotation with array elements has children
        if event == "start" and elem.getchildren():
            array_elements[elem.tag] = elem.getchildren()
            i += 1
        # this is an annotation end tag
        elif event == "end" and elem.text is None:
            if i > 0:
                print(array_elements)
            i = 0

    self._clear_elem(elem)

I'll work on adding the _parse_annotation bit tomorrow and add some tests.

GregSilverman commented 5 years ago

@jcklie , I was able to grab these, but then ran into a whole slew of problems with other feature structures not being read in. Given that time is of the essence, I am going to go the "back to Java" route, which is rather disappointing, but that's life.

Here's the code that I used. I'm sure there are better ways of identifying annotations, but this did work:


elif elem.tag not in [TAG_XMI, TAG_CAS_NULL, TAG_CAS_SOFA, TAG_CAS_VIEW]:
    # assume annotation with array element has children
    if event == "start" and elem.getchildren():
        assert elem.text is not None
        # print(elem.getchildren())
        array_elements[elem.tag] = elem.getchildren()
        has_children = True
    # this is an annotation end tag
    elif event == "end" and '{' in elem.tag:  # and elem.text is None:
        if has_children:
            print('array:', array_elements)
            # annotation = self._parse_annotation(typesystem, array_elements)
            has_children = False
        else:
            print('na:', elem)
            # annotation = self._parse_annotation(typesystem, elem)

GregSilverman commented 5 years ago

So, I figured out that using the start event was somehow screwing up the CAS structure in the XMI as it was iterated over, which is why I was getting spurious results from the above code once I added in the _parse_annotation method.

So, instead of using start and end events, I decided to use the getparent method together with the text and tag attributes to determine whether something is an annotation or a child of an annotation, as per:

parent = ''
for event, elem in context:
    ...

    else:
        # test for parent/child
        if elem.text and elem.getparent():
            parent = elem.getparent()
            print(elem.tag, elem.text, type(parent.tag), parent.tag)
        # skip parent/child elements 
        if elem.text and parent:
            pass
        else:
            annotation = self._parse_annotation(typesystem, elem)
            annotations[annotation.xmiID] = annotation

I will now need to figure out what to do with the accumulated children of an annotation, but that hopefully will be a bit more straightforward (append these as nested default dictionaries within the annotation, perhaps?).

Anyway, this is a very enlightening exercise and much more fruitful and enjoyable than what I was trying to get done in Java yesterday (since our type systems did not have the dkpro metadata type, this was leading me down a rabbit hole of errors).

GregSilverman commented 5 years ago

@jcklie, I've now accumulated the features for each annotation with an array, where each annotation with an array of features is added to a dictionary. The next step is to figure out how to deal with the parsing bit.

Anyway, this is what I have so far:

parent = ''
has_parent = False
elem_array = []
ann = {}

for event, elem in context:

    ...

    else:

        # nested array of features
        if elem.text and elem.getparent() and '{' not in elem.tag:
            parent = elem.getparent()
            elem_array.append(elem)

            # print('a:', elem.tag, elem.text, type(parent.tag), parent.tag)
            has_parent = True

        # end of annotation with nested array
        if event == "end" and '{' in elem.tag and has_parent:
            assert elem.text
            # print('test:', elem_array, elem.tag)

            # create dictionary of features keyed by parent annotation
            ann[elem] = elem_array
            # TODO: figure this out!
            annotation = self._parse_annotation(typesystem, ann[elem])
            annotations[annotation.xmiID] = annotation

            # clear
            ann.clear()
            elem_array.clear()
            has_parent = False

        # annotation with no nested array
        elif event == 'end' and not has_parent:
            annotation = self._parse_annotation(typesystem, elem)
            annotations[annotation.xmiID] = annotation

If you read this and can shoot me any ideas you have, at your convenience, that would be great. In the meantime, I will just do some experimentation. Much appreciated!

GregSilverman commented 5 years ago

Oh, and I really don't like my test for the end of an annotation using {, but I can't see any other way to do this.

GregSilverman commented 5 years ago

I'm now not sure what to do after this:

def _parse_annotation(self, typesystem: TypeSystem, elem):
    # Strip the http prefix, replace / with ., remove the ecore part
    # TODO: Error checking

    x = ''

    # dictionary or element object
    if isinstance(elem, dict):
        for key, value in elem.items():
            x = key
    else:
        x = elem

    typename = x.tag[9:].replace("/", ".").replace("ecore}", "")
    print('TYPE:', typename, x)

I'll play around with this and reassemble the dictionary with the renamed key to see how passing around an annotation as a dictionary behaves.

GregSilverman commented 5 years ago

I have to change the data structure that I am passing to _parse_annotation because of a problem with building a list from elem objects. If I read elem.text for a feature before appending elem to a list, the text is displayed, but once I have appended elem to a list of features and then iterate through the list, as in

for e in elem_list:
    print(e.text)

the value for the text is gone. It must be some issue with storing it in memory. Anyway, I am close to a solution.
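
One plausible explanation, assuming the `self._clear_elem(elem)` call in the loop ends up calling `elem.clear()`: clearing an element resets its text, so element objects stored in a list lose their content once the loop moves on, while strings copied out beforehand survive. The sketch below reproduces that effect with the standard library's `xml.etree.ElementTree`; the sample XML and the function name `collect` are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

XMI = b"<root><item><sources>LNC</sources><sources>MTH</sources></item></root>"


def collect(source, keep_text):
    """Collect <sources> entries while clearing every element after use."""
    kept = []
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "sources":
            # Store either the string itself or the element object.
            kept.append(elem.text if keep_text else elem)
        elem.clear()  # resets attributes, children, text and tail
    # Read the text only after the loop has finished.
    return [k if keep_text else k.text for k in kept]


elements_kept = collect(io.BytesIO(XMI), keep_text=False)
strings_kept = collect(io.BytesIO(XMI), keep_text=True)

print(elements_kept)  # [None, None]
print(strings_kept)   # ['LNC', 'MTH']
```

Storing `elem.text` instead of `elem` avoids the problem without giving up the memory savings of clearing.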

GregSilverman commented 5 years ago

So, this is what I ended up doing in the deserialize method:

else:

    # nested array of features
    if elem.text and elem.getparent() and '{' not in elem.tag:

        # add new item to list as they accumulate
        elements[elem.tag].append(elem.text)

        # set flag for later processing of accumulated data
        has_parent = True

    # end of annotation with nested array
    if event == "end" and '{' in elem.tag and has_parent:
        assert elem.text

        # key is parent tag, value is feature defaultdict
        ann[elem] = elements

        annotation = self._parse_annotation(typesystem, ann)
        #annotations[annotation.xmiID] = annotation

        # clear
        elements.clear()
        ann.clear()
        elem_array.clear()
        has_parent = False

    # annotation with no nested array
    elif event == 'end' and not has_parent:

        annotation = self._parse_annotation(typesystem, elem)
        annotations[annotation.xmiID] = annotation

and it is now usable in the _parse_annotation method:

def _parse_annotation(self, typesystem: TypeSystem, elem):
    # Strip the http prefix, replace / with ., remove the ecore part
    # TODO: Error checking

    x = ''

    # iterate through dictionary of annotation with feature array
    if isinstance(elem, dict):
        for key, value in elem.items():
            x = key
            print('Annotation:', key.tag, key.attrib, value)
            for k, v in value.items():
                print('Features:', k, v)

    # normal annotation
    else:
        x = elem

    # Strip the http prefix, replace / with ., remove the ecore part
    typename = x.tag[9:].replace("/", ".").replace("ecore}", "")

    AnnotationType = typesystem.get_type(typename)
    attributes = dict(x.attrib)

Will continue the experiment to figure out what to do with the array of features...

GregSilverman commented 5 years ago

Yes! Got it! I will clean up the code and add some tests. Then I can submit a pull request with the changes.

jcklie commented 5 years ago

Your files work for me now in master. Can this be closed now?

GregSilverman commented 5 years ago

Thanks!