Closed GregSilverman closed 5 years ago
Here is the XMI and TypeSystem files: metamap_out.zip
I could remap all the elements, which for now would be fine, but eventually we will be processing about 50 million free text medical record notes, so I'd really like to implement something to handle this.
I'll do the work if you let me know how to handle this exception (the other two annotator engines look similar to BioMedICUS, so they should be fine).
There is one other annotation type that is like this, too, in the MetaMap XMI:
<ts2:Candidate begin="0" concept="Admission date" cui="C1302393" end="14" head="true" matchMap="65 74" overmatch="false" preferred="Date of admission" score="-1000" sofa="1" spans="99" xmi:id="50">
<sources>LNC</sources>
<sources>MTH</sources>
<sources>SNOMEDCT_US</sources>
<semanticTypes>tmco</semanticTypes>
<matchedwords>admission</matchedwords>
<matchedwords>date</matchedwords>
</ts2:Candidate>
Thanks!
I think the problem is that I have not implemented array parsing yet. I will try to look into it today.
Great, I thought it was that simple.
@jcklie, any word on this?
If you have a method in mind for how to implement this, please let me know what it is so that I can proceed with it. I saw that you are at a conference next week, so I realize you are very busy. Unfortunately, I need this very soon, since we have a pending publication submission deadline, so I will gladly work on implementing this.
The other option would be for me to manually parse out the XML and deal with it, which I would prefer to not do, although I have done it before.
You would need to change the XML iterparse loop in CasXmiDeserializer.deserialize to also include start events. When you get a start event while you are already looking at an annotation, it has to be an array element. You accumulate these until you see the end of the current annotation; when you get that end event, you can call self._parse_annotation(typesystem, elem) and also parse the accumulated elements. I think that you would have to accumulate in a dictionary by tag name. I hope that this explanation makes sense. You can always go back to Java.
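A minimal sketch of the approach described above, not the cassis implementation. It uses the standard library's xml.etree.ElementTree rather than the parser cassis uses, and the `ts2` namespace URI, the `Token` type, and the `collect_annotations` helper are made up for illustration. It assumes feature structures are direct children of the root and that array elements are nested exactly one level below them.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample; the ts2 namespace URI is a placeholder.
XMI = b"""<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:ts2="http:///org/example/ts.ecore">
  <ts2:Candidate xmi:id="50" begin="0" end="14" cui="C1302393">
    <sources>LNC</sources>
    <sources>MTH</sources>
    <semanticTypes>tmco</semanticTypes>
  </ts2:Candidate>
  <ts2:Token xmi:id="51" begin="0" end="9"/>
</xmi:XMI>"""


def collect_annotations(source):
    """Return a list of (tag, attributes, array_features) tuples."""
    results = []
    root = None
    current = None               # annotation element that is currently open
    arrays = defaultdict(list)   # array values of `current`, keyed by tag name
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem      # the enclosing xmi:XMI element
            elif current is None:
                current = elem   # a new top-level annotation opens
            # any other start event belongs to a child of `current`
        elif elem is root:
            continue             # closing xmi:XMI tag
        elif elem is current:
            # end of the annotation: emit it with everything accumulated
            results.append((elem.tag, dict(elem.attrib), dict(arrays)))
            current = None
            arrays = defaultdict(list)
            elem.clear()
        else:
            # end of an array element: its text is complete at this point
            arrays[elem.tag].append(elem.text)
    return results


annotations = collect_annotations(io.BytesIO(XMI))
for tag, attributes, array_features in annotations:
    print(tag, array_features)
```

Reading the child text on its end event matters: on a start event the parser may not have seen the element's text yet.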
Yes it does make sense. I was thinking a list of dictionaries would be the way to do it.
I just glanced at the code and while not completely straightforward, it seems doable. I'll have to read up on the etree xml parse stuff (I've used it before, but it's been a while).
I think it is a dictionary of lists. I also recommend adding a unit test as early as possible.
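To illustrate the "dictionary of lists" shape together with an early test, here is a small sketch based on the Candidate element posted earlier. The `array_features` helper and the `ts2` namespace URI are hypothetical, and the standard library's xml.etree.ElementTree stands in for whatever parser the real code uses.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The ts2 namespace URI is a placeholder, not the real MetaMap one.
XMI = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
    xmlns:ts2="http:///org/example/ts.ecore">
  <ts2:Candidate xmi:id="50" begin="0" end="14" cui="C1302393">
    <sources>LNC</sources>
    <sources>MTH</sources>
    <sources>SNOMEDCT_US</sources>
    <semanticTypes>tmco</semanticTypes>
    <matchedwords>admission</matchedwords>
    <matchedwords>date</matchedwords>
  </ts2:Candidate>
</xmi:XMI>"""


def array_features(annotation_elem):
    """Group the text of child elements by tag name, keeping document order."""
    features = defaultdict(list)
    for child in annotation_elem:
        features[child.tag].append(child.text)
    return dict(features)


def test_array_features():
    candidate = ET.fromstring(XMI)[0]
    assert array_features(candidate) == {
        "sources": ["LNC", "MTH", "SNOMEDCT_US"],
        "semanticTypes": ["tmco"],
        "matchedwords": ["admission", "date"],
    }


test_array_features()
```

One key per feature name with a list of values maps directly onto an array-valued feature, whereas a list of dictionaries would still have to be regrouped by feature name afterwards.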
Got it. I just looked at the start and end events for all the types. This seems quite doable. I can hopefully get this done in the next few days.
This is what I have so far:
test = etree.iterparse(source, events=("start", "end",))
array_elements = dict()
i = 0
for event, elem in test:
    #print(dir(elem))
    if elem.tag == TAG_XMI:
        # Ignore the closing 'xmi:XMI' tag
        pass
    elif elem.tag == TAG_CAS_NULL:
        pass
    elif elem.tag == TAG_CAS_SOFA:
        sofa = self._parse_sofa(elem)
        sofas.append(sofa)
    elif elem.tag == TAG_CAS_VIEW:
        proto_view = self._parse_view(elem)
        views[proto_view.sofa] = proto_view
    if elem.tag not in [TAG_XMI, TAG_CAS_NULL, TAG_CAS_SOFA, TAG_CAS_VIEW]:
        # assume annotation with array elements has children
        if event == "start" and elem.getchildren():
            array_elements[elem.tag] = elem.getchildren()
            i += 1
        # this is an annotation end tag
        elif event == "end" and elem.text is None:
            if i > 0:
                print(array_elements)
                i = 0
    self._clear_elem(elem)
I'll work on adding the _parse_annotation bit tomorrow and add some tests.
@jcklie, I was able to grab these, but then ran into a whole slew of problems with other feature structures not being read in. Given that time is of the essence, I am going to go the "back to Java" route, which is rather disappointing, but that's life.
Here's the code that I used. I'm sure there are better ways of identifying annotations, but this did work:
elif elem.tag not in [TAG_XMI, TAG_CAS_NULL, TAG_CAS_SOFA, TAG_CAS_VIEW]:
    # assume annotation with array element has children
    if event == "start" and elem.getchildren():
        assert elem.text is not None
        #print(elem.getchildren())
        array_elements[elem.tag] = elem.getchildren()
        has_children = True
    # this is an annotation end tag
    elif event == "end" and '{' in elem.tag: #and elem.text is None:
        if has_children:
            print('array:', array_elements)
            #annotation = self._parse_annotation(typesystem, array_elements)
            has_children = False
        else:
            print('na:', elem)
            # annotation = self._parse_annotation(typesystem, elem)
So, I figured out that using the start event was somehow screwing up the CAS structure in the XMI when it iterated through it, which is why I was getting spurious results using the above code when I added in the _parse_annotation method.
So, instead of using start and end events, I decided to use getparent(), text and tag to determine whether something was an annotation or a child of an annotation, as per:
parent = ''
for event, elem in context:
    ....
    else:
        # test for parent/child
        if elem.text and elem.getparent():
            parent = elem.getparent()
            print(elem.tag, elem.text, type(parent.tag), parent.tag)
        # skip parent/child elements
        if elem.text and parent:
            pass
        else:
            annotation = self._parse_annotation(typesystem, elem)
            annotations[annotation.xmiID] = annotation
I will now need to figure out what to do with the accumulated children of an annotation, but that hopefully will be a bit more straightforward (append these as nested default dictionaries within the annotation, perhaps?).
Anyway, this is a very enlightening exercise and much more fruitful and enjoyable than what I was trying to get done in Java yesterday (since our type systems did not have the dkpro metadata type, this was leading me down a rabbit hole of errors).
@jcklie, I've now accumulated the features for each annotation with an array, where each annotation with an array of features is added to a dictionary. The next step is to figure out how to deal with the parsing bit.
Anyway, this is what I have so far:
parent = ''
has_parent = False
elem_array = []
ann = {}
for event, elem in context:
    ...............
    else:
        # nested array of features
        if elem.text and elem.getparent() and '{' not in elem.tag:
            parent = elem.getparent()
            elem_array.append(elem)
            #print('a:', elem.tag, elem.text, type(parent.tag), parent.tag)
            has_parent = True
        # end of annotation with nested array
        if event == "end" and '{' in elem.tag and has_parent:
            assert elem.text
            #print('test:', elem_array, elem.tag)
            # create dictionary of features keyed by parent annotation
            ann[elem] = elem_array
            # TODO: figure this out!
            annotation = self._parse_annotation(typesystem, ann[elem])
            annotations[annotation.xmiID] = annotation
            # clear
            ann.clear()
            elem_array.clear()
            has_parent = False
        # annotation with no nested array
        elif event == 'end' and not has_parent:
            annotation = self._parse_annotation(typesystem, elem)
            annotations[annotation.xmiID] = annotation
If you read this and can shoot me any ideas you have, at your convenience, that would be great. In the meantime, I will just do some experimentation. Much appreciated!
Oh, and I really don't like my test for the end of an annotation using {, but I can't see any other way to do this.
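One possible alternative to checking for { in the tag is to track nesting depth with start and end events. This is only a sketch under assumptions: feature structures are direct children of the xmi:XMI root (depth 2) and array elements sit exactly one level deeper (depth 3). The `classify` helper, the `Token` type, and the namespace URI are invented for the example, and it uses the standard library parser.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the ts2 namespace URI is a placeholder.
XMI = b"""<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:ts2="http:///org/example/ts.ecore">
  <ts2:Candidate xmi:id="50" begin="0" end="14">
    <sources>LNC</sources>
    <sources>MTH</sources>
  </ts2:Candidate>
  <ts2:Token xmi:id="51" begin="0" end="9"/>
</xmi:XMI>"""


def classify(source):
    """Label each closed element by its nesting depth instead of its tag."""
    depth = 0
    kinds = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        if depth == 2:
            kinds.append(("annotation", elem.tag))
        elif depth == 3:
            kinds.append(("array element", elem.tag))
        depth -= 1   # leaving this element
    return kinds


kinds = classify(io.BytesIO(XMI))
for kind, tag in kinds:
    print(kind, tag)
```

Depth does not depend on whether a tag happens to carry a namespace, so it would keep working if an array element were namespaced.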
I'm now not sure what to do after this:
def _parse_annotation(self, typesystem: TypeSystem, elem):
    # Strip the http prefix, replace / with ., remove the ecore part
    # TODO: Error checking
    x = ''
    # dictionary or element object
    if isinstance(elem, dict):
        for key, value in elem.items():
            x = key
    else:
        x = elem
    typename = x.tag[9:].replace("/", ".").replace("ecore}", "")
    print('TYPE:', typename, x)
I'll play around with this and reassemble the dictionary with the renamed key to see how passing around an annotation as a dictionary behaves.
I have to change the data structure that I am passing to _parse_annotation due to an issue with making a list from the elem data type. If I read elem.text for a feature before appending elem to a list, it displays the text, but once I append elem to a list of features and then iterate through the list, as in
for e in elem_list:
    print(e.text)
the value for the text is gone. It must be some issue with storing it in memory. Anyway, I am close to a solution.
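One plausible explanation, assuming the loop still ends with self._clear_elem(elem) as in the earlier snippet and that this helper calls elem.clear(): a list holds references to the element objects, so clearing an element later also wipes the text seen through the list. The sketch below reproduces the effect with the standard library's xml.etree.ElementTree; the XML and variable names are made up.

```python
import io
import xml.etree.ElementTree as ET

XML = b"<root><item>first</item><item>second</item></root>"

kept_elems = []   # references to the element objects
kept_texts = []   # the strings themselves

for event, elem in ET.iterparse(io.BytesIO(XML), events=("end",)):
    if elem.tag == "item":
        kept_elems.append(elem)        # stores a reference, not a copy
        kept_texts.append(elem.text)   # copies out the value we care about
        elem.clear()                   # frees memory, but resets elem.text to None

print([e.text for e in kept_elems])    # [None, None]
print(kept_texts)                      # ['first', 'second']
```

Storing elem.text (or a dict of the attributes) instead of the element itself sidesteps the problem while still letting the parser release memory.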
So, this is what I ended up doing in the deserialize method:
else:
    # nested array of features
    if elem.text and elem.getparent() and '{' not in elem.tag:
        # add new item to list as they accumulate
        elements[elem.tag].append(elem.text)
        # set flag for later processing of accumulated data
        has_parent = True
    # end of annotation with nested array
    if event == "end" and '{' in elem.tag and has_parent:
        assert elem.text
        # key is parent tag, value is feature defaultdict
        ann[elem] = elements
        annotation = self._parse_annotation(typesystem, ann)
        #annotations[annotation.xmiID] = annotation
        # clear
        elements.clear()
        ann.clear()
        elem_array.clear()
        has_parent = False
    # annotation with no nested array
    elif event == 'end' and not has_parent:
        annotation = self._parse_annotation(typesystem, elem)
        annotations[annotation.xmiID] = annotation
and it is now usable in the _parse_annotation method:
def _parse_annotation(self, typesystem: TypeSystem, elem):
    # Strip the http prefix, replace / with ., remove the ecore part
    # TODO: Error checking
    x = ''
    # iterate through dictionary of annotation with feature array
    if isinstance(elem, dict):
        for key, value in elem.items():
            x = key
            print('Annotation:', key.tag, key.attrib, value)
            for k, v in value.items():
                print('Features:', k, v)
    # normal annotation
    else:
        x = elem
    # Strip the http prefix, replace / with ., remove the ecore part
    typename = x.tag[9:].replace("/", ".").replace("ecore}", "")
    AnnotationType = typesystem.get_type(typename)
    attributes = dict(x.attrib)
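As a worked example of the type-name line above, the same string operations are applied here to a made-up tag. The namespace URI and the `typename_from_tag` helper are placeholders for illustration only.

```python
def typename_from_tag(tag: str) -> str:
    # Same steps as in the snippet: drop the 9-character "{http:///" prefix,
    # turn "/" into ".", and remove the "ecore}" that closes the namespace.
    return tag[9:].replace("/", ".").replace("ecore}", "")


# Namespaced annotation tag, as the parser would report it (URI is hypothetical).
annotation_tag = "{http:///org/example/ts.ecore}Candidate"
print(typename_from_tag(annotation_tag))   # org.example.ts.Candidate

# An array child such as <sources> has no namespace, so slicing off nine
# characters leaves nothing useful.
print(repr(typename_from_tag("sources")))  # ''
```

This is one reason the array children cannot simply be fed through the same path as annotations: their tags do not encode a type name.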
Will continue the experiment to figure out what to do with the array of features...
Yes! Got it! I will clean up the code and add some tests. Then I can submit a pull request with the changes.
Your files work for me now in master. Can this be closed now?
Thanks!
I got the BioMedICUS annotations working with cassis.
Now, on to MetaMap. These have an odd way of representing arrays. For example, StringArray is represented as
When processing the XMI, this throws the error that:
Any suggestions on how to deal with this? Again, the NLP annotator is not in our control, so the data are what they are, for good or bad.
Thanks!