AnthonyMRios / pymetamap

Python wrapper for MetaMap

position returned by pymetamap #47

Open ShoRit opened 4 years ago

ShoRit commented 4 years ago

Hi Anthony,

Firstly, thank you for the wonderful implementation of metamap.

However, I was running into some issues while extracting the keywords using pymetamap.

For example, in the sentence "John had a huge heart-attack", could you please point me to how to extract the exact position of the keyword identified by pymetamap? It shows position = 17:12, but in several cases the reported character position is off by 1-2 characters.

Could you provide some insight into this?

AnthonyMRios commented 4 years ago

Hi ShoRit,

Can you provide some examples? This is going to be an issue with MetaMap, not with the wrapper. But, if you share an example, I can look into it.

yuliaoh commented 4 years ago

Metamap positions are not 0-indexed, that must be why it appears off

kaushikacharya commented 4 years ago

Quoting yuliaoh: "Metamap positions are not 0-indexed, that must be why it appears off"

@ShoRit @yuliaoh My understanding is that they are 0-indexed. The MMI output documentation states:

Positional Information – Bar separated list of positional information doubles showing StartPos, colon (:), and Length of each trigger identified in the Trigger Information field. StartPos begins at position zero (0) of the input text.

Here's the output using MetaMap 2020 release version:

echo "heart attack" | ./public_mm/bin/metamap -N -Q 4 -y --sldi

outputs

USER|MMI|5.18|Myocardial Infarction|C0027051|[dsyn]|["HEART ATTACK"-tx-1-"heart attack"-noun-0]|TX|0/12|

Another example:

echo "John had a huge heart attack" | ./public_mm/bin/metamap -N -Q 4 -y --sldi

outputs

USER|MMI|3.75|Myocardial Infarction|C0027051|[dsyn]|["HEART ATTACK"-tx-1-"heart attack"-noun-0]|TX|16/12|

As you can see, it's 0-indexed. I passed the same input arguments that pymetamap uses.
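To make the StartPos/Length convention concrete, here is a minimal Python sketch that slices the input text with a positional-information value. The helper name slice_by_pos_info is made up for illustration and is not part of MetaMap or pymetamap. It assumes the field holds one or more StartPos/Length pairs with 0-based starts, as in the raw MetaMap output above.

```python
import re

def slice_by_pos_info(text, pos_info):
    """Hypothetical helper: return the substrings addressed by StartPos/Length pairs.

    Assumes StartPos is 0-based, as in the raw MetaMap output.
    """
    pairs = re.findall(r"(\d+)/(\d+)", pos_info)
    return [text[int(start):int(start) + int(length)] for start, length in pairs]

print(slice_by_pos_info("heart attack", "0/12"))                   # ['heart attack']
print(slice_by_pos_info("John had a huge heart attack", "16/12"))  # ['heart attack']
```

Scanning for every digit/digit pair keeps the sketch usable when a concept has several triggers, without relying on a particular separator between the pairs.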

Then why is pymetamap's output 1-indexed?

The way pymetamap passes the input text is what makes the output appear 1-indexed. Take the example from the pymetamap README:

In [3]: sents = ['Heart Attack', 'John had a huge heart attack']

In [4]: concepts,error = mm.extract_concepts(sents,[1,2])

In [5]: for concept in concepts:
   ...:     print(concept)
   ...:
ConceptMMI(index='1', mm='MMI', score='5.18', preferred_name='Myocardial Infarction', cui='C0027051', semtypes='[dsyn]', trigger='["HEART ATTACK"-tx-1-"Heart Attack"-noun-0]', location='TX', pos_info='1/12', tree_codes='')
ConceptMMI(index='2', mm='MMI', score='3.75', preferred_name='Myocardial Infarction', cui='C0027051', semtypes='[dsyn]', trigger='["HEART ATTACK"-tx-1-"heart attack"-noun-0]', location='TX', pos_info='17/12', tree_codes='')

Looking into the code shows why pymetamap's output appears 1-indexed:

https://github.com/AnthonyMRios/pymetamap/blob/master/pymetamap/SubprocessBackend.py#L174

                        if input_text is None:
                            input_text = '{0!r}|{1!r}\n'.format(identifier, sentence).encode('utf8')
                        else:
                            input_text += '{0!r}|{1!r}\n'.format(identifier, sentence).encode('utf8')

Have a look at the difference between the two strings:

In [12]: '{0!r}'.format('Heart Attack')
Out[12]: "'Heart Attack'"

In [13]: '{0}'.format('Heart Attack')
Out[13]: 'Heart Attack'

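With !r, the sentence is wrapped in quotes, so the text MetaMap receives starts with a quote character and every reported position shifts by one. mgilson explains the !r conversion nicely in https://stackoverflow.com/a/38418132/282155, with an example and a reference to the Python documentation.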
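The size of the shift can be checked directly in Python. The sketch below is only an illustration: repr_shift is a made-up helper, and it compares string offsets in Python rather than calling MetaMap. It also shows that repr() can add more than one character when it has to escape something, which may be one reason some positions looked off by 2 rather than 1 (that part is a guess, not something confirmed in this thread).

```python
def repr_shift(sentence, phrase):
    """Hypothetical helper: how far `phrase` moves when `sentence` is formatted with !r."""
    plain = "{0}".format(sentence)     # the sentence as written
    quoted = "{0!r}".format(sentence)  # repr(): adds quotes and any escapes
    return quoted.find(phrase) - plain.find(phrase)

# The opening quote alone shifts everything by one character.
print(repr_shift("John had a huge heart attack", "heart attack"))      # 1

# With both quote kinds present, repr() also escapes the apostrophe.
print(repr_shift('He said "it\'s a heart attack"', "heart attack"))    # 2
```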

AnthonyMRios commented 4 years ago

Nice catch, @kaushikacharya. I will look into creating a fix for this; it seems reasonably easy.
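For anyone reading along, one possible shape for such a fix is to format each line without the !r conversion. This is only a sketch under assumptions, not the change actually made in pymetamap: build_sldi_input is a hypothetical helper, it assumes each sentence is a single line, and it does not address whatever reason the original code may have had for using repr().

```python
def build_sldi_input(sentences, ids):
    """Hypothetical helper: build 'id|sentence' lines without repr() quoting."""
    lines = ["{0}|{1}\n".format(identifier, sentence)
             for identifier, sentence in zip(ids, sentences)]
    return "".join(lines).encode("utf8")

payload = build_sldi_input(["Heart Attack", "John had a huge heart attack"], [1, 2])
print(payload.decode("utf8"), end="")
# 1|Heart Attack
# 2|John had a huge heart attack
```

Until something like this lands, subtracting 1 from the start value in pos_info is a workaround for plain sentences that need no escaping.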