Open ShoRit opened 4 years ago
Hi ShoRit,
Can you provide some examples? This is going to be an issue with MetaMap, not with the wrapper. But, if you share an example, I can look into it.
MetaMap positions are not 0-indexed; that must be why they appear off.
@ShoRit @yuliaoh My understanding is that it's 0-indexed. The MMI output documentation states:
Positional Information – Bar separated list of positional information doubles showing StartPos, colon (:), and Length of each trigger identified in the Trigger Information field. StartPos begins at position zero (0) of the input text.
Here's the output using MetaMap 2020 release version:
echo "heart attack" | ./public_mm/bin/metamap -N -Q 4 -y --sldi
outputs
USER|MMI|5.18|Myocardial Infarction|C0027051|[dsyn]|["HEART ATTACK"-tx-1-"heart attack"-noun-0]|TX|0/12|
Another example:
echo "John had a huge heart attack" | ./public_mm/bin/metamap -N -Q 4 -y --sldi
outputs
USER|MMI|3.75|Myocardial Infarction|C0027051|[dsyn]|["HEART ATTACK"-tx-1-"heart attack"-noun-0]|TX|16/12|
As you can see, it's 0-indexed. I passed the same input arguments that pymetamap uses.
The output appears 1-indexed because of the way pymetamap passes the input text. Taking the example from the pymetamap README:
In [3]: sents = ['Heart Attack', 'John had a huge heart attack']
In [4]: concepts,error = mm.extract_concepts(sents,[1,2])
In [5]: for concept in concepts:
   ...:     print(concept)
   ...:
ConceptMMI(index='1', mm='MMI', score='5.18', preferred_name='Myocardial Infarction', cui='C0027051', semtypes='[dsyn]', trigger='["HEART ATTACK"-tx-1-"Heart Attack"-noun-0]', location='TX', pos_info='1/12', tree_codes='')
ConceptMMI(index='2', mm='MMI', score='3.75', preferred_name='Myocardial Infarction', cui='C0027051', semtypes='[dsyn]', trigger='["HEART ATTACK"-tx-1-"heart attack"-noun-0]', location='TX', pos_info='17/12', tree_codes='')
Looking into the code to see why the positions become 1-indexed in pymetamap's output:
https://github.com/AnthonyMRios/pymetamap/blob/master/pymetamap/SubprocessBackend.py#L174
if input_text is None:
    input_text = '{0!r}|{1!r}\n'.format(identifier, sentence).encode('utf8')
else:
    input_text += '{0!r}|{1!r}\n'.format(identifier, sentence).encode('utf8')
Have a look at the difference between the two strings:
In [12]: '{0!r}'.format('Heart Attack')
Out[12]: "'Heart Attack'"
In [13]: '{0}'.format('Heart Attack')
Out[13]: 'Heart Attack'
This has been nicely explained by mgilson in https://stackoverflow.com/a/38418132/282155, using an example as well as the Python documentation.
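To make the effect on positions concrete, here is a minimal sketch (my own illustration, not code from pymetamap) comparing where the trigger starts in the plain sentence and in the `!r`-formatted one. The variable names are made up for the example.

```python
sentence = "John had a huge heart attack"
trigger = "heart attack"

# '{0}' leaves the text unchanged; '{0!r}' applies repr() and wraps it in quotes.
plain = "{0}".format(sentence)
quoted = "{0!r}".format(sentence)

plain_offset = plain.find(trigger)
quoted_offset = quoted.find(trigger)

print(plain_offset)   # 16, matching MetaMap's 16/12 for the raw input
print(quoted_offset)  # 17, matching pymetamap's 17/12
```

The extra leading quote character is what shifts every start position by one. If `repr()` also escapes characters inside the sentence (a newline, for instance, becomes the two characters `\n`), the shift could be larger for triggers that come after the escaped character, which may account for offsets of more than one.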
Nice catch @kaushikacharya. I will look into creating a fix for this; it seems reasonably easy.
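As a stopgap on the caller's side, something like the following could map `pos_info` back to Python slice indices. This is only a sketch: `spans_from_pos_info` is a hypothetical helper, not part of pymetamap, and the default `offset=1` assumes the only shift is the single leading quote added by `!r` (it will be wrong if `repr()` escaped anything earlier in the sentence).

```python
import re

def spans_from_pos_info(pos_info, offset=1):
    """Convert 'start/length' pairs into (start, end) slice indices.

    `offset` is subtracted from each start position; 1 is an assumption
    that compensates for the leading quote added by '{0!r}'.
    """
    spans = []
    # Pick up every 'start/length' pair, whatever separator sits between them.
    for start, length in re.findall(r"(\d+)/(\d+)", pos_info):
        begin = int(start) - offset
        spans.append((begin, begin + int(length)))
    return spans

sentence = "John had a huge heart attack"
spans = spans_from_pos_info("17/12")
print([sentence[b:e] for b, e in spans])
```

With the `pos_info` value from the README example (`17/12`), the recovered slice is `heart attack`. Passing `offset=0` would be the right choice for positions taken directly from MetaMap's own output.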
Hi Anthony,
Firstly, thank you for the wonderful implementation of metamap.
However, I am running into some issues while extracting keywords using pymetamap.
For example, for the sentence "John had a huge heart-attack", could you please show me how to extract the exact position of the keyword identified by pymetamap? It reports position = 17:12, but in several cases the exact character position is off by 1-2 characters.
Could you provide some insight into this?