LHNCBC / metamaplite

A near real-time named-entity recognizer
https://metamap.nlm.nih.gov/MetaMapLite.shtml

Fixed sentence offset errors #5

Closed amadanmath closed 5 years ago

amadanmath commented 5 years ago

The code currently assumes that sentDetect splits text into contiguous segments. This is not guaranteed: in particular, sentDetect skips over extra whitespace. For instance,

Sentence one. Sentence two.
Sentence three.  
Sentence four.  Sentence five.

The first three sentences are offset correctly. However, the extra spaces at the end of the line after "three." and the extra space after "four." are not included in the results of sentDetect, so the offsets computed for sentences four and five are incorrect. One consequence: because brat checks spans against their contents, extra whitespace in input files produces unusable (erroneous) .ann files.

Instead, I suggest using sentPosDetect, which returns exact spans where the sentences were found.
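To illustrate the difference, here is a minimal, self-contained sketch. It does not use OpenNLP or the MetaMapLite code: `toySpans` is a hypothetical stand-in for a span-returning detector such as sentPosDetect (which in OpenNLP returns `Span` objects with `getStart()` and `getEnd()`), and `naiveStarts` is my assumption of what "contiguous segments" bookkeeping looks like, namely adding each sentence length plus one separator character.

```java
import java.util.ArrayList;
import java.util.List;

class SentenceOffsets {
    /**
     * Hypothetical stand-in for a span-returning sentence detector.
     * Returns [start, end) character offsets of each sentence, with
     * surrounding whitespace excluded. A sentence ends at a period.
     */
    static List<int[]> toySpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            // Skip any run of whitespace between sentences.
            while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            if (i >= n) break;
            int start = i;
            // Scan to the next period and include it in the span.
            while (i < n && text.charAt(i) != '.') i++;
            if (i < n) i++;
            spans.add(new int[] {start, i});
        }
        return spans;
    }

    /**
     * Assumed naive bookkeeping: treat sentences as contiguous and
     * separated by exactly one character, so each start is the running
     * sum of previous sentence lengths plus one.
     */
    static List<Integer> naiveStarts(List<int[]> spans) {
        List<Integer> starts = new ArrayList<>();
        int offset = 0;
        for (int[] s : spans) {
            starts.add(offset);
            offset += (s[1] - s[0]) + 1;
        }
        return starts;
    }

    public static void main(String[] args) {
        // The example from above, including the extra whitespace.
        String text = "Sentence one. Sentence two.\n"
                    + "Sentence three.  \n"
                    + "Sentence four.  Sentence five.";
        List<int[]> spans = toySpans(text);
        List<Integer> naive = naiveStarts(spans);
        for (int k = 0; k < spans.size(); k++) {
            int[] s = spans.get(k);
            System.out.println("naive start=" + naive.get(k)
                + " span start=" + s[0]
                + " text=\"" + text.substring(s[0], s[1]) + "\"");
        }
    }
}
```

With this input the naive starts agree with the spans for the first three sentences (0, 14, 28) and then drift: 44 instead of 46 for sentence four, and 59 instead of 62 for sentence five. Taking the offsets directly from the spans keeps `text.substring(start, end)` equal to the sentence text, which is the kind of span-versus-content check brat performs.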