LHNCBC / metamaplite

A near real-time named-entity recognizer
https://metamap.nlm.nih.gov/MetaMapLite.shtml
Other
58 stars 14 forks source link

Overriding the equality check defined in PosTokenImpl for ERTokenImpl. #27

Closed stevenbedrick closed 1 year ago

stevenbedrick commented 1 year ago

While investigating a matching bug, I believe I found an issue with the equals() implementation of PosTokenImpl that was breaking things. Consider the following input text:

I saw your patient George H. Ataxia.

The word "Ataxia" is not being caught, because of the period after "H". Removing the period at the end of the sentence causes things to work as expected, and "Ataxia" is caught as an entity. The issue here lies with the fact that EntityLookup5.mapToTokenList's behavior depends on the indexOf() method to align the last token of a processed list of tokens with the corresponding index of the original list of tokens. indexOf(), in turn, depends on equals(), and under the current behavior this means PosTokenImpl.equals(), which only considers the text of the token. In this input, that is a period, and so it is mistakenly choosing the first period in the input sentence to be the new last token, essentially truncating the input part-way through the sentence.

This PR adds a new equality check method to ERTokenImpl that uses all elements of the token to decide equality. I have verified that it solves this issue, and from what I can tell it doesn't introduce any new problems- but I am not fully confident of that fact. I did a quick look around to see where else we were using indexOf etc. on ERToken instances, and as best as I can tell this shouldn't have broken anything- but obviously there's a lot I don't know about the codebase, so feel free to reject this PR if you've got different way you'd like to fix this problem. Thanks!