LHNCBC / metamaplite

A near real-time named-entity recognizer
https://metamap.nlm.nih.gov/MetaMapLite.shtml

EntityLookup4 not checking for phrase type as well as PoS #29

Closed. stevenbedrick closed this issue 1 year ago

stevenbedrick commented 1 year ago

While investigating a difference in behavior between EntityLookup4 and EntityLookup5, I traced it to the part of findLongestMatch() that checks whether the part of speech of the first token of tokenSubList is in allowedPartOfSpeechSet: https://github.com/lhncbc/metamaplite/blob/d3171e5d1deb2ceeeeeca9b757a85b8617e5c01b/src/main/java/gov/nih/nlm/nls/metamap/lite/EntityLookup5.java#LL393C12-L393C12

In EntityLookup5, the check will allow tokens that are not of an allowed PoS if the phrase under consideration is of a type listed in allowedPhraseTypeSet. The corresponding place in EntityLookup4 doesn't do this check, so certain things are getting bounced out from EntityLookup4 that EntityLookup5 allows: https://github.com/lhncbc/metamaplite/blob/d3171e5d1deb2ceeeeeca9b757a85b8617e5c01b/src/main/java/gov/nih/nlm/nls/metamap/lite/EntityLookup4.java#L352
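For readers following along, here is a minimal sketch of the two first-token checks as described above. It is not the actual MetaMapLite source: the class name, method names, and the tag and phrase-type values are hypothetical, and only allowedPartOfSpeechSet and allowedPhraseTypeSet are names taken from the issue.

```java
import java.util.Set;

// Hypothetical, simplified stand-in; not the real EntityLookup4/EntityLookup5 code.
class FirstTokenCheckSketch {

    // EntityLookup4-style check: only the first token's part of speech is consulted.
    static boolean posOnly(String firstTokenPos, Set<String> allowedPartOfSpeechSet) {
        return allowedPartOfSpeechSet.contains(firstTokenPos);
    }

    // EntityLookup5-style check: a span whose first token has a disallowed tag
    // still passes if the enclosing phrase is of an allowed type.
    static boolean posOrPhraseType(String firstTokenPos, String phraseType,
                                   Set<String> allowedPartOfSpeechSet,
                                   Set<String> allowedPhraseTypeSet) {
        return allowedPartOfSpeechSet.contains(firstTokenPos)
            || allowedPhraseTypeSet.contains(phraseType);
    }

    public static void main(String[] args) {
        // Assumed values, chosen only for illustration.
        Set<String> allowedPos = Set.of("NN", "NNS", "JJ");
        Set<String> allowedPhraseTypes = Set.of("NP", "VP");

        // A past-tense verb at the start of a verb phrase:
        System.out.println(posOnly("VBD", allowedPos));
        System.out.println(posOrPhraseType("VBD", "VP", allowedPos, allowedPhraseTypes));
    }
}
```

With these assumed sets, the same VBD-initial span is rejected by the first check and accepted by the second, which is the kind of divergence reported here.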

Is there a reason for this? If not, I will add in the corresponding check to EntityLookup4 and do a PR.

willjrogers commented 1 year ago

Originally, only the first token was checked, to prevent looking up verbs and determiners at the start of a term. Which useful terms are being rejected by the current conditional test? Would adding other parts of speech to the allowed part-of-speech set give you the desired behavior?

stevenbedrick commented 1 year ago

That makes a lot of sense! In my case it's verbs like "walk" and "delayed" which, Because Reasons, are terms of interest in our vocabulary. So adding VB and VBD to the list of allowed PoS tags does solve the problem. Mostly I was just surprised to see the difference in behavior between EntityLookup4 and 5. My impression (possibly incorrect!) is that the idea was that they were supposed to do basically the same thing, but without scoring in the case of 4?
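As a rough illustration of the workaround, the sketch below widens an allowed part-of-speech set with VB and VBD and filters candidate first tokens against it. The base tag set, the helper name, and the class are assumptions made for the example; how the set is actually configured in MetaMapLite is not shown in this thread.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; only the tags VB and VBD come from the discussion above.
class AllowedPosSketch {

    // Returns a copy of the base set with extra tags added; the base set is left untouched.
    static Set<String> withExtraTags(Set<String> base, String... extraTags) {
        Set<String> widened = new HashSet<>(base);
        for (String tag : extraTags) {
            widened.add(tag);
        }
        return widened;
    }

    public static void main(String[] args) {
        Set<String> base = Set.of("NN", "NNS", "JJ");   // assumed starting set
        Set<String> widened = withExtraTags(base, "VB", "VBD");

        // "walk" tagged VB and "delayed" tagged VBD as first tokens of a candidate term.
        System.out.println(base.contains("VB") + " " + widened.contains("VB"));
        System.out.println(base.contains("VBD") + " " + widened.contains("VBD"));
    }
}
```

Note that widening the set applies to every term, so other verb-initial spans would also reach the lookup stage.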

willjrogers commented 1 year ago

EntityLookup5 was designed to replicate the behavior of MetaMap’s minimal commitment parser (PhraseX) using OpenNLP’s noun parser. This was used primarily to support the output of MMI (MetaMap Indexing) format. The scoring was used for ranking MMI format output. The output format was intended as input for the Medical Text Indexer (MTI).

stevenbedrick commented 1 year ago

Aha! That certainly explains why EntityLookup4 isn't paying attention to the phrase type. 🤦‍♂️