LHNCBC / metamaplite

A near real-time named-entity recognizer
https://metamap.nlm.nih.gov/MetaMapLite.shtml

User-configurable non-alphanumeric characters #34

Closed stevenbedrick closed 1 year ago

stevenbedrick commented 1 year ago

Added a pair of configuration keys to EntityLookup5 to allow user-configurable non-alphanumeric characters in candidate matches.

The motivation is to be able to match certain abbreviations such as "%ile". The current behavior rejects any candidate match whose first token begins with a character that is neither alphanumeric nor a Greek letter. That is almost always what one wants, but there are a handful of abbreviations and jargon terms I need to match that start with punctuation. The default behavior should be unchanged, but if metamaplite.entitylookup5.considerNonAlphaTokens is true, candidates whose first token begins with a character listed in metamaplite.entitylookup5.additionalAllowedFirstChars are also allowed.
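For illustration, here is a minimal sketch of setting the two keys programmatically through a standard java.util.Properties object. The key names come from this PR. The class name NonAlphaConfigSketch, the helper buildConfig, and the value format of additionalAllowedFirstChars (assumed here to be a plain string of allowed characters) are assumptions for the example and may not match what EntityLookup5 actually expects.

```java
import java.util.Properties;

// Hypothetical helper class, not part of MetaMapLite.
public class NonAlphaConfigSketch {

    // Builds a Properties object containing the two keys introduced by this PR.
    // ASSUMPTION: the allowed first characters are written as a plain string,
    // one character per allowed symbol. The real value format may differ.
    public static Properties buildConfig() {
        Properties props = new Properties();
        props.setProperty("metamaplite.entitylookup5.considerNonAlphaTokens", "true");
        props.setProperty("metamaplite.entitylookup5.additionalAllowedFirstChars", "%");
        return props;
    }

    public static void main(String[] args) {
        // Print the resulting key/value pairs for inspection.
        buildConfig().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

With both keys left unset, the intent described above is that matching behaves exactly as before.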

In terms of implementation, since this is on a hot performance path, I tried to structure things so that the most common case is likely to exit early.
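The ordering idea can be sketched roughly as follows. This is not the code from the PR: the class FirstCharFilterSketch and its method names are hypothetical, and Character.isLetterOrDigit stands in for the existing alphanumeric-or-Greek-letter test. The point is only that the cheap, common check runs first and the set lookup is reached only for the rare punctuation-initial token with the feature enabled.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of an early-exit first-character filter.
public class FirstCharFilterSketch {
    private final boolean considerNonAlphaTokens;
    private final Set<Character> additionalAllowedFirstChars = new HashSet<>();

    // ASSUMPTION: allowedChars is a plain string, each character being one
    // additional permitted first character.
    public FirstCharFilterSketch(boolean considerNonAlphaTokens, String allowedChars) {
        this.considerNonAlphaTokens = considerNonAlphaTokens;
        for (char c : allowedChars.toCharArray()) {
            additionalAllowedFirstChars.add(c);
        }
    }

    // Returns true if a candidate whose first token is `token` may be considered.
    public boolean isAllowedFirstChar(String token) {
        if (token == null || token.isEmpty()) {
            return false;
        }
        char first = token.charAt(0);
        // Common case: letters (including Greek) and digits are accepted at once.
        if (Character.isLetterOrDigit(first)) {
            return true;
        }
        // Feature disabled: reject without consulting the set (default behavior).
        if (!considerNonAlphaTokens) {
            return false;
        }
        // Rare case: punctuation-initial token with the feature enabled.
        return additionalAllowedFirstChars.contains(first);
    }
}
```

For example, with the flag enabled and "%" as the only additional character, "%ile" would pass this filter while "#tag" would still be rejected.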