Avoid highly specific matches on general terms

kristinlindquist commented 1 year ago

This is a question and not a feature request or bug report, so let me know if I should put it elsewhere.

Does anyone have any general techniques to prevent a general concept from matching to a highly specific concept? As an example, "high-risk" is matched to the UMLS record for "unsafe sex".

Other examples:

'business combinations': ['Short-acting sulfonamide combinations', 'Salt solution combinations', 'topical antibiotic combinations', 'lung surfactant combinations']
'derivatives': ['Amine and/or amine derivative', 'Caranes', 'Pinanes']

Additionally, any ideas about filtering out matches to generic terms, even when the match is accurate? I can filter on semantic types, e.g. to say I am only interested in T121 (Pharmacologic Substance), but it will still match a bunch of terms to "Pharmaceutical Preparations" and the like. I can handle this in post-processing, perhaps with a TF-IDF approach or a "specificity score" based on where the entity sits in the UMLS tree. But I figured I'd ask if anyone has a better way.
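For illustration, the post-processing idea could look roughly like the sketch below. It keeps only candidates with an allowed semantic type, then drops concepts that show up in nearly every document, using an IDF-style score as a crude stand-in for specificity. The CUIs are placeholders, and `idf_scores`, `filter_specific`, and the `min_idf` cutoff are names invented for this example; with scispaCy, the (CUI, types) pairs would presumably be built from `ent._.kb_ents` and `linker.kb.cui_to_entity[cui].types`.

```python
import math
from collections import Counter

# Placeholder data: one list of (CUI, semantic types) per document.
# These identifiers are made up for the example, not real UMLS CUIs.
linked_docs = [
    [("C_GENERIC", {"T121"}), ("C_DRUG_A", {"T121", "T109"}), ("C_DISEASE", {"T047"})],
    [("C_GENERIC", {"T121"}), ("C_DRUG_B", {"T121", "T109"})],
    [("C_GENERIC", {"T121"}), ("C_DRUG_C", {"T121", "T109"})],
]


def idf_scores(docs):
    """Inverse document frequency per CUI: log(N / number of docs containing it)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each CUI at most once per document.
        doc_freq.update({cui for cui, _ in doc})
    return {cui: math.log(n_docs / count) for cui, count in doc_freq.items()}


def filter_specific(docs, allowed_types, min_idf):
    """Keep CUIs that have an allowed type and are rare enough across the corpus."""
    idf = idf_scores(docs)
    return [
        [cui for cui, types in doc if types & allowed_types and idf[cui] >= min_idf]
        for doc in docs
    ]


# min_idf=0.5 is an arbitrary cutoff chosen for this toy corpus.
kept = filter_specific(linked_docs, allowed_types={"T121"}, min_idf=0.5)
```

Here the concept that appears in every document gets an IDF of zero and is dropped, while the type filter removes the non-T121 candidate. On a real corpus the cutoff would need tuning, and a depth-in-hierarchy score could replace or complement the IDF term.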

dakinggg commented 1 year ago

The approaches you described are exactly what I would try first: TF-IDF or something similar, and the UMLS type tree. Filtering with a higher score threshold can also be helpful. Lastly, you could try training a further entity linking model to distinguish between candidates.
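As a rough sketch of the score-threshold suggestion, assuming the candidates are (CUI, score) pairs like the ones scispaCy exposes on `ent._.kb_ents`: the helper name `filter_by_score`, the 0.85 cutoff, and the placeholder CUIs are all assumptions made for this example.

```python
def filter_by_score(candidates, threshold=0.85):
    """Keep (cui, score) pairs at or above the threshold, best match first."""
    kept = [(cui, score) for cui, score in candidates if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Placeholder candidates for a single mention (not real UMLS CUIs).
candidates = [("C_A", 0.72), ("C_B", 0.91), ("C_C", 0.88)]
best = filter_by_score(candidates)
```

The linker also appears to accept a `threshold` option in its pipe config, which would apply the cutoff before candidates are attached to the entity. Filtering afterwards, as above, makes it easier to experiment with different cutoffs without re-running the pipeline.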

kristinlindquist commented 1 year ago

Cool, thank you @dakinggg. I will go ahead and close this!

allenai / scispacy, issue #483