allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0
1.66k stars 223 forks

Avoid highly specific matches on general terms #483

Closed (kristinlindquist closed this 1 year ago)

kristinlindquist commented 1 year ago

This is a question and not a feature request or bug report, so let me know if I should put it elsewhere.

Does anyone have any general techniques to prevent a general concept from matching to a highly specific concept? As an example, "high-risk" is matched to the UMLS record for "unsafe sex".

[Screenshot 2023-05-24 at 8:25:42 AM]

Other examples:

Additionally, any ideas about filtering out generic term matches, even when they are accurate? I can filter on semantic types, e.g. keep only T121 (Pharmacologic Substance), but the linker will still match a bunch of terms to generic concepts such as "Pharmaceutical Preparations". I could handle this in post-processing, perhaps with a TF-IDF approach or a "specificity score" based on where the entity sits in the UMLS tree, but I figured I'd ask whether anyone has a better way.
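To make the post-processing idea concrete, here is a minimal sketch of a type filter combined with a depth-based specificity cutoff. It assumes the linker output is available as `(cui, score)` pairs (the shape scispacy exposes on `ent._.kb_ents`). The concept IDs, the `TOY_TYPES` and `TOY_PARENT` tables, and the helper names are all hypothetical placeholders; in practice the types would come from the linker's knowledge base and the parent links from a UMLS hierarchy export.

```python
# Toy data: every ID and link below is made up for illustration,
# not copied from UMLS.
TOY_TYPES = {
    "CUI_DRUGS": {"T121"},        # a generic concept, e.g. "Pharmaceutical Preparations"
    "CUI_ASPIRIN": {"T121"},      # a specific drug
    "CUI_UNSAFE_SEX": {"T055"},   # illustrative non-drug type
}
TOY_PARENT = {
    "CUI_ASPIRIN": "CUI_NSAID",
    "CUI_NSAID": "CUI_DRUGS",
    "CUI_DRUGS": None,            # root of this toy hierarchy
    "CUI_UNSAFE_SEX": "CUI_BEHAVIOR",
    "CUI_BEHAVIOR": None,
}


def depth(cui, parent):
    """Count the hops from a concept up to its root (0 for a root)."""
    hops, seen = 0, set()
    while parent.get(cui) is not None and cui not in seen:
        seen.add(cui)             # guard against cycles in the parent map
        cui = parent[cui]
        hops += 1
    return hops


def filter_candidates(candidates, types, parent, allowed_types, min_depth):
    """Keep candidates that have an allowed type and sit deep enough in the tree."""
    kept = []
    for cui, score in candidates:
        if allowed_types.isdisjoint(types.get(cui, set())):
            continue              # wrong semantic type
        if depth(cui, parent) < min_depth:
            continue              # too close to the root, so too generic
        kept.append((cui, score))
    return kept


candidates = [("CUI_ASPIRIN", 0.91), ("CUI_DRUGS", 0.95), ("CUI_UNSAFE_SEX", 0.75)]
kept = filter_candidates(candidates, TOY_TYPES, TOY_PARENT, {"T121"}, min_depth=1)
print(kept)
```

Depth is only a rough proxy for specificity, since branches of a hierarchy vary in granularity, so `min_depth` would likely need tuning per semantic type.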

dakinggg commented 1 year ago

The approaches you described are exactly what I would try first: TF-IDF or something similar, and the UMLS type tree. Filtering with a higher score threshold can also be helpful. Lastly, you could try training a further entity linking model to distinguish between these cases.
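As a rough illustration of two of these suggestions, the sketch below combines a score cutoff with an IDF-style rarity check: a concept that gets linked in nearly every document is treated as generic. It assumes you have already collected the set of linked CUIs per document; the function names, the placeholder IDs, and the cutoff values are hypothetical. scispacy's linker also accepts a `threshold` option in its pipe config, so the score cutoff could be applied there instead of in post-processing.

```python
import math
from collections import Counter


def concept_idf(linked_docs):
    """Inverse document frequency per concept.

    linked_docs: list of sets, each holding the CUIs linked in one document.
    """
    n_docs = len(linked_docs)
    doc_freq = Counter(cui for doc in linked_docs for cui in set(doc))
    return {cui: math.log(n_docs / count) for cui, count in doc_freq.items()}


def filter_links(candidates, idf, min_score, min_idf):
    """Drop low-confidence links and links to concepts that appear everywhere."""
    kept = []
    for cui, score in candidates:
        # Concepts never seen in the reference corpus are treated as rare.
        rarity = idf.get(cui, float("inf"))
        if score >= min_score and rarity >= min_idf:
            kept.append((cui, score))
    return kept


# Made-up corpus: CUI_B is linked in every document, so its IDF is 0.
docs = [{"CUI_A", "CUI_B"}, {"CUI_B"}, {"CUI_B", "CUI_C"}, {"CUI_B"}]
idf = concept_idf(docs)

candidates = [("CUI_A", 0.92), ("CUI_B", 0.97), ("CUI_C", 0.72)]
kept = filter_links(candidates, idf, min_score=0.85, min_idf=0.5)
print(kept)
```

Here `CUI_B` is dropped for being ubiquitous despite its high score, and `CUI_C` is dropped for its low score. Both cutoffs are arbitrary and would need tuning on real data.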

kristinlindquist commented 1 year ago

Cool, thank you @dakinggg. I will go ahead and close this!