allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0
1.72k stars 229 forks

Unexpected abbreviation detection behaviour #441

Closed mpetruc closed 1 year ago

mpetruc commented 2 years ago

Let me start by thanking you for developing and releasing this tremendously helpful framework for biomedical text processing. I (and, I am sure, many others) am deeply grateful for all your effort and creativity.

I've been experimenting with the abbreviation_detector module, which has been working great, until I found this situation. Given the sentence:

"The thyroid hormone receptor (TR) inhibiting retinoic malate receptor (RMR) isoforms mediate ligand-independent repression."

abbreviation_detector finds the following abbreviations:

Abbreviation (token span)    Definition
TR (5, 6)                    thyroid hormone receptor
RMR (12, 13)                 retinoic malate receptor
receptor (3, 4)              receptor (RMR
receptor (10, 11)            receptor (RMR

So the word "receptor" is incorrectly identified as an abbreviation. This happens only if there is exactly one word between "(TR)" and "retinoic". If another token (word, space) is introduced before or after the separating word (in this case, "inhibiting"), abbreviation_detector works correctly and identifies only the two abbreviations (TR and RMR).
From my perspective this is totally unexpected. Could this be a bug in the algorithm, or maybe something I'm doing wrong? Thanks a lot, m

dakinggg commented 2 years ago

This appears to be an unfortunate (but fixable) edge case for the algorithm. Basically, it's matching the opening paren before TR against the closing paren after RMR and taking everything in between as a candidate long form; the word "receptor" before (TR) then happens to be an acceptable short form for the long form "receptor (RMR". Any longer distance between the two parens would have been filtered out, and if "receptor" didn't happen to match the other "receptor" it also wouldn't have gotten through, so this was right on the edge. We should probably check that the parens inside the candidate aren't unbalanced.

mpetruc commented 2 years ago

Thank you so much for the quick and thoughtful response. Is there anything I can do at this point to help? Filing a bug, maybe?

dakinggg commented 2 years ago

This serves as the bug report, thanks! I'd be happy to review a PR fixing it if you want to open one. Basically, the abbreviation detector should not pair up parentheses that are not matched to each other.
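
For anyone picking this up, the check described above could look roughly like the sketch below. It is not scispacy's actual code: `has_balanced_parens` is a hypothetical helper name, and the token list is a hand-written prefix of the sentence from this issue (chosen so the indices line up with the spans reported above), not real tokenizer output. The idea is simply to reject any candidate whose parentheses do not pair up within the candidate itself.

```python
from typing import Sequence


def has_balanced_parens(tokens: Sequence[str]) -> bool:
    """Return True if the "(" and ")" tokens in the span pair up in order."""
    depth = 0
    for token in tokens:
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
            if depth < 0:
                # A ")" appears before any "(": the span starts inside a bracket.
                return False
    # Any "(" still open means the span ends inside a bracket.
    return depth == 0


# Hand-tokenized prefix of the sentence from this issue (illustrative only).
tokens = [
    "The", "thyroid", "hormone", "receptor", "(", "TR", ")",
    "inhibiting", "retinoic", "malate", "receptor", "(", "RMR", ")",
    "isoforms",
]

# Candidate obtained by pairing the "(" before TR with the ")" after RMR.
bad_candidate = tokens[5:13]
# Candidates obtained from parentheses that really belong together.
good_candidates = [tokens[5:6], tokens[12:13]]

print(has_balanced_parens(bad_candidate))  # False
print(all(has_balanced_parens(c) for c in good_candidates))  # True
```

With a filter like this, the span "TR ) inhibiting retinoic malate receptor ( RMR" would be discarded before "receptor" is ever tried as its short form, while the genuine (TR) and (RMR) candidates pass through unchanged. Where exactly such a check belongs inside the detector is left open here.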