CogStack / MedCAT

Medical Concept Annotation Tool

CU-8695m5q4x: Fix issues detecting 1-token concepts #485

Open mart-r opened 1 week ago

mart-r commented 1 week ago

The underlying issue presented as models sometimes being unable to recognise a concept, even though the same model would recognise a misspelled version of that concept's name in the exact same context.

A few more details as to how I came across this issue.

Tested with a few different models:
- [1] The 2022/2023 GSTT/KCH trained model
- [2] The AU model (where I first saw the issue)
- [3] The 2024-06 GSTT-trained model

I ran with 2 separate "documents":
```
Patient was diagnosed with diabetes based on previous findings
```
And
```
Patient was diagnosed with diabetis based on previous findings
```
(Note the typo of diabetis instead of diabetes in the 2nd.)

Some models ([1] and [3]) were able to correctly identify the concept in the 2nd (i.e. typo'd) version, but not in the 1st (i.e. correctly typed) version. Other models ([2]) didn't identify the concept in either.
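For anyone wanting to repeat the comparison, here is a minimal sketch of how the two documents can be run side by side. It is not the script I used. The helper names (`detected_values`, `compare_documents`) are hypothetical, and it assumes the annotator returns the usual `get_entities`-style dictionary, i.e. `{"entities": {id: {"source_value": ...}}}`.

```python
from typing import Callable, Dict, List

# The two test "documents" from the description above.
DOCS = {
    "correct": "Patient was diagnosed with diabetes based on previous findings",
    "typo": "Patient was diagnosed with diabetis based on previous findings",
}


def detected_values(get_entities: Callable[[str], dict], text: str) -> List[str]:
    """Return the source text of every entity the annotator found in `text`.

    Assumes the output looks like {"entities": {id: {"source_value": ...}}}.
    """
    output = get_entities(text)
    return [ent["source_value"] for ent in output.get("entities", {}).values()]


def compare_documents(get_entities: Callable[[str], dict]) -> Dict[str, List[str]]:
    """Run the annotator over both documents and collect what was detected."""
    return {name: detected_values(get_entities, text) for name, text in DOCS.items()}


# Hypothetical usage with a real model pack (the path is a placeholder):
#   from medcat.cat import CAT
#   cat = CAT.load_model_pack("<path to model pack>")
#   print(compare_documents(cat.get_entities))
```

With an affected model, the expectation would be an empty list for `"correct"` and a non-empty list for `"typo"`.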

Turned out the issue was as follows:

This caused the following issue:

This PR provides the following fix:

tomolopolis commented 1 week ago

Task linked: CU-8695m5q4x Fix issue with models sometimes not being able to detect concepts unless there's a typo