The underlying issue presented in models sometimes being unable to recognise a concept where the same model would recognise an incorrectly typed name in the exact same context.
A few more details as to how I came onto this issue
Tested with a few different models:
- [1] The 2022/2023 GSTT/KCH trained model
- [2] The AU model (where I first saw the issue)
- [3] The 2024-06 GSTT-trained model
I ran with 2 separate "documents":
```
Patient was diagnosed with diabetes based on previous findings
```
And
```
Patient was diagnosed with diabetis based on previous findings
```
(Note the typo of diabetis instead of diabetes in the 2nd).
Some models ([1] and [3]) were able to correctly identify the 2nd (i.e typo'd) version, but not the 1st (i.e correctly typed version).
Other models ([2]) didn't identify either.
Turned out the issue was as follows:
The Vocab based NER was checking against 2 name versions for a token
The normalised token
The lower case token
The first of the name versions that was in cdb.snames (the subnames) was used going forward
So the normalised token will be checked first, if it is a subname, it'll be used
Only if the normalised token was not a subname would the lower case token be checked
This caused the following issue:
When looking at diabetes, the normalised name was diabete
And this name was in the CDB's subnames
As such, it was used as the name of the concept, rather than diabetes itself
This PR provides the following fix:
Checks if either of the name variants are in subnames (cdb.snames) or actual concept names (cdb.name2cuis)
If a name is in the concept names, it will be used
Otherwise the name that was in subnames will be used (if one of them was in subnames)
NOTE:
Currently preference is on the normalised name
I.e if both names are a concept name or a subname, the normalised name is used
But perhaps we should do this the other way around? I don't really know, but that was the preference before so I left it the same.
The underlying issue presented in models sometimes being unable to recognise a concept where the same model would recognise an incorrectly typed name in the exact same context.
A few more details as to how I came onto this issue
Tested with a few different models: - [1] The 2022/2023 GSTT/KCH trained model - [2] The AU model (where I first saw the issue) - [3] The 2024-06 GSTT-trained model I ran with 2 separate "documents": ``` Patient was diagnosed with diabetes based on previous findings ``` And ``` Patient was diagnosed with diabetis based on previous findings ``` (Note the typo of diabetis instead of diabetes in the 2nd). Some models ([1] and [3]) were able to correctly identify the 2nd (i.e typo'd) version, but not the 1st (i.e correctly typed version). Other models ([2]) didn't identify either.Turned out the issue was as follows:
cdb.snames
(the subnames) was used going forwardThis caused the following issue:
diabetes
, the normalised name wasdiabete
diabetes
itselfThis PR provides the following fix:
cdb.snames
) or actual concept names (cdb.name2cuis
)