CogStack / MedCATtutorials

General tutorials for the setup and use of MedCAT.

Questions About Example Part_3_2_Extracting_Diseases_from_Electronic_Health_Records.ipynb #23

Open JBarsotti opened 2 months ago

JBarsotti commented 2 months ago

This is an amazing module. Thanks for all your hard work.

I was working through the notebook notebooks/introductory/Part_3_2_Extracting_Diseases_from_Electronic_Health_Records.ipynb, and I have a couple of questions:

  1. Why do we need to retrain the model pack on our own personal data? I've tried it without retraining, and it still seems to work okay. Am I missing something?
  2. I have access to the entire UMLS database. I tried to use that as my model pack, but it doesn't seem to work with the code in notebooks/introductory/Part_3_2_Extracting_Diseases_from_Electronic_Health_Records.ipynb. Even on the simple example "This patient suffers from diabetes," it isn't able to recognize diabetes as an entity. When I run it on large clinical notes, many of the CUIs it returns do not map to preferred names; they are just listed as "Unknown." Any ideas?

Thanks for an awesome module! It really is great.

mart-r commented 2 months ago

Hi,

To answer your questions:

  1. Retraining is necessary if the data you use the model on differs in some way from the data it was originally trained on. For instance, different hospitals/trusts may have different conventions for describing similar situations. So if the base model works well enough for you, that's great - keep using it. But in general, to improve performance on a particular dataset, fine-tuning on that dataset - or a similar one - is needed (see the sketch after this list).
  2. By "the entire UMLS database" do you perhaps mean the full UMLS model distributed in the MedCAT README? I will assume that's what you meant. Unfortunately, the models we provide publicly are not guaranteed to be particularly performant. The full UMLS model is an example model. While it was trained (in a self-supervised capacity) on MIMIC-III (which undoubtedly has many, many references to diabetes), its performance has not been validated. My best guess is that the model was unable to disambiguate the name and was thus unable to determine which concept was being referenced in the training data. UMLS is a massive ontology, and "diabetes" may refer to many different concepts, so due to the self-supervised nature of the training, the model never properly learned the name. But again, this is just speculation.
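
If it helps, here is a rough sketch of what that self-supervised fine-tuning step could look like (the model pack path, file name, and column name are placeholders, and I'm assuming the standard CAT.train interface - treat it as a sketch, not tested code):

import pandas as pd
from medcat.cat import CAT

# Load the base model pack (placeholder path).
cat = CAT.load_model_pack("medcat_model_pack.zip")

# Your own EHR notes; 'text' is an assumed column name.
notes = pd.read_csv("my_hospital_notes.csv")

# Self-supervised training over the raw note text.
cat.train(notes["text"].values)

# Save the fine-tuned model pack for later use.
cat.create_model_pack("fine_tuned_model_pack")
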
JBarsotti commented 2 months ago

Thank you for the fast reply! Your responses make sense. One other question:

If I wanted to include ICD-10 codes as part of a model, is there a way to do that using the prebuilt models, or do I need a new one?

mart-r commented 2 months ago

Some models have ICD-10 mappings baked into them, so you may be able to look up the CUIs in cat.cdb.addl_info['cui2icd10']. In fact, if you use CAT.get_entities, the default behaviour is to make use of the ICD-10 mappings embedded in the CDB (if they exist). A recognised entity in that case could look something like this:

{'pretty_name': 'Fever', 'cui': '386661006', 'type_ids': ['67667581'], 'types': ['finding'], 'source_value': 'fever', 'detected_name': 'fever', 'acc': 1.0, 'context_similarity': 1.0, 'start': 29, 'end': 34, 'icd10': ['R509', 'R508', 'R502', 'P819', 'P818', 'T670', 'O752', 'P810', 'O864'], 'ontologies': ['SNOMED-CT'], 'snomed': [], 'id': 2, 'meta_anns': {}}

If the ICD-10 mappings do not exist within a model pack, you would need to add them or map the SNOMED or UMLS terms yourself.
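
For reference, pulling the codes out might look roughly like this (the model pack path and sample text are just placeholders, and the exact contents of the 'icd10' field depend on the model pack):

from medcat.cat import CAT

cat = CAT.load_model_pack("model_pack.zip")

entities = cat.get_entities("The patient presented with a fever.")

for ent in entities["entities"].values():
    # 'icd10' is only populated if the CDB carries cui2icd10 mappings.
    print(ent["pretty_name"], ent["cui"], ent.get("icd10", []))

# Or query the mapping directly from the CDB's additional info:
icd10_map = cat.cdb.addl_info.get("cui2icd10", {})
print(icd10_map.get("386661006"))  # the Fever CUI from the example above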

JBarsotti commented 2 months ago

Thanks again for the reply! Is there a model out there that you would recommend that has ICD codes baked in?

mart-r commented 2 months ago

I don't know off the top of my head, but the SNOMED model is more likely to have ICD-10 mappings since we have built-in functionality for that within the SNOMED preprocessing script.
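
Once you have a candidate model pack, a quick way to check whether it carries the mappings might be something like this (the model pack path is a placeholder):

from medcat.cat import CAT

cat = CAT.load_model_pack("snomed_model_pack.zip")

# The CDB stores extra mappings in addl_info; cui2icd10 holds the ICD-10 codes.
icd10_map = cat.cdb.addl_info.get("cui2icd10", {})

if icd10_map:
    print(f"ICD-10 mappings present for {len(icd10_map)} concepts")
else:
    print("No ICD-10 mappings in this model pack")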