CogStack / MedCAT

Medical Concept Annotation Tool

MedCat fails to correctly detect enumerations of (negative) diagnoses #230

Closed Imipenem closed 2 years ago

Imipenem commented 2 years ago

Hey,

First of all, thanks for the great package.

I'm using medcat 1.2.8 and I noticed the following issue:

Example:

text = "Patient suffers from diabetes. Denies hypertension, psychosis and glaucoma"

# Let cat be the CAT object that has been trained and initialized using the model pack/example data from the docs
annotated_text = cat.get_entities(text)

This results in:

{'entities': {2: {'pretty_name': 'Diabetes',
   'cui': 'C0011847',
   'type_ids': ['T047'],
   'types': ['Disease or Syndrome'],
   'source_value': 'diabetes',
   'detected_name': 'diabetes',
   'acc': 0.6452550625169893,
   'context_similarity': 0.6452550625169893,
   'start': 20,
   'end': 28,
   'icd10': [],
   'ontologies': [],
   'snomed': [],
   'id': 2,
   'meta_anns': {'Status': {'value': 'Affirmed',
     'confidence': 0.999997079372406,
     'name': 'Status'}}},
  3: {'pretty_name': 'Hypertensive disease',
   'cui': 'C0020538',
   'type_ids': ['T047'],
   'types': ['Disease or Syndrome'],
   'source_value': 'hypertension',
   'detected_name': 'hypertension',
   'acc': 0.6790682188733697,
   'context_similarity': 0.6790682188733697,
   'start': 37,
   'end': 49,
   'icd10': [],
   'ontologies': [],
   'snomed': [],
   'id': 3,
   'meta_anns': {'Status': {'value': 'Other',
     'confidence': 0.9918639063835144,
     'name': 'Status'}}},
  4: {'pretty_name': 'Psychotic Disorders',
   'cui': 'C0033975',
   'type_ids': ['T048'],
   'types': ['Mental or Behavioral Dysfunction'],
   'source_value': 'psychosis',
   'detected_name': 'psychosis',
   'acc': 0.3484492297815132,
   'context_similarity': 0.3484492297815132,
   'start': 51,
   'end': 60,
   'icd10': [],
   'ontologies': [],
   'snomed': [],
   'id': 4,
   'meta_anns': {'Status': {'value': 'Affirmed',
     'confidence': 0.8026704788208008,
     'name': 'Status'}}},
  5: {'pretty_name': 'Glaucoma',
   'cui': 'C0017601',
   'type_ids': ['T047'],
   'types': ['Disease or Syndrome'],
   'source_value': 'glaucoma',
   'detected_name': 'glaucoma',
   'acc': 0.3833850208933218,
   'context_similarity': 0.3833850208933218,
   'start': 65,
   'end': 73,
   'icd10': [],
   'ontologies': [],
   'snomed': [],
   'id': 5,
   'meta_anns': {'Status': {'value': 'Affirmed',
     'confidence': 0.9999270439147949,
     'name': 'Status'}}}},
 'tokens': []}

As one can see, MedCAT correctly detects that diabetes is affirmed and that hypertension is not. However, the "Denies" context seems to get lost after the first item of the enumeration: psychosis and glaucoma are labeled "Affirmed", although they should also be "Other" (i.e. negated).
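
To make the mismatch easier to spot, the Status of each entity can be listed next to its source text. This is a minimal sketch that assumes only the dictionary layout shown in the output above; `status_by_entity` is a hypothetical helper, not part of MedCAT, and `sample` is a trimmed copy of that output.

```python
def status_by_entity(annotated, meta_name="Status"):
    """Return (source_value, meta value, confidence) per entity, in text order.

    Hypothetical helper: it relies only on the dict layout shown above.
    """
    rows = []
    for ent in sorted(annotated["entities"].values(), key=lambda e: e["start"]):
        meta = ent.get("meta_anns", {}).get(meta_name, {})
        rows.append((ent["source_value"], meta.get("value"), meta.get("confidence")))
    return rows


# Trimmed copy of the output above (only the fields used here).
sample = {
    "entities": {
        2: {"source_value": "diabetes", "start": 20,
            "meta_anns": {"Status": {"value": "Affirmed",
                                     "confidence": 0.999997079372406,
                                     "name": "Status"}}},
        3: {"source_value": "hypertension", "start": 37,
            "meta_anns": {"Status": {"value": "Other",
                                     "confidence": 0.9918639063835144,
                                     "name": "Status"}}},
        4: {"source_value": "psychosis", "start": 51,
            "meta_anns": {"Status": {"value": "Affirmed",
                                     "confidence": 0.8026704788208008,
                                     "name": "Status"}}},
        5: {"source_value": "glaucoma", "start": 65,
            "meta_anns": {"Status": {"value": "Affirmed",
                                     "confidence": 0.9999270439147949,
                                     "name": "Status"}}},
    },
    "tokens": [],
}

for name, value, confidence in status_by_entity(sample):
    print(name, value, confidence)
```

With the real output, `status_by_entity(annotated_text)` would be called instead of passing `sample`.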

Is this a known issue? Are there any approaches to solve it?

Many thanks in advance ;)

w-is-h commented 2 years ago

Hi @Imipenem, unfortunately the public model we have for negation (i.e. Status) is not the best, as it was trained on a relatively small dataset. I've tested the same example on one of our in-hospital models and everything works, but we cannot make it public as it contains confidential information.

I can only suggest that you train your own model for negation using MedCATtrainer, or wait until we publish one of the better models once we get permission for it (for now I'm not able to estimate when this could be).

Example of the output for your text with one of our internal models:

{'entities': {0: {'pretty_name': 'Diabetes mellitus (disorder)',
   'cui': '73211009',
   'type_ids': ['T-11'],
   'types': ['disorder'],
   'source_value': 'diabetes',
   'detected_name': 'diabete',
   'acc': 0.39457637369632725,
   'context_similarity': 0.39457637369632725,
   'start': 21,
   'end': 29,
   'icd10': [],
   'ontologies': ['SNOMED'],
   'snomed': [],
   'id': 0,
   'meta_anns': {'Presence': {'value': 'True',
     'confidence': 1.0,
     'name': 'Presence'},
    'Time': {'value': 'Recent',
     'confidence': 0.9901728630065918,
     'name': 'Time'},
    'Subject': {'value': 'Patient',
     'confidence': 0.973953902721405,
     'name': 'Subject'}}},
  1: {'pretty_name': 'Hypertensive disorder, systemic arterial (disorder)',
   'cui': '38341003',
   'type_ids': ['T-11'],
   'types': ['disorder'],
   'source_value': 'hypertension',
   'detected_name': 'hypertension',
   'acc': 0.5329114772595984,
   'context_similarity': 0.5329114772595984,
   'start': 38,
   'end': 50,
   'icd10': [],
   'ontologies': ['SNOMED'],
   'snomed': [],
   'id': 1,
   'meta_anns': {'Presence': {'value': 'False',
     'confidence': 1.0,
     'name': 'Presence'},
    'Time': {'value': 'Recent',
     'confidence': 0.998069167137146,
     'name': 'Time'},
    'Subject': {'value': 'Patient',
     'confidence': 0.9986799955368042,
     'name': 'Subject'}}},
  2: {'pretty_name': 'Psychotic disorder (disorder)',
   'cui': '69322001',
   'type_ids': ['T-11'],
   'types': ['disorder'],
   'source_value': 'psychosis',
   'detected_name': 'psychosis',
   'acc': 0.3700194746255875,
   'context_similarity': 0.3700194746255875,
   'start': 52,
   'end': 61,
   'icd10': [],
   'ontologies': ['SNOMED'],
   'snomed': [],
   'id': 2,
   'meta_anns': {'Presence': {'value': 'False',
     'confidence': 0.7612127065658569,
     'name': 'Presence'},
    'Time': {'value': 'Recent',
     'confidence': 0.9930446147918701,
     'name': 'Time'},
    'Subject': {'value': 'Patient',
     'confidence': 0.9984433650970459,
     'name': 'Subject'}}},
  3: {'pretty_name': 'Glaucoma (disorder)',
   'cui': '23986001',
   'type_ids': ['T-11'],
   'types': ['disorder'],
   'source_value': 'glaucoma',
   'detected_name': 'glaucoma',
   'acc': 0.7935367539525032,
   'context_similarity': 0.7935367539525032,
   'start': 66,
   'end': 74,
   'icd10': [],
   'ontologies': ['SNOMED'],
   'snomed': [],
   'id': 3,
   'meta_anns': {'Presence': {'value': 'False',
     'confidence': 0.9999976754188538,
     'name': 'Presence'},
    'Time': {'value': 'Recent',
     'confidence': 0.884349524974823,
     'name': 'Time'},
    'Subject': {'value': 'Patient',
     'confidence': 0.9915496706962585,
     'name': 'Subject'}}}},
 'tokens': []}
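
Note that the two outputs use different meta-annotation names: the public model reports `Status` (`Affirmed`/`Other`), while the internal model reports `Presence` (`True`/`False` as strings). A minimal sketch of reading both in a uniform way, assuming only the dict layouts shown in this thread; `is_affirmed` is a hypothetical helper, not part of MedCAT, and treating `Other` as "not affirmed" is an assumption based on the discussion above.

```python
def is_affirmed(entity):
    """Return True/False for an entity dict, or None if no known meta-annotation exists.

    Hypothetical helper: handles the 'Presence' and 'Status' layouts shown above.
    """
    meta = entity.get("meta_anns", {})
    if "Presence" in meta:
        # The internal model reports the strings 'True' / 'False'.
        return meta["Presence"]["value"] == "True"
    if "Status" in meta:
        # Assumption: anything other than 'Affirmed' counts as not affirmed.
        return meta["Status"]["value"] == "Affirmed"
    return None


# Trimmed copies of the 'psychosis' entity from the two outputs in this thread.
public_psychosis = {
    "source_value": "psychosis",
    "meta_anns": {"Status": {"value": "Affirmed",
                             "confidence": 0.8026704788208008,
                             "name": "Status"}},
}
internal_psychosis = {
    "source_value": "psychosis",
    "meta_anns": {"Presence": {"value": "False",
                               "confidence": 0.7612127065658569,
                               "name": "Presence"}},
}

print(is_affirmed(public_psychosis), is_affirmed(internal_psychosis))
```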

Imipenem commented 2 years ago

Thanks for your answer, I guess this is the way to go then.

I've read in the docs that, with access to UMLS or SNOMED CT, one can get access to the CDB and vocab for those.

Would this improve the results as well?

w-is-h commented 2 years ago

That will improve the results with respect to NER+L, because you will have all of UMLS/SNOMED, while the public NER+L models cover only a subset. But the meta models (Status) will stay the same.

Imipenem commented 2 years ago

Thanks for the clarification.