MIT-LCP / mimic-code

MIMIC Code Repository: Code shared by the research community for the MIMIC family of databases
https://mimic.mit.edu
MIT License

Missing social history makes automated medical coding challenging #1663

Open JoakimEdin opened 8 months ago

JoakimEdin commented 8 months ago

Description

Automated medical coding (also called medical code prediction) is a growing machine learning task that aims to predict medical codes from a discharge summary. MIMIC-IV has become a popular dataset for training and evaluating such models. However, there is an issue: because your de-identification algorithm removed the social history section, some annotated medical codes are impossible to predict from the text. For instance, the ICD-10 codes for smoking status (e.g., F17.210 and Z87.891) are often annotated in MIMIC-IV even though the discharge summary never mentions smoking, because that information was in the removed social history section.

As a consequence, models are trained on labels that cannot be predicted from the text, and they are penalized during evaluation whenever the necessary information would have been in the social history. This makes MIMIC-IV a noisier dataset for automated medical coding than MIMIC-III, which retains the social history.
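To make the mismatch concrete, here is a minimal sketch of how one might estimate how often a smoking-related code is annotated without any smoking mention in the note. The record layout, the keyword pattern, and the toy notes are all assumptions invented for illustration; they do not mirror the MIMIC-IV table schema, and a keyword match is only a rough proxy for "predictable from the text".

```python
import re

# Assumption: illustrative code set and keywords, not an exhaustive list.
SMOKING_CODES = {"F17.210", "Z87.891"}
SMOKING_PATTERN = re.compile(
    r"\b(smok\w*|tobacco|cigarette\w*|nicotine)\b", re.IGNORECASE
)


def unsupported_label_rate(records):
    """Fraction of smoking-coded notes whose text never mentions smoking.

    `records` is an iterable of (note_text, codes) pairs. This layout is
    hypothetical and chosen only to keep the example self-contained.
    """
    coded = [(text, codes) for text, codes in records if SMOKING_CODES & set(codes)]
    if not coded:
        return 0.0
    missing = sum(1 for text, _ in coded if not SMOKING_PATTERN.search(text))
    return missing / len(coded)


# Toy records invented for this example (not real MIMIC data).
records = [
    ("Admitted with COPD exacerbation. Treated with steroids.", ["J44.1", "F17.210"]),
    ("Former smoker, quit 10 years ago. Admitted for chest pain.", ["Z87.891", "I20.9"]),
    ("Elective knee replacement, uneventful course.", ["M17.11"]),
]

rate = unsupported_label_rate(records)
print(rate)  # 0.5: one of the two smoking-coded notes has no smoking mention
```

Running the same check on MIMIC-III and MIMIC-IV would give a rough comparison of how much label noise the missing section introduces for these codes.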

Is there a way to de-identify the discharge summaries without removing the social histories?

alistairewj commented 7 months ago

Thanks for raising this. I wasn't aware of it personally, and that's an unfortunate side effect. It's open for discussion, but I think removing the social history section remains useful for de-identification given the expansion of the dataset to ED patients. The approach could be improved by applying NER to the raw PHI note and allowing through only the medically relevant segments of the social history (alcohol use, smoking, drug use). It's unlikely we'll get to that any time soon though, sorry.
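As a rough illustration of the "allow through only the medically relevant segments" idea, the sketch below keeps only sentences of a social history that mention alcohol, smoking, or drug use. It is a plain keyword filter standing in for NER, not the MIMIC de-identification pipeline, and it is not a de-identification method on its own: a retained sentence could still contain PHI and would need the usual scrubbing. The function name, keyword list, and example text are hypothetical.

```python
import re

# Assumption: keyword list is an illustrative stand-in for a trained NER model.
RELEVANT = re.compile(
    r"\b(alcohol|etoh|drink\w*|smok\w*|tobacco|cigarette\w*|drug\w*|ivdu)\b",
    re.IGNORECASE,
)


def keep_relevant_sentences(social_history):
    """Return only the sentences that mention alcohol, smoking, or drug use."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", social_history.strip())
    return [s for s in sentences if RELEVANT.search(s)]


# Invented example text (not real patient data).
example = (
    "Lives with his wife in Boston. Works as a teacher. "
    "Smokes 1 pack per day. Denies alcohol or drug use."
)

kept = keep_relevant_sentences(example)
print(kept)  # ['Smokes 1 pack per day.', 'Denies alcohol or drug use.']
```

Filtering at the sentence level drops the residence and occupation details while keeping the statements that the smoking and substance-use codes depend on.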