AnEMIC: An Error-reduced MIMIC ICD Coding Benchmark

Automatic ICD coding benchmark based on the MIMIC dataset.
Please check our paper on EMNLP 2022 (demo track): AnEMIC: A Framework for Benchmarking ICD Coding Models

NOTE: 🚧 The repo is under active development. Please see below for available datasets/models.

Automatic diagnosis coding[^1] in clinical NLP is a task to predict the diagnoses and the procedures during a hospital stay given the summary of the stay (discharge summary). The labels of the task are mostly represented in ICD (international classification of disease) codes which are alpha-numeric codes widely adopted by hospitals in the US. The most popular database used in automatic diagnosis coding is the MIMIC-III dataset, but the preprocessing varies among the literature, and some of them are done incorrectly. Such inconsistency and error make it hard to compare different methods on automatic diagnosis coding and, arguably, results in incorrect evaluations of the methods.

This code repository aims to provide a standardized benchmark of automatic diagnosis coding with the MIMIC-III database. The benchmark encompasses all the procedures of ICD coding: dataset pre-processing, model training/evaluation, and interactive web demo.

We currently provide (items in parentheses are under development):

Four preset of preprocessed datasets: MIMIC-III full, top-50, full (old), top-50 (old), where we referred to (old) as the version of CAML[^2].
ICD coding models: CNN, CAML, MultiResCNN[^3], DCAN[^4], TransICD[^5], Fusion[^6], (LAAT)
Interactive demo

Preparation

Please put the MIMIC-III csv.gz files (v1.4) under datasets/mimic3/csv/. You can also create symbolic links pointing to the files.

Pre-processing

Please run the following command to generate the MIMIC-III top-50 dataset or generate other versions using the config files in configs/preprocessing.

$ python run_preprocessing.py --config_path configs/preprocessing/default/mimic3_50.yml

Training / Testing

Please run the following command to train, or resume training of, the CAML model on the MIMIC-III top-50 dataset. You can evaluate the model with --test options and use other config files under configs.

$ python run.py --config_path configs/caml/caml_mimic3_50.yml         # Train
$ python run.py --config_path configs/caml/caml_mimic3_50.yml --test  # Test

Training is logged through TensorBoard graph (located in the output dir under results/). Also, logging through text files is performed on pre-processing, training, and evaluation. Log files will be located under logs/.

Run demo

After you train a model, you can run an interactive demo app of it (CAML on MIMIC-III top-50, for example) by running

$ streamlit run app.py -- --config_path configs/demo/multi_mimic3_50.yml  # CAML, MultiResCNN, DCAN, Fusion on MIMIC-III top-50

You can write own config file specifying modules as same as in pre-processing and training

Results

MIMIC-III full

Model	macro AUC	micro AUC	macro F1	micro F1	P@8	P@15
CNN	0.835±0.001	0.974±0.000	0.034±0.001	0.420±0.006	0.619±0.002	0.474±0.004
CAML	0.893±0.002	0.985±0.000	0.056±0.006	0.506±0.006	0.704±0.001	0.555±0.001
MultiResCNN	0.912±0.004	0.987±0.000	0.078±0.005	0.555±0.004	0.741±0.002	0.589±0.002
DCAN	0.848±0.009	0.979±0.001	0.066±0.005	0.533±0.006	0.721±0.001	0.573±0.000
TransICD	0.886±0.010	0.983±0.002	0.058±0.001	0.497±0.001	0.666±0.000	0.524±0.001
Fusion	0.910±0.003	0.986±0.000	0.081±0.002	0.560±0.003	0.744±0.002	0.589±0.001

MIMIC-III top-50

Model	macro AUC	micro AUC	macro F1	micro F1	P@5
CNN	0.913±0.002	0.936±0.002	0.627±0.001	0.693±0.003	0.649±0.001
CAML	0.918±0.000	0.942±0.000	0.614±0.005	0.690±0.001	0.661±0.002
MultiResCNN	0.928±0.001	0.950±0.000	0.652±0.006	0.720±0.002	0.674±0.001
DCAN	0.934±0.001	0.953±0.001	0.651±0.010	0.724±0.005	0.682±0.003
TransICD	0.917±0.002	0.939±0.001	0.602±0.002	0.679±0.001	0.643±0.001
Fusion	0.932±0.001	0.952±0.000	0.664±0.003	0.727±0.003	0.679±0.001

MIMIC-III full (old)

Model	macro AUC	micro AUC	macro F1	micro F1	P@8	P@15
CNN	0.833±0.003	0.974±0.000	0.027±0.005	0.419±0.006	0.612±0.004	0.467±0.001
CAML	0.880±0.003	0.983±0.000	0.057±0.000	0.502±0.002	0.698±0.002	0.548±0.001
MultiResCNN	0.905±0.003	0.986±0.000	0.076±0.002	0.551±0.005	0.738±0.003	0.586±0.003
DCAN	0.837±0.005	0.977±0.001	0.063±0.002	0.527±0.002	0.721±0.001	0.572±0.001
TransICD	0.882±0.010	0.982±0.001	0.059±0.008	0.495±0.005	0.663±0.007	0.521±0.006
Fusion	0.910±0.003	0.986±0.000	0.076±0.007	0.555±0.008	0.744±0.003	0.588±0.003

MIMIC-III top-50 (old)

Model	macro AUC	micro AUC	macro F1	micro F1	P@5
CNN	0.892±0.003	0.920±0.003	0.583±0.006	0.652±0.008	0.627±0.007
CAML	0.865±0.017	0.899±0.008	0.495±0.035	0.593±0.020	0.597±0.016
MultiResCNN	0.898±0.006	0.928±0.003	0.590±0.012	0.666±0.013	0.638±0.005
DCAN	0.915±0.002	0.938±0.001	0.614±0.001	0.690±0.002	0.653±0.004
TransICD	0.895±0.003	0.924±0.002	0.541±0.010	0.637±0.003	0.617±0.005
Fusion	0.904±0.002	0.930±0.001	0.606±0.009	0.677±0.003	0.640±0.001

Authors

(in alphabetical order)

Abheesht Sharma @abheesht17
Juyong Kim @dalgu90
Suhas Shanbhogue @SuhasShanbhogue

Cite this work

@InProceeding{juyong2022anemic,
  title = {AnEMIC: A Framework for Benchmarking ICD Coding Models},
  author = {Kim, Juyong and Sharma, Abheesht and Shanbhogue, Suhas and Ravikumar, Pradeep and Weiss, Jeremy C},
  booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP), System Demonstrations},
  year = {2022},
  publisher = {ACL},
  url = {https://github.com/dalgu90/icd-coding-benchmark},
}

[^1]: Also referred to as medical coding, clinical coding, or simply ICD coding in other literature. They may have different meanings in detail. [^2]: Mullenbach, et al., Explainable Prediction of Medical Codes from Clinical Text, NAACL 2018 (paper, code) [^3]: Li and Yu, ICD Coding from Clinical Text Using Multi-Filter Residual Convolutional Neural Network, AAAI 2020 (paper, code) [^4]: Ji, et al., Dilated Convolutional Attention Network for Medical Code Assignment from Clinical Text, Clinical NLP Workshop 2020 (paper, code) [^5]: Biswas, et al., TransICD: Transformer Based Code-wise Attention Model for Explainable ICD Coding, AIME 2021 (paper, code) [^6]: Luo, et al., Fusion: Towards Automated ICD Coding via Feature Compression, ACL 2020 Findings (paper, code)

dalgu90 / icd-coding-benchmark

readme