dalgu90 / icd-coding-benchmark

Automatic ICD coding benchmark based on the MIMIC dataset
MIT License

Add configs for MIMIC-III full dataset & old 50 dataset / fix macro AUC metric #25

Closed dalgu90 closed 2 years ago

dalgu90 commented 2 years ago

The purpose of this branch is to add configs for the MIMIC-III full dataset and the old top-50 dataset (with CAML's split). The by-product updates of this branch are:

The numbers of labels and examples in the three datasets (MIMIC-III full, MIMIC-III top-50, and the old MIMIC-III top-50 with CAML's split) are as follows:

| | Full | Top-50 | Top-50 (CAML) |
|---|---|---|---|
| Total # classes | 8930 | 50 | 50 |
| Train split | 47723 (8693) | 44728 (50) | 8052 (50) |
| Val split | 1631 (3012) | 1569 (50) | 1569 (50) |
| Test split | 3372 (4085) | 3234 (50) | 1725 (50) |

In the last three rows, each cell gives the number of examples in the split, followed in parentheses by the number of labels that appear as positive at least once in that split. For example, out of the 8930 labels in the MIMIC-III full dataset, only 3012 appear as positive in the 1631 examples of the val split; the remaining 5918 labels never appear as positive there.
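Labels with no positive example in a split matter for macro-averaged AUC, because the per-label AUC is undefined when a label has only one class present. The sketch below illustrates one common convention: count the labels that occur in a split, and average AUC only over labels that have both positives and negatives. It is not taken from this repository; the function names (`count_present_labels`, `binary_auc`, `macro_auc`) and the toy data are hypothetical, and whether the benchmark's metric follows this exact convention is an assumption.

```python
from typing import List, Sequence


def count_present_labels(y_true: Sequence[Sequence[int]]) -> int:
    """Count label columns that are positive in at least one example."""
    num_labels = len(y_true[0])
    return sum(1 for j in range(num_labels) if any(row[j] for row in y_true))


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def macro_auc(
    y_true: Sequence[Sequence[int]], y_score: Sequence[Sequence[float]]
) -> float:
    """Mean per-label AUC, skipping labels without both classes (assumed convention)."""
    aucs: List[float] = []
    for j in range(len(y_true[0])):
        col_true = [row[j] for row in y_true]
        if 0 < sum(col_true) < len(col_true):
            col_score = [row[j] for row in y_score]
            aucs.append(binary_auc(col_true, col_score))
    return sum(aucs) / len(aucs) if aucs else float("nan")


# Toy split: 3 examples, 3 labels; the last label is never positive.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
y_score = [[0.9, 0.5, 0.1], [0.3, 0.8, 0.4], [0.6, 0.4, 0.2]]

present = count_present_labels(y_true)  # 2 of the 3 labels occur
score = macro_auc(y_true, y_score)      # averaged over those 2 labels only
```

In this toy split the third label is skipped rather than scored, so the average is taken over two labels instead of three.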

dalgu90 commented 2 years ago

Thanks Abheest! I renamed the preprocessing config dir.
I also added a Python script to run preprocessing, which uses your code snippet. We can run it like this:

$ python run_preprocessing.py --config_path configs/preprocessing/mimic3_50.yml