dalgu90 / icd-coding-benchmark

Automatic ICD coding benchmark based on the MIMIC dataset

Reproduce CAML's old datasets #33

Closed dalgu90 closed 2 years ago

dalgu90 commented 2 years ago

Update (03/28/2022): The note preprocessing, vocab selection, and word2vec training have been changed to make the input as close to the CAML datasets as we can. The overall changes in the preprocessing and the resulting statistics of the datasets are summarized in the table below:

| | mimic_50 | mimic_50_old | mimic 50 (CAML) | mimic_full | mimic_full_old | mimic full (CAML) |
|---|---|---|---|---|---|---|
| Label loading | Correct | Incorrect | Incorrect | Correct | Incorrect | Incorrect |
| Text process* | Process 2 | Process 1 | Process 1 | Process 2 | Process 1 | Process 1 |
| W2V training | train | train+val+test | train+dev+test | train | train+val+test | train+dev+test |
| Vocab select | # occur | # occur | # notes | # occur | # occur | # notes |
| # labels** | 50 (1) | 50 (2) | 50 (2) | 8930 | 8922 | 8922 |
| vocab size | 50319 | 31019 | 51919 | 51344 | 58144 | 51919 |
| train (labels) | 44728 (50) | 8066 (50) | 8066 (50) | 47723 (8693) | 47723 (8686) | 47723 (8686) |
| val (labels) | 1569 (50) | 1573 (50) | 1573 (50) | 1631 (3012) | 1631 (3009) | 1631 (3009) |
| test (labels) | 3234 (50) | 1729 (50) | 1729 (50) | 3372 (4085) | 3372 (4075) | 3372 (4075) |

*) Process 1: lowercase + remove punctuation + remove numeral-only words; Process 2: Process 1 + remove stop words + stem/lemmatize.

**) The sets of ICD codes differ between 50 (1) and 50 (2).
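
For reference, here is a minimal sketch of what the two text-processing variants could look like, assuming NLTK for stop words and stemming. The function names and tokenization are illustrative assumptions, not the benchmark's actual implementation:

```python
# Illustrative sketch only (not the benchmark's code). Assumes NLTK with the
# "stopwords" corpus downloaded via nltk.download("stopwords").
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

_STOPWORDS = set(stopwords.words("english"))
_STEMMER = PorterStemmer()


def process_1(text):
    """Process 1: lowercase, remove punctuation, drop numeral-only tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if not t.isdigit()]


def process_2(text):
    """Process 2: Process 1 + remove stop words + stem."""
    return [_STEMMER.stem(t) for t in process_1(text) if t not in _STOPWORDS]
```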

Below are the results of CAML training on various MIMIC-III top-50 and full datasets, with additional variants to show the effect of the w2v training corpus. When we use CAML's top-50 split, using all splits (train+val+test) for w2v embedding training improves performance, but that is not the case with our top-50 split.

| dataset | # instances (train/val/test) | w2v training | vocab select | vocab size | macro AUC | micro AUC | macro F1 | micro F1 | P@5 |
|---|---|---|---|---|---|---|---|---|---|
| mimic_50 | 44728/1569/3234 | train | # occur | 50319 | 0.917969 | 0.942569 | 0.608479 | 0.688646 | 0.662709 |
| mimic_50 + w2v all | 44728/1569/3234 | train+val+test | # occur | 53513 | 0.917535 | 0.941546 | 0.617230 | 0.686787 | 0.659060 |
| mimic_50_old + w2v train | 8066/1573/1729 | train | # occur | 26328 | 0.859774 | 0.897530 | 0.482393 | 0.592196 | 0.602545 |
| mimic_50_old | 8066/1573/1729 | train+val+test | # occur | 31019 | 0.881043 | 0.908731 | 0.519399 | 0.610033 | 0.612955 |
| mimic 50 (CAML) | 8066/1573/1729 | train+dev+test | # notes on MIMIC-III full | 51919 | 0.875 | 0.909 | 0.532 | 0.614 | 0.609 |

| dataset | w2v training | vocab select | vocab size | macro AUC | micro AUC | macro F1 | micro F1 | P@8 | P@15 |
|---|---|---|---|---|---|---|---|---|---|
| mimic_full | train | # occur | 51344 | 0.890823 | 0.984352 | 0.048890 | 0.498832 | 0.703181 | 0.553875 |
| mimic_full + w2v all | train+val+test | # occur | 54606 | 0.892273 | 0.984531 | 0.052801 | 0.507814 | 0.705738 | 0.555793 |
| mimic_full_old + w2v train | train | # occur | 58114 | 0.883927 | 0.983631 | 0.053312 | 0.496372 | 0.696879 | 0.547786 |
| mimic_full_old | train+val+test | # occur | 54723 | 0.880379 | 0.983444 | 0.057407 | 0.500574 | 0.696582 | 0.546777 |
| mimic full (CAML) | train+dev+test | # notes | 51919 | 0.895 | 0.986 | 0.088 | 0.539 | 0.709 | 0.561 |

(# instances are the same for all full-dataset rows: 47723/1631/3372)
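
As a rough illustration of the two knobs compared above, the sketch below shows how the word2vec training corpus and the vocabulary-selection criterion ("# occur" vs. "# notes") could be varied. The gensim hyperparameters and the thresholds are assumptions, not the values used in the benchmark:

```python
# Hypothetical sketch (gensim >= 4.0 assumed); not the benchmark's actual code.
from collections import Counter

from gensim.models import Word2Vec


def train_w2v(token_lists, dim=100):
    # token_lists: tokenized notes, e.g. the train split only, or train+val+test
    return Word2Vec(sentences=token_lists, vector_size=dim, min_count=1, workers=4)


def vocab_by_occurrence(token_lists, min_count=3):
    # "# occur": keep words whose total occurrence count reaches min_count
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {w for w, c in counts.items() if c >= min_count}


def vocab_by_notes(token_lists, min_notes=3):
    # "# notes": keep words that appear in at least min_notes distinct notes
    counts = Counter(tok for tokens in token_lists for tok in set(tokens))
    return {w for w, c in counts.items() if c >= min_notes}
```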

========================

This branch reproduces the incorrect ICD code loading behavior of CAML's notebook in our preprocessing pipeline.
We can enable it by setting `incorrect_code_loading: true` under `preprocessing/params` in the preprocessing config files.
Adding this feature requires modifying multiple steps in `src/modules/preprocessing_pipeline.py`, so the option should be at the top level of the configuration.
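
A minimal config sketch, assuming the option sits alongside the other `preprocessing/params` entries (the surrounding keys are illustrative, not the full config):

```yaml
preprocessing:
  params:
    incorrect_code_loading: true   # reproduce CAML's old (incorrect) ICD code loading
```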

With this feature, we now have four kinds of ICD coding datasets: `mimic3_full`, `mimic3_50`, `mimic3_full_old`, and `mimic3_50_old`.
As before, we can create these datasets with a command like `python run_preprocessing.py --config_path configs/preprocessing/mimic3_full.yml`.

I will put the statistics on the datasets later.