Update (03/28/2022):
The note preprocessing / vocab selection / word2vec training has been changed to make the input as close to the CAML dataset as possible. Overall, the changes to the preprocessing are as follows:
- [x] Add an option to reproduce the incorrect code-loading behavior with `incorrect_code_loading: true` (using `pd.read_csv()` without `dtype`s and counting duplicates). This results in the wrong set of top-50 codes (see the sketch after this list).
- [x] Add an option to train word2vec on all splits (train+val+test) with `train_embed_with_all_split: true`.
- [x] Move truncation from the preprocessing to the dataset class, and change the max length from 2000 to 2500.
- [x] Change the P@13 metric (for models on the full dataset) to P@15.
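For the code-loading item, here is a minimal sketch of the two behaviors, assuming a MIMIC-III `DIAGNOSES_ICD.csv`/`PROCEDURES_ICD.csv`-style file with `HADM_ID` and `ICD9_CODE` columns (the helper names are illustrative, not the pipeline's actual API):

```python
import pandas as pd

def top50_codes_incorrect(path):
    # CAML-notebook behavior: no dtype, so an all-numeric code column (e.g.,
    # procedure codes) is parsed as integers and loses leading zeros, and
    # duplicate code assignments within an admission are counted repeatedly.
    df = pd.read_csv(path)
    return df["ICD9_CODE"].value_counts().head(50).index.tolist()

def top50_codes_correct(path):
    # Corrected behavior: read codes as strings and count each code at most
    # once per admission; this yields a different set of top-50 codes.
    df = pd.read_csv(path, dtype={"ICD9_CODE": str})
    df = df.drop_duplicates(subset=["HADM_ID", "ICD9_CODE"])
    return df["ICD9_CODE"].value_counts().head(50).index.tolist()
```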
The resulting statistics of the datasets are as follows:
| | mimic_50 | mimic_50_old | mimic 50 (CAML) | mimic_full | mimic_full_old | mimic full (CAML) |
|---|---|---|---|---|---|---|
| Label loading | Correct | Incorrect | Incorrect | Correct | Incorrect | Incorrect |
| Text process* | Process 2 | Process 1 | Process 1 | Process 2 | Process 1 | Process 1 |
| W2V training | train | train+val+test | train+dev+test | train | train+val+test | train+dev+test |
| Vocab select | # occur | # occur | # notes | # occur | # occur | # notes |
| # labels** | 50 (1) | 50 (2) | 50 (2) | 8930 | 8922 | 8922 |
| vocab size | 50319 | 31019 | 51919 | 51344 | 58144 | 51919 |
| train (labels) | 44728 (50) | 8066 (50) | 8066 (50) | 47723 (8693) | 47723 (8686) | 47723 (8686) |
| val (labels) | 1569 (50) | 1573 (50) | 1573 (50) | 1631 (3012) | 1631 (3009) | 1631 (3009) |
| test (labels) | 3234 (50) | 1729 (50) | 1729 (50) | 3372 (4085) | 3372 (4075) | 3372 (4075) |
*) Process 1: lowercase + remove punctuation + remove numeral-only words; Process 2: Process 1 + remove stop words + stem/lemmatize.
**) The sets of ICD codes differ between 50 (1) and 50 (2).
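For reference, a minimal sketch of the two text processes, using NLTK's stop-word list and Porter stemmer as stand-ins (the pipeline's actual tokenizer and stemmer/lemmatizer may differ):

```python
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))  # requires nltk.download("stopwords")
STEMMER = PorterStemmer()

def process_1(text):
    # lowercase + remove punctuation + remove numeral-only words
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if not t.isdigit()]

def process_2(text):
    # Process 1 + remove stop words + stem
    return [STEMMER.stem(t) for t in process_1(text) if t not in STOP_WORDS]
```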
Below are the results of CAML training on the various MIMIC-III top-50 and full datasets, with additional variants to show the effect of the w2v training corpus. When we use CAML's top-50 split, using all splits (train+val+test) for w2v embedding training improves performance, but that is not the case with our top-50 split.
On the top-50 datasets:

| dataset | instances (train/val/test) | w2v training | vocab select | vocab size | macro AUC | micro AUC | macro F1 | micro F1 | P@5 |
|---|---|---|---|---|---|---|---|---|---|
| mimic_50 | 44728/1569/3234 | train | # occur | 50319 | 0.917969 | 0.942569 | 0.608479 | 0.688646 | 0.662709 |
| mimic_50 + w2v all | 44728/1569/3234 | train+val+test | # occur | 53513 | 0.917535 | 0.941546 | 0.617230 | 0.686787 | 0.659060 |
| mimic_50_old + w2v train | 8066/1573/1729 | train | # occur | 26328 | 0.859774 | 0.897530 | 0.482393 | 0.592196 | 0.602545 |
| mimic_50_old | 8066/1573/1729 | train+val+test | # occur | 31019 | 0.881043 | 0.908731 | 0.519399 | 0.610033 | 0.612955 |
| mimic 50 (CAML) | 8066/1573/1729 | train+dev+test | # notes on MIMIC-III full | 51919 | 0.875 | 0.909 | 0.532 | 0.614 | 0.609 |
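The `w2v training` column is the only difference between the `+ w2v all`/`+ w2v train` variants and their base rows. A minimal sketch with gensim (≥ 4.0), assuming the notes are already tokenized into lists of words; the variable names are illustrative:

```python
from gensim.models import Word2Vec

def train_w2v(token_lists, dim=100):
    # token_lists: iterable of tokenized notes. Pass train-only notes for the
    # "train" setting, or train+val+test notes for the "train+val+test" setting.
    return Word2Vec(sentences=token_lists, vector_size=dim, min_count=3, workers=4)

# e.g., the "train+val+test" setting (the three note lists are assumed):
# w2v = train_w2v(train_notes + val_notes + test_notes)
```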
On the full datasets:

| dataset | w2v training | vocab select | vocab size | macro AUC | micro AUC | macro F1 | micro F1 | P@8 | P@15 |
|---|---|---|---|---|---|---|---|---|---|
| mimic_full | train | # occur | 51344 | 0.890823 | 0.984352 | 0.048890 | 0.498832 | 0.703181 | 0.553875 |
| mimic_full + w2v all | train+val+test | # occur | 54606 | 0.892273 | 0.984531 | 0.052801 | 0.507814 | 0.705738 | 0.555793 |
| mimic_full_old + w2v train | train | # occur | 58114 | 0.883927 | 0.983631 | 0.053312 | 0.496372 | 0.696879 | 0.547786 |
| mimic_full_old | train+val+test | # occur | 54723 | 0.880379 | 0.983444 | 0.057407 | 0.500574 | 0.696582 | 0.546777 |
| mimic full (CAML) | train+dev+test | # notes | 51919 | 0.895 | 0.986 | 0.088 | 0.539 | 0.709 | 0.561 |
(# instances are the same for all rows: 47723/1631/3372)
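For reference, P@k (P@5/P@8/P@15 above) is the fraction of a model's top-k scored codes that are true labels, averaged over instances. A minimal NumPy sketch, assuming dense score and binary label matrices:

```python
import numpy as np

def precision_at_k(scores, labels, k):
    # scores, labels: arrays of shape (num_instances, num_codes); labels in {0, 1}.
    topk = np.argsort(scores, axis=1)[:, -k:]        # indices of the k highest scores
    hits = np.take_along_axis(labels, topk, axis=1)  # 1 where a top-k code is a true label
    return hits.mean(axis=1).mean()                  # per-instance P@k, then average
```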
========================
This branch reproduces the incorrect ICD code loading behavior of CAML's notebook in our preprocessing pipeline.
We can enable it by setting `incorrect_code_loading: true` under `preprocessing/params` in the preprocessing config files.
Adding this feature requires modifying multiple steps in `src/modules/preprocessing_pipeline.py`, so the option lives at the top level of the configuration.
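For illustration, the flag might sit in a config like the following (the surrounding keys are assumed; only `incorrect_code_loading` is the option discussed here):

```yaml
# e.g., configs/preprocessing/mimic3_50_old.yml (hypothetical fragment)
preprocessing:
  params:
    incorrect_code_loading: true  # reproduce CAML's notebook code loading
```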
With this feature, we can now build four kinds of ICD coding datasets: `mimic3_full`, `mimic3_50`, `mimic3_full_old`, and `mimic3_50_old`.
As before, we can create these datasets with a command like `python run_preprocessing.py --config_path configs/preprocessing/mimic3_full.yml`.
I will put the statistics on the datasets later.