dalgu90 / icd-coding-benchmark

Automatic ICD coding benchmark based on the MIMIC dataset

Reproduce CAML's old datasets #33

Closed dalgu90 closed 2 years ago

dalgu90 commented 2 years ago

Update (03/28/2022): The note preprocessing, vocab selection, and word2vec training have been changed to make the input as close to the CAML datasets as we can. The overall changes in the preprocessing and the resulting statistics of the datasets are summarized in the table below:

| | mimic_50 | mimic_50_old | mimic 50 (CAML) | mimic_full | mimic_full_old | mimic full (CAML) |
|---|---|---|---|---|---|---|
| Label loading | Correct | Incorrect | Incorrect | Correct | Incorrect | Incorrect |
| Text process* | Process 2 | Process 1 | Process 1 | Process 2 | Process 1 | Process 1 |
| W2V training | train | train+val+test | train+dev+test | train | train+val+test | train+dev+test |
| Vocab select | # occur | # occur | # notes | # occur | # occur | # notes |
| # labels** | 50 (1) | 50 (2) | 50 (2) | 8930 | 8922 | 8922 |
| vocab size | 50319 | 31019 | 51919 | 51344 | 58144 | 51919 |
| train (labels) | 44728 (50) | 8066 (50) | 8066 (50) | 47723 (8693) | 47723 (8686) | 47723 (8686) |
| val (labels) | 1569 (50) | 1573 (50) | 1573 (50) | 1631 (3012) | 1631 (3009) | 1631 (3009) |
| test (labels) | 3234 (50) | 1729 (50) | 1729 (50) | 3372 (4085) | 3372 (4075) | 3372 (4075) |

*) Process 1: lowercase + remove punctuation + remove numeral-only words; Process 2: Process 1 + remove stop words + stem/lemmatize.

**) The sets of ICD codes differ between 50 (1) and 50 (2).
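
For reference, here is a minimal sketch of what the two text-processing variants could look like, assuming NLTK for stop words and stemming. The function names and tokenization are illustrative assumptions, not the benchmark's actual implementation:

```python
# Illustrative sketch only (not the benchmark's code). Assumes NLTK with the
# "stopwords" corpus downloaded via nltk.download("stopwords").
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

_STOPWORDS = set(stopwords.words("english"))
_STEMMER = PorterStemmer()


def process_1(text):
    """Process 1: lowercase, remove punctuation, drop numeral-only tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if not t.isdigit()]


def process_2(text):
    """Process 2: Process 1 + remove stop words + stem."""
    return [_STEMMER.stem(t) for t in process_1(text) if t not in _STOPWORDS]
```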

Below are the results of CAML training on various MIMIC-III top-50 and full datasets, with additional variants to show the effect of the w2v training corpus. When we use CAML's top-50 split, using all splits (train+val+test) for w2v embedding training improves performance, but that is not the case with our top-50 split.

| dataset | # instances (train/val/test) | w2v training | vocab select | vocab size | macro AUC | micro AUC | macro F1 | micro F1 | P@5 |
|---|---|---|---|---|---|---|---|---|---|
| mimic_50 | 44728/1569/3234 | train | # occur | 50319 | 0.917969 | 0.942569 | 0.608479 | 0.688646 | 0.662709 |
| mimic_50 + w2v all | 44728/1569/3234 | train+val+test | # occur | 53513 | 0.917535 | 0.941546 | 0.617230 | 0.686787 | 0.659060 |
| mimic_50_old + w2v train | 8066/1573/1729 | train | # occur | 26328 | 0.859774 | 0.897530 | 0.482393 | 0.592196 | 0.602545 |
| mimic_50_old | 8066/1573/1729 | train+val+test | # occur | 31019 | 0.881043 | 0.908731 | 0.519399 | 0.610033 | 0.612955 |
| mimic 50 (CAML) | 8066/1573/1729 | train+dev+test | # notes on MIMIC-III full | 51919 | 0.875 | 0.909 | 0.532 | 0.614 | 0.609 |

| dataset | w2v training | vocab select | vocab size | macro AUC | micro AUC | macro F1 | micro F1 | P@8 | P@15 |
|---|---|---|---|---|---|---|---|---|---|
| mimic_full | train | # occur | 51344 | 0.890823 | 0.984352 | 0.048890 | 0.498832 | 0.703181 | 0.553875 |
| mimic_full + w2v all | train+val+test | # occur | 54606 | 0.892273 | 0.984531 | 0.052801 | 0.507814 | 0.705738 | 0.555793 |
| mimic_full_old + w2v train | train | # occur | 58114 | 0.883927 | 0.983631 | 0.053312 | 0.496372 | 0.696879 | 0.547786 |
| mimic_full_old | train+val+test | # occur | 54723 | 0.880379 | 0.983444 | 0.057407 | 0.500574 | 0.696582 | 0.546777 |
| mimic full (CAML) | train+dev+test | # notes | 51919 | 0.895 | 0.986 | 0.088 | 0.539 | 0.709 | 0.561 |

(# instances are the same for all full-dataset rows: 47723/1631/3372)
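
As a rough illustration of the two knobs compared above, the sketch below shows how the word2vec training corpus and the vocabulary-selection criterion ("# occur" vs. "# notes") could be varied. The gensim hyperparameters and the thresholds are assumptions, not the values used in the benchmark:

```python
# Hypothetical sketch (gensim >= 4.0 assumed); not the benchmark's actual code.
from collections import Counter

from gensim.models import Word2Vec


def train_w2v(token_lists, dim=100):
    # token_lists: tokenized notes, e.g. the train split only, or train+val+test
    return Word2Vec(sentences=token_lists, vector_size=dim, min_count=1, workers=4)


def vocab_by_occurrence(token_lists, min_count=3):
    # "# occur": keep words whose total occurrence count reaches min_count
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {w for w, c in counts.items() if c >= min_count}


def vocab_by_notes(token_lists, min_notes=3):
    # "# notes": keep words that appear in at least min_notes distinct notes
    counts = Counter(tok for tokens in token_lists for tok in set(tokens))
    return {w for w, c in counts.items() if c >= min_notes}
```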

========================

This branch reproduces the incorrect ICD code loading behavior of CAML's notebook in our preprocessing pipeline.
We can enable it by setting `incorrect_code_loading: true` under `preprocessing/params` in the preprocessing config files.
Adding this feature requires modifying multiple steps in `src/modules/preprocessing_pipeline.py`, so the option should be at the top level of the configuration.
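
A minimal config sketch, assuming the option sits alongside the other `preprocessing/params` entries (the surrounding keys are illustrative, not the full config):

```yaml
preprocessing:
  params:
    incorrect_code_loading: true   # reproduce CAML's old (incorrect) ICD code loading
```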

With this feature, we now have four kinds of ICD coding datasets: `mimic3_full`, `mimic3_50`, `mimic3_full_old`, and `mimic3_50_old`.
As before, we can create these datasets with a command like `python run_preprocessing.py --config_path configs/preprocessing/mimic3_full.yml`.

I will put the statistics on the datasets later.