YerevaNN / mimic3-benchmarks

Python suite to construct benchmark machine learning datasets from the MIMIC-III 💊 clinical database.
https://arxiv.org/abs/1703.07771
MIT License

Missing diagnosis labels in episode*.csv generated by `extract_episode_from_subjects` #101

Open mistycheney opened 4 years ago

mistycheney commented 4 years ago

This bug shows up in the two episode*.csv files generated for patient 49037. In both files, no diagnosis column has the label 1, which is clearly wrong.

The cause is in `preprocessing.py`. In the function `extract_diagnosis_labels`, the `ICD9_CODE` column of the input dataframe `diagnoses` has a numerical dtype, so the columns of `labels` are also numerical. However, the match condition on Line 82 compares against the hardcoded list `diagnosis_labels`, which contains strings. The condition therefore never matches, and no diagnosis value is ever set to 1.

This bug affects every episode whose diagnosis ICD codes are all purely numeric (i.e. none are alphanumeric codes like V28492). In these cases pandas automatically infers the dtype as int64 rather than object/str, which triggers the bug.
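The dtype inference described above can be reproduced in isolation. This is a minimal sketch, not code from the repository: the two tiny CSV tables and the codes in them are made up for illustration, and only the column names `ICUSTAY_ID`, `ICD9_CODE` and `VALUE` are taken from this thread.

```python
import io

import pandas as pd

# Hypothetical miniature diagnosis tables (made-up rows, for illustration only).
numeric_only = io.StringIO("ICUSTAY_ID,ICD9_CODE\n1,4019\n1,5849\n")
mixed = io.StringIO("ICUSTAY_ID,ICD9_CODE\n2,4019\n2,V1582\n")

df_numeric = pd.read_csv(numeric_only)
df_mixed = pd.read_csv(mixed)

# With only digit codes, pandas infers an integer dtype for ICD9_CODE.
# With at least one alphanumeric code, the column stays string-like.
numeric_is_int = pd.api.types.is_integer_dtype(df_numeric["ICD9_CODE"])
mixed_is_int = pd.api.types.is_integer_dtype(df_mixed["ICD9_CODE"])

# After pivoting, the codes become column labels and keep their integer type,
# so a lookup with a string code does not find them.
df_numeric["VALUE"] = 1
labels = df_numeric.pivot(index="ICUSTAY_ID", columns="ICD9_CODE", values="VALUE")
found_as_str = "4019" in labels  # string lookup against integer column labels
found_as_int = 4019 in labels    # integer lookup against integer column labels

print(numeric_is_int, mixed_is_int, found_as_str, found_as_int)
```

The string lookup fails while the integer lookup succeeds, which is the mismatch with the string list `diagnosis_labels`.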

However, this bug does not seem to affect the labels in the task-specific datasets, which still look correct.

A fix is to add the line `diagnoses['ICD9_CODE'] = diagnoses['ICD9_CODE'].astype(str)` before `diagnoses['VALUE'] = 1`.
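To see the effect of the cast, here is a simplified sketch of a label-extraction function with and without it. It is an assumption-laden stand-in, not the repository's actual `extract_diagnosis_labels`: the function name `extract_labels_sketch`, the `cast_to_str` flag, the three-entry `diagnosis_labels` list and the sample rows are all hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the hardcoded list of label codes (strings).
diagnosis_labels = ["4019", "5849", "V1582"]


def extract_labels_sketch(diagnoses, cast_to_str):
    """Build a 0/1 label row per ICU stay (illustrative sketch only)."""
    diagnoses = diagnoses.copy()
    if cast_to_str:
        # The cast proposed in this issue: make codes comparable to the string list.
        diagnoses["ICD9_CODE"] = diagnoses["ICD9_CODE"].astype(str)
    diagnoses["VALUE"] = 1
    labels = (
        diagnoses[["ICUSTAY_ID", "ICD9_CODE", "VALUE"]]
        .drop_duplicates()
        .pivot(index="ICUSTAY_ID", columns="ICD9_CODE", values="VALUE")
        .fillna(0)
        .astype(int)
    )
    # Any label code not found among the columns is filled with zeros.
    for code in diagnosis_labels:
        if code not in labels:
            labels[code] = 0
    return labels[diagnosis_labels]


# A stay whose codes are all numeric, as pandas would load them (int64).
diagnoses = pd.DataFrame({"ICUSTAY_ID": [1, 1], "ICD9_CODE": [4019, 5849]})

without_fix = extract_labels_sketch(diagnoses, cast_to_str=False)
with_fix = extract_labels_sketch(diagnoses, cast_to_str=True)

print(without_fix.loc[1].tolist())
print(with_fix.loc[1].tolist())
```

Without the cast every label is 0. With it, the two codes present for the stay are set to 1. One caveat: `astype(str)` on an integer column cannot restore leading zeros that were already dropped on load, so if the codes come from a CSV, reading the column as a string in the first place (for example `pd.read_csv(..., dtype={"ICD9_CODE": str})`) may be the safer variant.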

KimballCai commented 3 years ago

I ran into this problem too, and it occurs in many episodes.