Missing diagnosis labels in episode*.csv generated by `extract_episode_from_subjects`

This bug can be seen in the two episode*.csv files generated for patient 49037. In both files, none of the diagnosis columns is set to 1, which cannot be right.

The cause is in preprocessing.py. In the function `extract_diagnosis_labels`, the `ICD9_CODE` column of the input dataframe `diagnoses` has a numerical dtype. As a result, the column names of `labels` are numerical as well. However, the match condition in Line 82 compares against the hardcoded list `diagnosis_labels`, which contains strings. A string never equals an integer, so the condition in Line 82 will never be true, and no diagnosis value will be set to 1.

This bug affects all episodes that have only numerical diagnosis ICD codes (i.e. no alphanumeric codes like V28492). In these cases pandas automatically infers the dtype to be int64 rather than object/str, which causes the bug.
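The mismatch can be reproduced with a minimal sketch. The CSV content, the ICUSTAY_ID values, and the two-entry `diagnosis_labels` list below are made-up stand-ins, and the pivot is only an assumption about how `labels` is built, not a copy of the code in preprocessing.py:

```python
import io

import pandas as pd

# Hypothetical diagnoses for one patient; every ICD-9 code is purely numeric.
csv_text = "ICUSTAY_ID,ICD9_CODE\n1,4019\n1,4280\n2,4019\n"
diagnoses = pd.read_csv(io.StringIO(csv_text))

# With no alphanumeric code present, pandas infers an integer dtype.
inferred_dtype = diagnoses["ICD9_CODE"].dtype

# Assumed shape of the label table: one row per stay, one column per code.
diagnoses["VALUE"] = 1
labels = (
    diagnoses[["ICUSTAY_ID", "ICD9_CODE", "VALUE"]]
    .drop_duplicates()
    .pivot(index="ICUSTAY_ID", columns="ICD9_CODE", values="VALUE")
    .fillna(0)
    .astype(int)
)

# Stand-in for the hardcoded list of string codes.
diagnosis_labels = ["4019", "4280"]

# The column names are integers, so none of the string codes is found.
column_names = labels.columns.tolist()
matched = [code for code in diagnosis_labels if code in column_names]

print(inferred_dtype, column_names, matched)
```

Here `matched` stays empty even though both codes occur in the data, which is the symptom described above.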

However, this bug does not seem to affect the labels in the task-specific datasets, which still look correct.

A fix is to add the line `diagnoses['ICD9_CODE'] = diagnoses['ICD9_CODE'].astype(str)` before `diagnoses['VALUE'] = 1`.
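A small sketch of the cast in context, under stated assumptions: the input data, the three-entry `diagnosis_labels` list, and the pivot-and-fill logic are simplified stand-ins for illustration rather than the actual body of `extract_diagnosis_labels`.

```python
import io

import pandas as pd

# Hypothetical diagnoses with only numeric ICD-9 codes (read as int64).
csv_text = "ICUSTAY_ID,ICD9_CODE\n1,4019\n1,4280\n2,4019\n"
diagnoses = pd.read_csv(io.StringIO(csv_text))

# Proposed fix: cast the codes to strings before building the labels.
diagnoses["ICD9_CODE"] = diagnoses["ICD9_CODE"].astype(str)
diagnoses["VALUE"] = 1

# Assumed label construction: one row per stay, one column per code.
labels = (
    diagnoses[["ICUSTAY_ID", "ICD9_CODE", "VALUE"]]
    .drop_duplicates()
    .pivot(index="ICUSTAY_ID", columns="ICD9_CODE", values="VALUE")
    .fillna(0)
    .astype(int)
)

# Stand-in for the hardcoded string list; "5849" is absent from the data.
diagnosis_labels = ["4019", "4280", "5849"]

# Add an all-zero column for every code that does not occur, then reorder.
for code in diagnosis_labels:
    if code not in labels:
        labels[code] = 0
labels = labels[diagnosis_labels]

print(labels)
```

With the cast in place, the string codes match the column names, so the codes that are present keep their 1 values and only the absent code is filled with 0.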

YerevaNN / mimic3-benchmarks, issue #101