Emory-HITI / AI-Vengers

Race labels for MIMIC-CXR? #39

Open robintibor opened 3 years ago

robintibor commented 3 years ago

Hi,

I was wondering how to obtain the race labels for MIMIC-CXR.

I have access to https://physionet.org/content/mimic-cxr/2.0.0/ and https://physionet.org/content/mimic-cxr-jpg/2.0.0/ but could not find where the white/asian/black labels come from.

For example, how do you create the modified_viewposition_race_4-race-ethnicity_60-10-30_split_with_gender_age_ver_b.csv that you use in the training code?

Thanks for any help, Best, Robin

blackboxradiology commented 3 years ago

Hi Robin,

Race labels can be found here, under the core directory, in the admissions dataset. From there you can join on subject_id with the subject_id in the CXR data.
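A minimal sketch of that join, assuming the admissions table exposes `subject_id` and `ethnicity` columns (as in the snippet later in this thread) and the CXR metadata has a `subject_id` column. The `dicom_id` column, the helper name `attach_race_labels`, and the toy rows are illustrative assumptions, not the real data:

```python
import pandas as pd


def attach_race_labels(admissions_df, cxr_df):
    """Left-join the distinct (subject_id, ethnicity) pairs onto the CXR rows."""
    labels = admissions_df[["subject_id", "ethnicity"]].drop_duplicates()
    # how="left" keeps every CXR row, even when no admission record exists
    return cxr_df.merge(labels, on="subject_id", how="left")


# Toy frames standing in for admissions.csv and the CXR metadata (made-up values)
admissions_df = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "ethnicity": ["WHITE", "WHITE", "ASIAN"],
})
cxr_df = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "dicom_id": ["a", "b", "c"],
})

labelled = attach_race_labels(admissions_df, cxr_df)
```

Note that a subject with more than one distinct `ethnicity` value would produce more than one row per image after this merge, so it is worth checking the row count before and after.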

Let us know if we can help with anything else!

robintibor commented 3 years ago

Ah, amazing, thanks, that clears it up! Another question: am I understanding correctly that there is some code that preprocesses MIMIC-CXR and that is not in this repo? That is, one cannot just follow:

  1. Fork/Download the GitHub repository.
  2. Fetch the data from the data URLs for open-source datasets and drop them in the data folder.
  3. Run the corresponding training code and save the trained model in the models folder.

for MIMIC-CXR, because https://github.com/Emory-HITI/AI-Vengers/blob/cbdf593b0d852e3078abbc72cf92aad03496511d/training_code/CXR_training/MIMIC/MIMIC_resnet34_race_detection_2021_06_29.ipynb starts from a dataframe that you created with code that is not in this repo?

blackboxradiology commented 3 years ago

That's correct. At the moment you would have to join the CSV dataframes and make your own train-val-test splits, as we did for modified_viewposition_race_4-race-ethnicity_60-10-30_split_with_gender_age_ver_b.csv
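One possible way to build such a split is sketched below. The 60/10/30 proportions are read off the file name; splitting by `subject_id` (so that no patient appears in more than one split), the `split` column name, and the labels `train`/`validate`/`test` are assumptions for illustration and may differ from how the original CSV was made:

```python
import numpy as np
import pandas as pd


def split_by_subject(df, fractions=(0.6, 0.1, 0.3), seed=0):
    """Assign each subject (not each image) to train / validate / test."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(df["subject_id"].unique())
    n_train = round(fractions[0] * len(shuffled))
    n_val = round(fractions[1] * len(shuffled))
    # Everything after the train and validate blocks falls into test
    names = np.array(["test"] * len(shuffled), dtype=object)
    names[:n_train] = "train"
    names[n_train:n_train + n_val] = "validate"
    split_of = dict(zip(shuffled, names))
    out = df.copy()
    out["split"] = out["subject_id"].map(split_of)
    return out


# Toy data: 10 subjects with 2 images each (made-up values)
toy_df = pd.DataFrame({
    "subject_id": np.repeat(np.arange(10), 2),
    "dicom_id": [f"img{i}" for i in range(20)],
})
split_df = split_by_subject(toy_df)
```

Splitting at the subject level avoids leaking a patient's images across splits, which matters when one patient has several studies.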

robintibor commented 3 years ago

I see. One more question that came up: did you try to handle subjects with multiple values for ethnicity in any way? For example, the following code shows that there are 168 subjects who were entered both as BLACK/AFRICAN AMERICAN and as WHITE, and 2489 subjects entered as both OTHER and WHITE:

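import os

import pandas as pd

# mimic_folder: path to the directory that contains admissions.csv
admissions_df = pd.read_csv(os.path.join(mimic_folder, 'admissions.csv'))
# One row per distinct (subject_id, ethnicity) pair
ethnicity_df = admissions_df.loc[:, ['subject_id', 'ethnicity']].drop_duplicates()

# Subjects that still appear more than once have conflicting ethnicity values
v = ethnicity_df.subject_id.value_counts()
subject_id_more_than_once = v.index[v.gt(1)]

ambiguous_ethnicity_df = ethnicity_df[ethnicity_df.subject_id.isin(subject_id_more_than_once)]

# Count how often each combination of values occurs
grouped = ambiguous_ethnicity_df.groupby('subject_id')
grouped.aggregate(lambda x: "_".join(sorted(x))).ethnicity.value_counts()

blackboxradiology commented 3 years ago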

Wow! Great catch! As far as I know, we were unaware of this multiple-ethnicity problem. I will look into this and test with these changes. I suspect it could improve performance by reducing noise from mislabeled patients.
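One simple option, sketched here as an assumption rather than as what was eventually adopted, is to drop every subject with more than one distinct ethnicity value before joining the labels to the images. The helper name `drop_ambiguous_subjects` and the toy rows are hypothetical:

```python
import pandas as pd


def drop_ambiguous_subjects(admissions_df):
    """Keep only subjects whose admissions all report the same ethnicity."""
    pairs = admissions_df[["subject_id", "ethnicity"]].drop_duplicates()
    # Number of distinct ethnicity values per subject, aligned to each row
    n_values = pairs.groupby("subject_id")["ethnicity"].transform("nunique")
    return pairs[n_values == 1].reset_index(drop=True)


# Toy admissions table (made-up values): subject 2 has conflicting entries
toy_admissions = pd.DataFrame({
    "subject_id": [1, 1, 2, 2, 3],
    "ethnicity": ["WHITE", "WHITE", "BLACK/AFRICAN AMERICAN", "WHITE", "ASIAN"],
})
clean_labels = drop_ambiguous_subjects(toy_admissions)
```

Dropping is the most conservative choice; alternatives such as keeping the most frequent value would retain more patients but reintroduce some label uncertainty.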

Thank you!