Minor issue in preprocessing demographics

YerevaNN / mimic3-benchmarks

Python suite to construct benchmark machine learning datasets from the MIMIC-III 💊 clinical database.

https://arxiv.org/abs/1703.07771

MIT License


Minor issue in preprocessing demographics #27

Closed Taha-Bahadori closed 6 years ago

Taha-Bahadori commented 7 years ago

I believe the following is a better way of processing the demographics:

e_map = {'ASIAN': 1,
         'BLACK': 2,
         'CARIBBEAN ISLAND': 2,
         'HISPANIC': 3,
         'SOUTH AMERICAN': 3,
         'WHITE': 4,
         'MIDDLE EASTERN': 4,
         'PORTUGUESE': 4,
         'AMERICAN INDIAN': 5,
         'NATIVE HAWAIIAN': 6,
         'UNABLE TO OBTAIN': 0,
         'PATIENT DECLINED TO ANSWER': 0,
         'UNKNOWN': 0,
         '': 0}

Your preprocessing ignores American Indians and Native Hawaiians, and it does not handle Caribbean Islanders, South Americans, or Middle Easterners properly.
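To make the proposal concrete, here is a minimal sketch of how a map like this could be applied to raw ethnicity strings. The helper name `encode_ethnicity` is hypothetical, and so are two assumptions: raw values may carry a suffix after " - ", "/" or " OR " (for example "WHITE - RUSSIAN"), and any value missing from the map falls back to 0. It is not the repository's actual preprocessing code.

```python
# Proposed mapping from the comment above, copied verbatim.
E_MAP = {'ASIAN': 1,
         'BLACK': 2,
         'CARIBBEAN ISLAND': 2,
         'HISPANIC': 3,
         'SOUTH AMERICAN': 3,
         'WHITE': 4,
         'MIDDLE EASTERN': 4,
         'PORTUGUESE': 4,
         'AMERICAN INDIAN': 5,
         'NATIVE HAWAIIAN': 6,
         'UNABLE TO OBTAIN': 0,
         'PATIENT DECLINED TO ANSWER': 0,
         'UNKNOWN': 0,
         '': 0}


def encode_ethnicity(raw):
    """Hypothetical helper: map a raw ethnicity string to a group code."""
    if raw is None:
        return E_MAP['']
    # Assumption: keep only the leading category, dropping any suffix
    # that follows ' OR ', ' - ' or '/'.
    key = raw.strip().upper().replace(' OR ', '/')
    key = key.split(' - ')[0].split('/')[0].strip()
    # Assumption: values that are not in the map fall back to group 0.
    return E_MAP.get(key, 0)


for value in ['WHITE - RUSSIAN', 'BLACK/AFRICAN AMERICAN',
              'HISPANIC OR LATINO', 'AMERICAN INDIAN/ALASKA NATIVE']:
    print(value, '->', encode_ethnicity(value))
```

With these assumptions the four sample strings map to 4, 2, 3 and 5 respectively.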

turambar commented 6 years ago

@Taha-Bahadori good catch.

So if I understand correctly, you are proposing the following changes:

eliminate old group 5 for "OTHER"
add new groups 5 and 6 for AMERICAN INDIAN and NATIVE HAWAIIAN
map AMERICAN INDIAN to new group 5 instead of 0
map CARIBBEAN ISLAND to 2 (with BLACK) instead of "OTHER"
map SOUTH AMERICAN to 3 (with HISPANIC) instead of "OTHER"
map MIDDLE EASTERN to 4 (with WHITE) instead of "OTHER"
map PORTUGUESE to 4 (with WHITE) instead of "OTHER"

That all makes sense to me -- my primary concern regards the sizes of groups 5 (AMERICAN INDIAN: 54) and 6 (NATIVE HAWAIIAN: 18). Those are pretty small, and they might get smaller after filtering.
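One way to check this concern after filtering is to count the members of each group and flag the ones below a chosen threshold. The sketch below uses a hypothetical helper `small_groups`, an arbitrary threshold, and toy data. The counts are illustrative only and are not taken from MIMIC-III.

```python
from collections import Counter


def small_groups(codes, min_size):
    """Return {code: count} for every group with fewer than min_size members."""
    sizes = Counter(codes)
    return {code: n for code, n in sizes.items() if n < min_size}


# Toy group codes, not real cohort statistics.
codes = [4] * 10 + [2] * 6 + [5] * 2 + [6] * 1
print(small_groups(codes, min_size=5))  # {5: 2, 6: 1}
```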

Regardless, I think we'll add some version of your proposal to the next release, which is coming soon.

turambar commented 6 years ago

PR: https://github.com/YerevaNN/mimic3-benchmarks/pull/33

Taha-Bahadori commented 6 years ago

Based on your statistics, I think we can easily merge the AMERICAN INDIAN and NATIVE HAWAIIAN categories into category 0 (which is essentially everything else). I also understand that UNKNOWN, DECLINED, and UNABLE are conceptually different from those two, but I think statistically this won't make any difference.
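As a rough sketch of that suggestion, the merged map can be derived from the original proposal by sending codes 5 and 6 to 0. The variable names `proposed` and `merged` are illustrative, and this is not necessarily what the linked PR implements.

```python
# Original proposal from the first comment in this thread.
proposed = {'ASIAN': 1,
            'BLACK': 2,
            'CARIBBEAN ISLAND': 2,
            'HISPANIC': 3,
            'SOUTH AMERICAN': 3,
            'WHITE': 4,
            'MIDDLE EASTERN': 4,
            'PORTUGUESE': 4,
            'AMERICAN INDIAN': 5,
            'NATIVE HAWAIIAN': 6,
            'UNABLE TO OBTAIN': 0,
            'PATIENT DECLINED TO ANSWER': 0,
            'UNKNOWN': 0,
            '': 0}

# Fold the two small groups (5 and 6) into the catch-all group 0.
merged = {name: (0 if code in (5, 6) else code)
          for name, code in proposed.items()}

print(sorted(set(merged.values())))  # [0, 1, 2, 3, 4]
```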

turambar commented 6 years ago

See this PR: https://github.com/YerevaNN/mimic3-benchmarks/pull/33

Will merge the PR soon now that the 1.0 release is done.

hrayrhar commented 6 years ago

Merged!