MIT-LCP / mimic-code

MIMIC Code Repository: Code shared by the research community for the MIMIC family of databases
https://mimic.mit.edu
MIT License

No distinction between ethnicity and race in MIMIC-IV #1236

Open broganjb opened 2 years ago

broganjb commented 2 years ago

Description

When using MIMIC-IV as a source of validation data, my colleagues and I realized that there is no distinction between race and ethnicity in the admissions table of mimic_core.

We ran the following query:

select distinct ethnicity from mimic_core.admissions;

and received the following 8 distinct values:

BLACK/AFRICAN AMERICAN
AMERICAN INDIAN/ALASKA NATIVE
UNABLE TO OBTAIN
ASIAN
OTHER
UNKNOWN
HISPANIC/LATINO
WHITE

We were wondering how others have dealt with this lack of distinction between race and ethnicity. The US Census Bureau guidelines split ethnicity into Hispanic/Latino and not Hispanic/Latino:

https://www.cosb.us/home/showpublisheddocument/5935/637356700118370000.

However, the ethnicity variable in MIMIC-IV also contains what we commonly define as race (i.e. white, black, asian, american indian/alaska native). It would be great to get to the bottom of this as the demographics of MIMIC-IV are important for reporting, especially when comparing model performance across subpopulations to describe potential ethnic and racial disparities. There appear to be some subject_ids that have a race for one admission and an ethnicity for another admission, which further confuses reporting (example: subject_id==15743696).
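A quick way to see how many subjects are affected is to count the distinct values recorded per subject_id. The sketch below uses Python's built-in sqlite3 module with a made-up in-memory `admissions` table as a stand-in; the rows and IDs are illustrative assumptions, not MIMIC data. The SELECT itself uses only standard SQL, so it should carry over to mimic_core.admissions.

```python
import sqlite3

# In-memory stand-in for the admissions table; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE admissions (subject_id INTEGER, hadm_id INTEGER, ethnicity TEXT)"
)
conn.executemany(
    "INSERT INTO admissions VALUES (?, ?, ?)",
    [
        (1, 101, "WHITE"),
        (1, 102, "HISPANIC/LATINO"),
        (2, 201, "ASIAN"),
        (2, 202, "ASIAN"),
        (3, 301, "UNKNOWN"),
    ],
)

# Subjects whose admissions carry more than one distinct value.
query = """
    SELECT subject_id, COUNT(DISTINCT ethnicity) AS n_values
    FROM admissions
    GROUP BY subject_id
    HAVING COUNT(DISTINCT ethnicity) > 1
    ORDER BY subject_id
"""
conflicting = conn.execute(query).fetchall()
conn.close()
```

Here `conflicting` holds one `(subject_id, n_values)` tuple per subject with inconsistent values, which gives a list of cases to inspect by hand.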

Lastly, we are trying to figure out how others have reconciled multiple ethnicities being reported for the same subject_id across different hadm_ids. We understand that data are not always complete, but is it standard practice to report a specified race (e.g. BLACK/AFRICAN AMERICAN) rather than OTHER if a subject_id has at least one admission with the specified race?
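One possible rule, sketched in Python under stated assumptions: treat HISPANIC/LATINO as the only ethnicity signal in the column, treat OTHER, UNKNOWN, and UNABLE TO OBTAIN as uninformative, and prefer a specific race whenever at least one admission records it. The function name `summarize_subject` and the `MULTIPLE` label are hypothetical, and this is one convention among several, not an established standard.

```python
# Assumption: these values carry no race or ethnicity information.
UNINFORMATIVE = {"OTHER", "UNKNOWN", "UNABLE TO OBTAIN"}
# Assumption: the only Census-style ethnicity signal in the column.
HISPANIC = "HISPANIC/LATINO"


def summarize_subject(values):
    """Collapse the per-admission values of one subject_id into (race, ethnicity).

    `values` is the list of admissions.ethnicity strings, one per hadm_id.
    """
    # The column never records "not Hispanic", so absence of the Hispanic
    # value is reported as UNKNOWN rather than NOT HISPANIC/LATINO.
    ethnicity = HISPANIC if HISPANIC in values else "UNKNOWN"

    # Keep only specific race values.
    races = {v for v in values if v not in UNINFORMATIVE and v != HISPANIC}
    if len(races) == 1:
        race = next(iter(races))
    elif len(races) > 1:
        race = "MULTIPLE"  # hypothetical label for conflicting specific races
    elif "OTHER" in values:
        race = "OTHER"
    else:
        race = "UNKNOWN"
    return race, ethnicity


# Toy (subject_id, ethnicity) rows; the IDs are made up.
rows = [
    (1, "WHITE"), (1, "HISPANIC/LATINO"),
    (2, "OTHER"), (2, "BLACK/AFRICAN AMERICAN"),
    (3, "UNKNOWN"),
]
by_subject = {}
for subject_id, value in rows:
    by_subject.setdefault(subject_id, []).append(value)

summary = {sid: summarize_subject(vals) for sid, vals in by_subject.items()}
```

With this rule, subject 2 is reported as BLACK/AFRICAN AMERICAN rather than OTHER, and subject 1 gets both a race and an ethnicity. Whichever rule is chosen should be stated explicitly when reporting results.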

alistairewj commented 2 years ago

Yeah, so off the top of my head:

  • I'm pretty sure it was called ethnicity in MIMIC-II, so we've sort of kept the column name as a legacy.
  • We do not have any documentation of US Census style "ethnicity" in the raw data (maybe it exists, but it's not in our warehouse).
  • The column is documented on hospital admission and so... unfortunately there is some inconsistency. We could settle on a "best" way to do this and have a query in this repo (I welcome suggestions here). The column was intentionally kept in the admissions table to allow for transparency in this.
  • We do aggregate values for deidentification purposes (some entries have only 1-2 individuals) and we will look to increasing the granularity here.

marymlucas commented 2 years ago

I'm still fairly new to using the MIMIC datasets and a lot of the work I'm doing is around health disparities, so the race/ethnicity variables are ones I'm thinking deeply about. The approach I'm thinking of taking with this is just dealing with the race/ethnicity classifications in my R or Python code after I've already created my cohorts.

I'm working on writing some code that splits that column into separate race and ethnicity for each patient in my cohort based on the US Census recommendations and current thinking in the literature (see for example https://link.springer.com/article/10.1007/s40037-020-00602-3).

It would involve first finding all the admissions for a particular patient and seeing how the ethnicity variable is coded, and if both ethnicity and race values appear at different admissions as you mentioned then I would use that to populate both columns, and if not then the unknown would be coded as such.

Another option I've considered is creating a new table in my local copy of MIMIC-IV with subject_id, patient_race, and patient_ethnicity, based on the same code.

Not sure if this is the best approach but it's what I'm thinking right now. Happy to consider any and all suggestions.

whiskey0504 commented 1 month ago

Hi, I was wondering about the same issue on obtaining race and ethnicity info, especially a separate ethnicity variable on Hispanic vs. non-Hispanic.

I was wondering if you happen to know the logic by which the source data were collapsed into the current categories. For example, if a patient's race is recorded as both "Black" and "Asian" within the same hospitalization, is it collapsed to "Black"? Likewise, does the current "HISPANIC/LATINO" category include White Hispanic patients, such that the current "WHITE" category includes only non-Hispanic Whites?

Thanks so much and let me know if it's better if I open a separate issue.

alistairewj commented 1 month ago

No problem, this is a good issue to ask the follow-up in. The latest versions (both v2.2 and v3.0) have more granular race categories. What used to be BLACK is now BLACK/CARIBBEAN ISLAND, BLACK/AFRICAN AMERICAN, etc. The logic does not collapse distinct races together, as there is a MULTIPLE RACE/ETHNICITY value in the raw data which we retain.

As of v2.2 and v3.0, the only changes still made during the deid process are: (1) tribe status of Native Americans is removed, (2) some infrequent South East Asian races are merged into South East Asian, and (3) some Caribbean islands are merged. The rest of the data is as we get it from the hospital.
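For analyses that need the broader groups, the granular values can be collapsed again on the user's side. A minimal Python sketch, assuming granular values begin with the name of their broad group: apart from the values quoted above, the prefixes and the helper name `broad_race` are illustrative assumptions, not an official mapping.

```python
# Assumption: broad groups and the prefixes that identify them. Only
# BLACK/CARIBBEAN ISLAND, BLACK/AFRICAN AMERICAN and MULTIPLE RACE/ETHNICITY
# are quoted in this thread; check the rest against your MIMIC-IV version.
BROAD_PREFIXES = [
    ("MULTIPLE RACE/ETHNICITY", "MULTIPLE"),
    ("AMERICAN INDIAN", "AMERICAN INDIAN/ALASKA NATIVE"),
    ("BLACK", "BLACK"),
    ("ASIAN", "ASIAN"),
    ("WHITE", "WHITE"),
    ("HISPANIC", "HISPANIC/LATINO"),
]


def broad_race(value):
    """Map a granular race string to a broad group; unmatched values pass through."""
    cleaned = value.strip().upper()
    for prefix, group in BROAD_PREFIXES:
        if cleaned.startswith(prefix):
            return group
    return cleaned
```

Values that match no prefix are returned unchanged (after trimming and upper-casing) so they can be reviewed rather than silently dropped into a catch-all group.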