MIT-LCP / mimic-omop

Mapping the MIMIC-III database to the OMOP schema
MIT License

Use this ETL as a way to provide MIMIC in OMOP directly on the Physionet website #52

Open vojtechhuser opened 6 years ago

vojtechhuser commented 6 years ago

This ETL lets each local user download MIMIC and run the conversion themselves, so the same conversion is repeated at many sites.

How about converting once and letting sites download the converted dataset?

This would save MIMIC users some effort and lead to MIMIC being used more (and published about, so the authors get credit).

dsontag commented 6 years ago

Great idea!

chandryou commented 6 years ago

Thank you for this great work!

tompollard commented 6 years ago

Thanks for the suggestion @vojtechhuser. The mapping needs some work, but sharing the transformed dataset is something that we'd like to do once we're happy with it. We haven't been able to give this project the time it needs just because of competing priorities (research tasks, rebuilding PhysioNet, preparing the next release of MIMIC, etc), but it's on our to-do list.

vojtechhuser commented 6 years ago

Any updates on this? We would like to use MIMIC-III in a data quality tutorial at the OHDSI Symposium and desperately need someone who has run the code from this repo and can collaborate with us.

alistairewj commented 6 years ago

As soon as we publish something on PhysioNet we have to be able to support it and the ETL isn't ready. We are currently building ETLs for other ICU datasets so that our model doesn't overfit to MIMIC.

If by data quality you mean running Achilles, then I have done that, but the results aren't that useful on MIMIC because of the unique data structure and deidentification approach (e.g. deidentified ages ~ 300).
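
To make the age artifact concrete, here is a minimal Python sketch. It assumes, as the MIMIC-III documentation describes, that patients older than 89 have their date of birth shifted far into the past so that their computed age comes out around 300. The function names, the example dates, and the 150-year cutoff (the same limit that appears in the Achilles output later in this thread) are illustrative assumptions, not part of this ETL.

```python
from datetime import date

# Assumed cutoff for illustration; the Achilles output in this thread flags age > 150.
MAX_PLAUSIBLE_AGE = 150


def age_in_years(birth: date, at: date) -> int:
    """Whole years elapsed between birth and the reference date."""
    years = at.year - birth.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (at.month, at.day) < (birth.month, birth.day):
        years -= 1
    return years


def is_implausible(age: int) -> bool:
    """True when an age exceeds the assumed plausibility cutoff."""
    return age > MAX_PLAUSIBLE_AGE


# Made-up dates: the admission is shifted into the future, and the masked
# birth date sits about 300 years before it to hide an age above 89.
admission = date(2150, 6, 1)
masked_birth = date(1850, 6, 1)
ordinary_birth = date(2090, 9, 15)

masked_age = age_in_years(masked_birth, admission)
ordinary_age = age_in_years(ordinary_birth, admission)

print(masked_age, is_implausible(masked_age))
print(ordinary_age, is_implausible(ordinary_age))
```

A generic age check cannot tell a masked age from a data entry error, which is one reason tool output on MIMIC needs to be read with the deidentification scheme in mind.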

AEW0330 commented 6 years ago

@alistairewj the use @vojtechhuser is referring to is for a tutorial on how to use Achilles and two other data quality tool sets designed for use with OMOP data sources. The version of MIMIC we use doesn't need to be free of defects. It just needs to be usable - i.e. it won't break the tools because there are empty or missing tables or missing required variables. To the extent that it will resemble a real world data set with typical data quality issues that the tools can identify, it will meet our needs. Before I spend the effort to get this to run, can you give your sense of how likely it is to meet those needs?

alistairewj commented 6 years ago

MIMIC is a real world dataset, from a real hospital, but I don't know if I can fully answer your question without knowing the ins and outs of the tools you'll use. The ETL is incomplete; there are still a lot of unmapped concepts. I ran Achilles a few months ago and the output is hopefully informative for you (see below). You'll notice that there are a lot of reported "errors" around times/dates due to our deidentification approach (we randomly shift patient data into the future, therefore doing any analysis which aggregates distinct patients over time is flawed).

Type Message
ERROR 3-Number of persons by year of birth; should not have year of birth in the future, (n=44,374)
ERROR 101-Number of persons by age, with age at first observation period; should not have age > 150, (n=1,991)
ERROR 400-Number of persons with at least one condition occurrence, by condition_concept_id; 2 concepts in data are not in vocabulary
ERROR 400-Number of persons with at least one condition occurrence, by condition_concept_id; 228 concepts in data are not in correct vocabulary
ERROR Death event outside observation period, 510-Number of death records outside valid observation period; count (n=8,980) should not be > 0
ERROR 600-Number of persons with at least one procedure occurrence, by procedure_concept_id; 39 concepts in data are not in correct vocabulary
ERROR 610-Number of procedure occurrence records outside valid observation period; count (n=883) should not be > 0
ERROR 700-Number of persons with at least one drug exposure, by drug_concept_id; 4 concepts in data are not in correct vocabulary
ERROR 706 - Distribution of age by drug_concept_id (count = 1); min value should not be negative
ERROR 710-Number of drug exposure records outside valid observation period; count (n=12,437,292) should not be > 0
ERROR 711-Number of drug exposure records with end date < start date; count (n=15,922) should not be > 0
ERROR 717 - Distribution of quantity by drug_concept_id (count = 7); min value should not be negative
ERROR 806 - Distribution of age by observation_concept_id (count = 2); min value should not be negative
ERROR 810-Number of observation records outside valid observation period; count (n=85,787) should not be > 0
ERROR 814-Number of observation records with no value (numeric, string, or concept); count (n=99,839) should not be > 0
NOTIFICATION Unmapped data over percentage threshold in:Measurement
NOTIFICATION Count of unmapped source values exceeds threshold in: drug_exposure
NOTIFICATION [GeneralPopulationOnly] Count of distinct specialties of providers in the PROVIDER table is below threshold
NOTIFICATION No body weight data in MEASUREMENT table (under concept_id 3025315 (LOINC code 29463-7))
NOTIFICATION Unmapped data over percentage threshold in:Condition
NOTIFICATION Unmapped data over percentage threshold in:Procedure
NOTIFICATION Unmapped data over percentage threshold in:DrugExposure
NOTIFICATION Unmapped data over percentage threshold in:Observation
WARNING 5-Number of persons by ethnicity; data with unmapped concepts
WARNING 101-Number of persons by age, with age at first observation period; should not have age > 125, (n=1,991)
WARNING 400-Number of persons with at least one condition occurrence, by condition_concept_id; data with unmapped concepts
WARNING 402-Number of persons by condition occurrence start month, by condition_concept_id; 2 concepts have a 100% change in monthly count of events
WARNING 420-Number of condition occurrence records by condition occurrence start month; theres a 100% change in monthly count of events
WARNING 512-Distribution of time from death to last drug (count = 1); max value should not be positive, otherwise its a zombie with data >1mo after death
WARNING 514-Distribution of time from death to last procedure (count = 1); max value should not be positive, otherwise its a zombie with data >1mo after death
WARNING 515-Distribution of time from death to last observation (count = 1); max value should not be positive, otherwise its a zombie with data >1mo after death
WARNING 600-Number of persons with at least one procedure occurrence, by procedure_concept_id; data with unmapped concepts
WARNING 602-Number of persons by procedure occurrence start month, by procedure_concept_id; 6 concepts have a 100% change in monthly count of events
WARNING 620-Number of procedure occurrence records by procedure occurrence start month; theres a 100% change in monthly count of events
WARNING 700-Number of persons with at least one drug exposure, by drug_concept_id; data with unmapped concepts
WARNING 702-Number of persons by drug exposure start month, by drug_concept_id; 22 concepts have a 100% change in monthly count of events
WARNING 717-Distribution of quantity by drug_concept_id (count = 83); max value should not be > 600
WARNING 720-Number of drug exposure records by drug exposure start month; theres a 100% change in monthly count of events
WARNING 800-Number of persons with at least one observation occurrence, by observation_concept_id; data with unmapped concepts
WARNING 802-Number of persons by observation occurrence start month, by observation_concept_id; 7 concepts have a 100% change in monthly count of events
WARNING 820-Number of observation records by observation start month; theres a 100% change in monthly count of events
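
Many of the date-related messages above are what per-patient date shifting would be expected to produce. The following Python sketch illustrates the idea under stated assumptions: the offset range, the function names, and the two example patients are made up, and the real MIMIC shifting procedure may differ in its details.

```python
import random
from datetime import date, timedelta


def shift_patient(events, rng):
    """Shift every date of one patient by the same random offset."""
    # Hypothetical offset range (about 100 to 200 years), chosen for illustration.
    offset = timedelta(days=rng.randint(36_500, 73_000))
    return [d + offset for d in events]


def stay_days(events):
    """Days between the first and last event of a patient."""
    return (max(events) - min(events)).days


rng = random.Random(0)  # seeded so the sketch is repeatable

# Made-up data: two patients admitted in the same calendar month.
patients = {
    "a": [date(2012, 3, 1), date(2012, 3, 5)],
    "b": [date(2012, 3, 2), date(2012, 3, 9)],
}
shifted = {pid: shift_patient(ev, rng) for pid, ev in patients.items()}

for pid in patients:
    # Intervals within one patient survive the shift...
    print(pid, stay_days(patients[pid]), stay_days(shifted[pid]))
    # ...but the calendar position is now arbitrary and in the future.
    print(pid, patients[pid][0], "->", shifted[pid][0])
```

Because each patient gets a different offset, counts by calendar month or year mix unrelated patients. That is consistent with messages such as "year of birth in the future" and "100% change in monthly count of events", which probably reflect the shifting rather than defects in the mapping.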

AEW0330 commented 6 years ago

@alistairewj this is helpful. Thanks.

tomseinen commented 4 years ago

Any updates on sharing a complete version of MIMIC in OMOP on PhysioNet?

Especially now in COVID-19 times, I would very much like to work with a proper CDM at home, as I can't access my organisation's CDM. Alternative databases, like SynPUF, are too limited for the analyses I want to test.

Thank you, Tom

tompollard commented 4 years ago

We would be happy to share an OMOP version of MIMIC-III on PhysioNet. See also https://github.com/MIT-LCP/mimic-code/issues/725.

I suggest that someone from the OMOP community takes responsibility for putting together a submission to PhysioNet. The person should:

Once we receive a well described version of the dataset, we can move forward with publication. For instructions on submitting the project, see: https://physionet.org/about/publish/#sharing

vojtechhuser commented 4 years ago

That is great. I will work on a proposal, and I am happy to revise it as many times as needed until it meets all your requirements to the satisfaction of the PhysioNet review team. (tagging @parisni)

parisni commented 4 years ago

Hi all. Good news. I would be pleased to give some help to make this possible.

vojtechhuser commented 4 years ago

I started a draft today.

I will add @parisni and other important people.

vojtechhuser commented 4 years ago

I plan to use these access settings (let me know if that is wrong): [image]

tompollard commented 4 years ago

@vojtechhuser those access settings are correct. Not sure about "OMOP shaped data" as the title of the dataset, but presumably this is a placeholder!

vojtechhuser commented 4 years ago

The title is changed now. Please let me know who else wants to be invited (or does not want to be). So far, I have:

[image: current list of invitees]

vojtechhuser commented 4 years ago

What do people think about the number of projects? One project will be for the full data. Should we create another project that converts the demo data? (I am happy to do what MIT tells me.)

jmbanda commented 4 years ago

I would like an invite! I would love to be able to skip ETLing the data and get it in the OMOP format from the source.

alistairewj commented 4 years ago

If published as a credentialed project then it would be accessible to MIMIC users. The invite mentioned is for the authors of the project, i.e. those who helped create the ETL.

tompollard commented 4 years ago

One project will be for full data. Should we create another project that converts Demo data?

Yes, I think a separate project for each dataset is best. One of the benefits is that the MIMIC demo is open access (https://physionet.org/content/mimiciii-demo/1.4/), so the same permissions could be applied to the OMOP version.

AEW0330 commented 4 years ago

Excellent point Tom.

vojtechhuser commented 4 years ago

Based on the guidance, I have now created a sister "demo" project and invited folks there too.

AEW0330 commented 4 years ago

I'm checking whether the N3C project can support some of this work, i.e. pay for some of people's time and get more hands on deck. Who has a guess at the amount of work involved?

AEW0330 commented 4 years ago

The folks leading the National COVID Cohort Collaborative (N3C) seem to have some leeway in how its funding is allocated, and they indicate potential interest in supporting this. So I'm eager to answer their question about the amount of work. I'd take a guess myself, but I'm the least qualified in this group to do so.

tompollard commented 4 years ago

Interesting, thanks Andrew. @parisni @alistairewj @aparrot89 any thoughts on whether we should be putting in additional work to improve the mapping before the dataset is shared?

SSMK-wq commented 4 years ago

Hi, I am interested in being part of this project and am already a registered user of PhysioNet.

vojtechhuser commented 4 years ago

Formal funding would be great.

See notes in this shared folder: https://drive.google.com/open?id=1j-x-rwuYJr2nIs5zxCW6ST_Q-vPc1tfN

For folks willing to help, please put your name next to a table that you volunteer to tackle (port to GBQ or improve).

vojtechhuser commented 4 years ago

I propose a plan where multiple versions are released. We need initial versions to make people aware of the dataset, e.g., v0.1 with some tables. After that, v1.0 can use the existing mapping and v2.0 can use an improved mapping. Perfect should not be the enemy of good enough.

alistairewj commented 4 years ago

I can't say I agree with releasing an incomplete dataset on PhysioNet and justifying the lack of comprehensiveness with a "v0.1" tag.

vojtechhuser commented 4 years ago

The Google link permission is fixed. You can sign up for individual tables again here: https://drive.google.com/open?id=1j-x-rwuYJr2nIs5zxCW6ST_Q-vPc1tfN (file "central notes")

epiben commented 4 years ago

@vojtechhuser, I'd be happy to join this effort! I put myself on the measurement table.

vojtechhuser commented 4 years ago

The project description is now also in the central notes file at this link (pick the file "central notes"): https://drive.google.com/open?id=1j-x-rwuYJr2nIs5zxCW6ST_Q-vPc1tfN @AEW0330

vojtechhuser commented 4 years ago

From now on, the project is called Argos.

Major updates will be posted in this OHDSI forum thread:

https://forums.ohdsi.org/t/argos-project-2020-omoped-mimic-project/10926

Technical items will still be posted here.

tompollard commented 4 years ago

What is the need for the codename? MIMIC-OMOP seems clearer.