Open vojtechhuser opened 6 years ago
Great idea!
Thank you for this great work!
Thanks for the suggestion @vojtechhuser. The mapping needs some work, but sharing the transformed dataset is something that we'd like to do once we're happy with it. We haven't been able to give this project the time it needs just because of competing priorities (research tasks, rebuilding PhysioNet, preparing the next release of MIMIC, etc), but it's on our to-do list.
Any updates on this? We would like to use MIMIC-III in a data quality tutorial at the OHDSI symposium and desperately need someone who has run the code from this repo and can collaborate with us.
As soon as we publish something on PhysioNet we have to be able to support it and the ETL isn't ready. We are currently building ETLs for other ICU datasets so that our model doesn't overfit to MIMIC.
If by data quality you mean running Achilles, then I have done that, but the results aren't that useful on MIMIC because of the unique data structure and deidentification approach (e.g. deidentified ages ~ 300).
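To illustrate why age-based checks misfire here, the following is a minimal sketch rather than anything taken from the ETL. It assumes, as I understand the MIMIC-III documentation, that the dates of birth of the oldest patients are shifted so that their computed age comes out near 300. The function names, the example dates, and the cutoff of 150 (borrowed from the Achilles "age > 150" rule) are illustrative assumptions.

```python
from datetime import date

# Assumed cutoff, mirroring the Achilles "should not have age > 150" rule.
MASKED_AGE_THRESHOLD = 150


def age_in_years(birth: date, reference: date) -> int:
    """Whole years elapsed between birth and a reference date."""
    years = reference.year - birth.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference.month, reference.day) < (birth.month, birth.day):
        years -= 1
    return years


def is_masked_age(age: int) -> bool:
    """True when an age is implausible and most likely a deidentification artifact."""
    return age > MASKED_AGE_THRESHOLD


# Made-up rows: (person, date of birth, first admission), all shifted into the future.
rows = [
    ("p1", date(2100, 5, 1), date(2160, 6, 1)),
    ("p2", date(1850, 1, 1), date(2150, 3, 1)),
]

for person, birth, admit in rows:
    age = age_in_years(birth, admit)
    label = "masked" if is_masked_age(age) else "plausible"
    print(person, age, label)
```

A data quality tool that only sees the computed age has no way to tell a masked age from a genuine data entry error, which is why these rows surface as errors.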
@alistairewj the use @vojtechhuser is referring to is for a tutorial on how to use Achilles and two other data quality tool sets designed for use with OMOP data sources. The version of MIMIC we use doesn't need to be free of defects. It just needs to be usable - i.e. it won't break the tools because there are empty or missing tables or missing required variables. To the extent that it will resemble a real world data set with typical data quality issues that the tools can identify, it will meet our needs. Before I spend the effort to get this to run, can you give your sense of how likely it is to meet those needs?
MIMIC is a real world dataset, from a real hospital, but I don't know if I can fully answer your question without knowing the ins and outs of the tools you'll use. The ETL is incomplete; there are still a lot of unmapped concepts. I ran Achilles a few months ago and the output is hopefully informative for you (see below). You'll notice that there are a lot of reported "errors" around times/dates due to our deidentification approach (we randomly shift patient data into the future, therefore doing any analysis which aggregates distinct patients over time is flawed).
Type | Message |
---|---|
ERROR | 3-Number of persons by year of birth; should not have year of birth in the future, (n=44,374) |
ERROR | 101-Number of persons by age, with age at first observation period; should not have age > 150, (n=1,991) |
ERROR | 400-Number of persons with at least one condition occurrence, by condition_concept_id; 2 concepts in data are not in vocabulary |
ERROR | 400-Number of persons with at least one condition occurrence, by condition_concept_id; 228 concepts in data are not in correct vocabulary |
ERROR | Death event outside observation period, 510-Number of death records outside valid observation period; count (n=8,980) should not be > 0 |
ERROR | 600-Number of persons with at least one procedure occurrence, by procedure_concept_id; 39 concepts in data are not in correct vocabulary |
ERROR | 610-Number of procedure occurrence records outside valid observation period; count (n=883) should not be > 0 |
ERROR | 700-Number of persons with at least one drug exposure, by drug_concept_id; 4 concepts in data are not in correct vocabulary |
ERROR | 706 - Distribution of age by drug_concept_id (count = 1); min value should not be negative |
ERROR | 710-Number of drug exposure records outside valid observation period; count (n=12,437,292) should not be > 0 |
ERROR | 711-Number of drug exposure records with end date < start date; count (n=15,922) should not be > 0 |
ERROR | 717 - Distribution of quantity by drug_concept_id (count = 7); min value should not be negative |
ERROR | 806 - Distribution of age by observation_concept_id (count = 2); min value should not be negative |
ERROR | 810-Number of observation records outside valid observation period; count (n=85,787) should not be > 0 |
ERROR | 814-Number of observation records with no value (numeric, string, or concept); count (n=99,839) should not be > 0 |
NOTIFICATION | Unmapped data over percentage threshold in:Measurement |
NOTIFICATION | Count of unmapped source values exceeds threshold in: drug_exposure |
NOTIFICATION | [GeneralPopulationOnly] Count of distinct specialties of providers in the PROVIDER table is below threshold |
NOTIFICATION | No body weight data in MEASUREMENT table (under concept_id 3025315 (LOINC code 29463-7)) |
NOTIFICATION | Unmapped data over percentage threshold in:Condition |
NOTIFICATION | Unmapped data over percentage threshold in:Procedure |
NOTIFICATION | Unmapped data over percentage threshold in:DrugExposure |
NOTIFICATION | Unmapped data over percentage threshold in:Observation |
WARNING | 5-Number of persons by ethnicity; data with unmapped concepts |
WARNING | 101-Number of persons by age, with age at first observation period; should not have age > 125, (n=1,991) |
WARNING | 400-Number of persons with at least one condition occurrence, by condition_concept_id; data with unmapped concepts |
WARNING | 402-Number of persons by condition occurrence start month, by condition_concept_id; 2 concepts have a 100% change in monthly count of events |
WARNING | 420-Number of condition occurrence records by condition occurrence start month; theres a 100% change in monthly count of events |
WARNING | 512-Distribution of time from death to last drug (count = 1); max value should not be positive, otherwise its a zombie with data >1mo after death |
WARNING | 514-Distribution of time from death to last procedure (count = 1); max value should not be positive, otherwise its a zombie with data >1mo after death |
WARNING | 515-Distribution of time from death to last observation (count = 1); max value should not be positive, otherwise its a zombie with data >1mo after death |
WARNING | 600-Number of persons with at least one procedure occurrence, by procedure_concept_id; data with unmapped concepts |
WARNING | 602-Number of persons by procedure occurrence start month, by procedure_concept_id; 6 concepts have a 100% change in monthly count of events |
WARNING | 620-Number of procedure occurrence records by procedure occurrence start month; theres a 100% change in monthly count of events |
WARNING | 700-Number of persons with at least one drug exposure, by drug_concept_id; data with unmapped concepts |
WARNING | 702-Number of persons by drug exposure start month, by drug_concept_id; 22 concepts have a 100% change in monthly count of events |
WARNING | 717-Distribution of quantity by drug_concept_id (count = 83); max value should not be > 600 |
WARNING | 720-Number of drug exposure records by drug exposure start month; theres a 100% change in monthly count of events |
WARNING | 800-Number of persons with at least one observation occurrence, by observation_concept_id; data with unmapped concepts |
WARNING | 802-Number of persons by observation occurrence start month, by observation_concept_id; 7 concepts have a 100% change in monthly count of events |
WARNING | 820-Number of observation records by observation start month; theres a 100% change in monthly count of events |
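The date-related errors and the "100% change in monthly count" warnings in the table follow from the date shifting described before it. A minimal sketch of the effect, assuming the shift is a single constant offset per patient (the events, the offset, and the function names are made up for illustration): calendar dates stop being comparable across patients, but intervals within one patient are unchanged.

```python
from datetime import datetime, timedelta


def shift_events(events, offset_days):
    """Apply one constant offset (in days) to every event of a single patient."""
    return [(name, t + timedelta(days=offset_days)) for name, t in events]


def hours_since_first_event(events):
    """Express each event as hours elapsed since the patient's earliest event."""
    start = min(t for _, t in events)
    return {name: (t - start).total_seconds() / 3600 for name, t in events}


# Made-up timeline for one patient.
original = [
    ("admit", datetime(2012, 3, 1, 8, 0)),
    ("drug", datetime(2012, 3, 1, 20, 0)),
    ("discharge", datetime(2012, 3, 4, 8, 0)),
]

# Hypothetical per-patient offset; a different patient would get a different one.
shifted = shift_events(original, offset_days=40000)

print(hours_since_first_event(original))
print(hours_since_first_event(shifted))
```

Both calls print the same relative times, so analyses anchored to a per-patient event such as admission remain valid, while counts by calendar month or year across patients do not.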
@alistairewj this is helpful. Thanks.
Any updates on sharing a complete version of MIMIC in OMOP on PhysioNet?
Especially now in COVID-19 times, I would very much like to work with a proper CDM at home, as I can't access my organisation's CDM. Alternative databases, like SynPUF, are too limited for the analyses I want to test.
Thank you, Tom
We would be happy to share an OMOP version of MIMIC-III on PhysioNet. See also https://github.com/MIT-LCP/mimic-code/issues/725.
I suggest that someone from the OMOP community takes responsibility for putting together a submission to PhysioNet. The person should:
Once we receive a well described version of the dataset, we can move forward with publication. For instructions on submitting the project, see: https://physionet.org/about/publish/#sharing
That is great. I will work on a revised proposal that I am happy to revise multiple times until I hit all your requirements to the satisfaction of the PhysioNet reviewing team. (tagging @parisni )
Hi all. Good news. I would be pleased to give some help to make this possible.
Today - I started a draft.
I will add @parisni and other important people.
I plan to use (let me know if that is wrong)
@vojtechhuser those access settings are correct. Not sure about "OMOP shaped data" as the title of the dataset, but presumably this is a placeholder!
The title is changed now. Please let me know who else wants to be invited (or does not want to be). So far, I have
What do people think about the number of projects? One project will be for the full data. Should we create another project that converts the demo data? (I am happy to do what MIT tells me.)
I would like an invite! I would love to be able to skip ETLing the data and getting it in the OMOP format from source.
If published as a credentialed project then it would be accessible to MIMIC users. The invite mentioned is for the authors of the project, i.e. those who helped create the ETL.
> One project will be for full data. Should we create another project that converts Demo data?
Yes, I think separate projects for each dataset is best. One of the benefits is that the MIMIC demo is open access (https://physionet.org/content/mimiciii-demo/1.4/), so the same permissions could be applied to the OMOP version.
Excellent point Tom.
Based on this guidance, I have now created a sister "demo" project and invited folks there too.
I'm seeing whether the N3C project can support some of this work: pay for some of people's time and get more hands on deck. Who has a guess at the amount of work involved?
Folks leading that seem to have some leeway with unspecified cash allocations to fund it - it being the National Covid Cohort Collaborative (N3C) - and indicate potential interest in supporting this. So I'm eager to respond to their question about the amount of work. I'd take a guess myself but I'm the least fit amongst this group to do so.
Interesting, thanks Andrew. @parisni @alistairewj @aparrot89 any thoughts on whether we should be putting in additional work to improve the mapping before the dataset is shared?
Hi, I am interested to be part of this project and am already a registered user of Physionet.
Formal funding would be great.
See notes in this shared folder: https://drive.google.com/open?id=1j-x-rwuYJr2nIs5zxCW6ST_Q-vPc1tfN
For folks willing to help, please put your name next to a table that you volunteer to tackle (port to GBQ or improve)
I propose a plan where multiple versions are released. We need initial versions to make people aware of the dataset, e.g., v0.1 with some tables. After that, v1.0 could use the existing mapping and v2.0 could come with an improved mapping. Perfect should not be the enemy of good enough.
I can't say I agree with releasing an incomplete dataset on PhysioNet and justifying the lack of completeness with a "v0.1" tag.
The Google link permission is fixed. You can sign up for individual tables again here: https://drive.google.com/open?id=1j-x-rwuYJr2nIs5zxCW6ST_Q-vPc1tfN (file central notes)
@vojtechhuser, I'd be happy to help join this effort! I put myself on the measurement table.
The project description is now also in Central Notes. At this link (pick file central notes) https://drive.google.com/open?id=1j-x-rwuYJr2nIs5zxCW6ST_Q-vPc1tfN @AEW0330
From now on, the project is called Argos.
This OHDSI forum thread is used for major updates.
https://forums.ohdsi.org/t/argos-project-2020-omoped-mimic-project/10926
Technical items will still be posted here.
What is the need for the codename? MIMIC-OMOP seems clearer.
This ETL lets local users download MIMIC and run the conversion themselves, so the same conversion is repeated at many sites.
How about converting once and letting sites download the converted dataset?
This would save MIMIC users some effort and make MIMIC more widely used (and published about, which brings credit).