OHDSI / Eunomia

An R package that facilitates access to a variety of OMOP CDM sample data sets.
https://ohdsi.github.io/Eunomia/

EunomiaDatasets format #50

Closed ablack3 closed 1 year ago

ablack3 commented 1 year ago

I've built some additional Eunomia datasets available here in duckdb v0.8 format.

I was wondering if we can use Parquet files instead of CSV, since Parquet is smaller than CSV.

I was also wondering if I can include a full vocabulary as the full vocab tables are very helpful for examples.

All of these CDMs use the same vocab, so it would be more efficient to store the vocab tables just once and allow Eunomia to use two separate links: one for the vocab tables and one for the other CDM tables.
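
To make the two-link idea concrete, here is a rough sketch of how a shared vocab archive and a per-dataset clinical archive could be combined into one database. The directory names and the duckdb file are placeholders, not an existing Eunomia API.

library(DBI)
library(duckdb)

con <- dbConnect(duckdb(), "gibleed.duckdb")

# load the shared vocab tables and the dataset-specific clinical tables from
# two separately downloaded folders of parquet files into one database
for (dir in c("vocab_parquet", "gibleed_cdm_parquet")) {
  for (f in list.files(dir, pattern = "\\.parquet$", full.names = TRUE)) {
    tbl <- tools::file_path_sans_ext(basename(f))
    dbExecute(con, sprintf("CREATE TABLE %s AS SELECT * FROM read_parquet('%s')", tbl, f))
  }
}

dbDisconnect(con, shutdown = TRUE)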

What do you all think?

@fdefalco, @schuemie

fdefalco commented 1 year ago

I'm open to switching to Parquet. I would like to test/benchmark the impact of the switch on file size and load performance.
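
A minimal benchmark sketch along those lines, assuming a table has already been exported to both formats (the file names are placeholders):

# on-disk size of the same table in both formats
file.size("concept.csv")
file.size("concept.parquet")

# time to load each into DuckDB
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
system.time(dbExecute(con, "CREATE TABLE concept_csv AS SELECT * FROM read_csv_auto('concept.csv')"))
system.time(dbExecute(con, "CREATE TABLE concept_pq AS SELECT * FROM read_parquet('concept.parquet')"))
dbDisconnect(con, shutdown = TRUE)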

We haven't included a full vocabulary in the past because the licensing of some vocabularies differs across organizations. The size of the overall vocabulary is also somewhat cumbersome to include in a sample, as it would account for 95% of the data size if included in full.

My desire was to have an API for Athena that could be used to construct a vocabulary to support a particular data source in a meaningful way, but that has not been implemented.

I know that some of our samples contain different vocabulary content because we used pruning techniques to include only the vocabulary content that applied to the clinical tables. In that case, a single, complete vocabulary might also be cumbersome. That depends on the loading and query time of whatever technologies we are using for storage and loading. Perhaps the switch to Parquet alleviates those concerns?
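
For reference, one way such pruning can be expressed; this is only a sketch, not necessarily the technique that was actually used, and `con` is assumed to be a DBI connection to a full CDM:

library(DBI)

# keep only the concepts referenced by the clinical tables, plus their ancestors
dbExecute(con, "
  CREATE TEMP TABLE used_concepts AS
  SELECT DISTINCT condition_concept_id AS concept_id FROM condition_occurrence
  UNION
  SELECT DISTINCT drug_concept_id FROM drug_exposure
  UNION
  SELECT DISTINCT ancestor_concept_id
  FROM concept_ancestor
  WHERE descendant_concept_id IN (
    SELECT condition_concept_id FROM condition_occurrence
    UNION
    SELECT drug_concept_id FROM drug_exposure
  )
")

# a pruned concept table would then be concept restricted to used_concepts
dbGetQuery(con, "
  SELECT count(*) FROM concept
  WHERE concept_id IN (SELECT concept_id FROM used_concepts)
")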

schuemie commented 1 year ago

In my experience Parquet is really efficient, and at least in Python you can run DuckDB directly on Parquet files (which is also really performant). This is certainly an interesting area to explore.
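
The same thing works from R; a minimal sketch, assuming a CDM table has been exported to Parquet (the file path is a placeholder):

library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
# DuckDB queries the Parquet file in place; no import step is needed
dbGetQuery(con, "SELECT count(*) AS n_persons FROM read_parquet('person.parquet')")
dbDisconnect(con, shutdown = TRUE)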

ablack3 commented 1 year ago

The full vocab is soooo nice though. And it's optional. The default Eunomia is still the same as it has been. But the user could choose to download a larger eunomia cdm (~3-5GB) with a full vocab. I don't know much about licensing issues but my understanding is that there is some issue with the CPT code descriptions...

Yeah, a more complete vocabulary is very nice for testing and examples. The current Eunomia dataset does not even have domain_id filled in, which caused an issue for me in testing. One possible approach is for me to experiment a bit in CDMConnector and then contribute things that seem to work well into Eunomia, since Eunomia has more reverse dependencies.

In the example datasets I created, the vocab is the bulk of the data, so it would be nice not to repeat it multiple times in the GitHub repo.

ablack3 commented 1 year ago

I agree an Athena API is needed. I do know one place to download full vocabs from a public s3 bucket though... https://github.com/OHDSI/OHDSIonAWS/blob/0ecc5f2af0bd2b487804612b82f86a1b2d1577bf/datasources/CMSDESynPUF100k.sql#L25C31-L25C76

So it seems like Eunomia should be ok too then. 😄

fdefalco commented 1 year ago

My suggestion is to confirm with the vocabulary team that we're allowed to openly distribute the complete vocabulary before adding it to any particular package. This might also be an opportunity to version the vocabulary releases, which wouldn't necessarily go in the Eunomia package, but we could provide a separate repository and then enable downloads much like we do with Eunomia.

The sample data sets typically have a very small subset of concept identifiers in them; what benefit is there to having the complete vocabulary in those cases?

ablack3 commented 1 year ago

> The sample data sets typically have a very small subset of concept identifiers in them; what benefit is there to having the complete vocabulary in those cases?

Many people just getting into OHDSI have no access to any OMOP CDM data. For newcomers, being able to very quickly start using an OMOP CDM is a game-changer for onboarding. Similar efforts like OHDSI in a Box and OHDSI on AWS both provide CDM datasets with full vocabularies.

I'm not really sure of all the implications of including a vocab with just the subset of concepts that occur in the CDM data. Certainly if you do this you won't get correct answers to queries like "What are all the descendants of concept X?" so it won't be a good tool to train people to use and experiment with the vocab tables.
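
For example, here is the kind of query that silently returns an incomplete answer on a pruned vocab. This is just a sketch: `con` is assumed to be a DatabaseConnector connection to the CDM, and 192671 is only an example ancestor concept_id.

library(DatabaseConnector)

querySql(con, "
  SELECT c.concept_id, c.concept_name
  FROM concept_ancestor ca
  JOIN concept c ON c.concept_id = ca.descendant_concept_id
  WHERE ca.ancestor_concept_id = 192671 -- placeholder ancestor concept
")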

One exercise we did last week was to build a concept set for a clinical idea and I think for that exercise it's helpful to have a complete vocabulary even if those concepts don't all occur in the data.

This reminds me of a similar discussion I've had: if a CDM is missing tables or fields, should it still be considered a CDM? Why not leave off the irrelevant tables? In practice, I think that in the wild, CDM tables or columns can be entirely missing, and the user has to adapt and decide whether it is still a "valid CDM".

An incomplete CDM is a very strange CDM to use because you don't really know what you're missing. "Is my query/test giving an unexpected result because my vocab is incomplete?" is a constant question to consider. Even after using Eunomia for years, I only recently realized, when it caused a test to fail, that I could not rely on concept.domain_id in Eunomia because it is not populated.

Eunomia only supports very specific cohorts. A small-ish but still complete CDM, like the ~6GB covid19_200k Synthea CDM or the 3GB 10k covid19 CDM I created for the Oxford summer school last week, gives newcomers more flexibility to come up with their own studies on somewhat realistic data. My hope/plan is to push this even further by making the synthetic data in Eunomia more closely aligned with real data, which is not public and which newcomers, independent researchers, and people working at small companies generally don't have access to.

Imagine if Eunomia had a function you could point at a CDM, and optionally a generated cohort in the CDM, that would output a Synthea JSON spec representing the state transitions (codes and their probabilities) that Synthea could use to generate synthetic data matching that patient population and CDM database. Doesn't seem impossible.

So basically I'd like to write a function to build these Synthea modules in a data-driven way...
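
Purely as a hypothetical sketch of the shape such a function could take (none of this exists yet; the real Synthea module schema has more required fields, and the transition model here is grossly simplified):

library(DBI)
library(jsonlite)

# hypothetical: estimate per-concept probabilities from a CDM and emit a
# (very simplified) Synthea-style module as JSON
train_synthea <- function(con) {
  transitions <- dbGetQuery(con, "
    SELECT condition_concept_id AS concept_id,
           count(*) * 1.0 / (SELECT count(*) FROM condition_occurrence) AS probability
    FROM condition_occurrence
    GROUP BY condition_concept_id")

  module <- list(
    name = "cdm_derived_module",
    states = list(
      Initial = list(
        type = "Initial",
        distributed_transition = lapply(seq_len(nrow(transitions)), function(i) {
          list(transition   = paste0("concept_", transitions$concept_id[i]),
               distribution = transitions$probability[i])
        })
      )
    )
  )
  toJSON(module, auto_unbox = TRUE, pretty = TRUE)
}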

[image attachment]

fdefalco commented 1 year ago

> I'm not really sure of all the implications of including a vocab with just the subset of concepts that occur in the CDM data. Certainly if you do this you won't get correct answers to queries like "What are all the descendants of concept X?" so it won't be a good tool to train people to use and experiment with the vocab tables.

Actually, for the sample data that I had created, the full hierarchy for concepts that occur in the event tables was included. But yes, if you attempted that type of query on the sample data set for a concept not represented, you would get no results.

> One exercise we did last week was to build a concept set for a clinical idea and I think for that exercise it's helpful to have a complete vocabulary even if those concepts don't all occur in the data.

Isn't the demo ATLAS a good tool for building a concept set? Or were you focusing on how to write SQL to perform this task?

> This reminds me of a similar discussion I've had: if a CDM is missing tables or fields, should it still be considered a CDM? Why not leave off the irrelevant tables? In practice, I think that in the wild, CDM tables or columns can be entirely missing, and the user has to adapt and decide whether it is still a "valid CDM".

I think this conflates two different ideas. It is one thing to leave tables and/or columns out of the CDM which would modify the database's structure. It is something different to limit the content in the tables.

> An incomplete CDM is a very strange CDM to use because you don't really know what you're missing. "Is my query/test giving an unexpected result because my vocab is incomplete?" is a constant question to consider. Even after using Eunomia for years, I only recently realized, when it caused a test to fail, that I could not rely on concept.domain_id in Eunomia because it is not populated.

Also, at this point I think we need to clarify which data set you are describing, as Eunomia is no longer a single data set but a utility to pull a particular data set. So if one of the sample data sets does not have a populated concept.domain_id, we should fix that as an issue.

> Eunomia only supports very specific cohorts.

Again, this will vary by the data set you are pulling.

> Imagine if Eunomia had a function you could point at a CDM, and optionally a generated cohort in the CDM, that would output a Synthea JSON spec representing the state transitions (codes and their probabilities) that Synthea could use to generate synthetic data matching that patient population and CDM database. Doesn't seem impossible.

This is a very interesting idea, and one that I have discussed with several people over the years. There are companies with proprietary technology to create synthetic data from real data with some level of statistical compatibility. I have written some code that starts to do what you are describing, generating Synthea modules from CDM-derived information, and I think there is a lot of potential there. However, I do not think it should be part of Eunomia; Eunomia should remain a way to access sample data sets. Developing methods that take CDM data and output Synthea modules is a project I would definitely like to be a part of if that is the goal.

One question this raises, though, concerns the underlying data set you would use to 'train' your module: at what point do you break legal/license obligations by publishing data? For example, if I take a licensed data set, randomize one variable, and then publish it, I would be in breach of contract. Clearly these are very different things, but it is a consideration that must be explored to determine when a process would result in a 'digital twin' representation of a database and be akin to publishing data that is otherwise unauthorized to be published.

I think an interim step that could be taken instead is to increase concept representation in the Synthea-generated data by writing additional modules or authoring changes to existing modules. For example, we can see from the PNAS publication below that the set of drugs used to treat hypertension is much broader than what the modules currently generate.

https://www.pnas.org/doi/10.1073/pnas.1510502113

Improving modules so they more accurately reflect even published findings would be a valuable contribution.

ablack3 commented 1 year ago

Thanks for that feedback @fdefalco!

In my examples, when I refer to "the Eunomia dataset" I'm referring to the GiBleed dataset, which is currently the only dataset provided by the latest release of the Eunomia package. It looks like concept.domain_id is there, so I must have made another mistake in my code somewhere.

The issue is one of closure. The OMOP vocabulary is a closed set of relationships, meaning that I can move around through the relationships and stay inside the set of concepts. So perhaps the vocab in the GiBleed Eunomia dataset is closed, but I think a lot of relationships between concepts are not represented.

(Note: this post was updated when I found an error in my code)

We can move the discussion about the train_synthea(cdm) -> json::synthea_module function to another thread. I'll invite you to an in-progress project, https://github.com/ablack3/SyntheaCdmFactory, which is based largely on your work on ETL-Synthea. I'd love to collaborate on this. I just run the ETL in duckdb instead of postgres, which makes it easier.

ablack3 commented 1 year ago

Here I tried (in a short amount of time) to find all concepts referenced in the GiBleed Eunomia vocabulary tables and then check whether they were in the concept table. Maybe I made a mistake, but it looks to me like there are many concepts referenced in the GiBleed vocabulary that are not in the concept table. I think creating a "closed" subset of the vocabulary is not a trivial task, and you need to make decisions about which relationships to include and which not to. To me it's much easier just to provide the option to use the whole vocabulary as it is released. Plus then you can run Patrick's SQL queries that create cool messages.

The goal here is to make the OMOP onramp easier. Of course anyone in the community can set up Postgres, apply for a UMLS key, request a vocab download, download the bundle, run the CPT script, and upload the vocab tables (possibly dealing with weird errors having to do with the large table sizes, character encodings, or CSV parsing issues due to how quotes are used, etc.). I've encountered all of that. But why have newcomers go through all that if they don't have to?

cd <- Eunomia::getEunomiaConnectionDetails()

con <- DatabaseConnector::connect(cd)
#> Connecting using SQLite driver

DBI::dbListTables(con)
#>  [1] "care_site"             "cdm_source"            "cohort"               
#>  [4] "cohort_attribute"      "concept"               "concept_ancestor"     
#>  [7] "concept_class"         "concept_relationship"  "concept_synonym"      
#> [10] "condition_era"         "condition_occurrence"  "cost"                 
#> [13] "death"                 "device_exposure"       "domain"               
#> [16] "dose_era"              "drug_era"              "drug_exposure"        
#> [19] "drug_strength"         "fact_relationship"     "location"             
#> [22] "measurement"           "metadata"              "note"                 
#> [25] "note_nlp"              "observation"           "observation_period"   
#> [28] "payer_plan_period"     "person"                "procedure_occurrence" 
#> [31] "provider"              "relationship"          "source_to_concept_map"
#> [34] "specimen"              "visit_detail"          "visit_occurrence"     
#> [37] "vocabulary"
library(DatabaseConnector)

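# Gather every concept_id referenced by concept_ancestor, concept_relationship,
# and vocabulary, then left join to concept to see which ids are missing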
df <- querySql(con, 
  "
  with cte as(
    select ancestor_concept_id as id from concept_ancestor
    union
    select descendant_concept_id as id from concept_ancestor
    union
    select concept_id_1 as id from concept_relationship
    union 
    select concept_id_2 as id from concept_relationship
    union
    select vocabulary_concept_id as id from vocabulary
  ) 
  select distinct id, concept_id from cte a
  left join concept b on a.id = b.concept_id
")

head(df, 20)
#>       ID CONCEPT_ID
#> 1      0          0
#> 2    204         NA
#> 3    231         NA
#> 4    232         NA
#> 5    236         NA
#> 6    238         NA
#> 7    243         NA
#> 8    245         NA
#> 9    252         NA
#> 10  5029         NA
#> 11  5045         NA
#> 12  5046         NA
#> 13  5047         NA
#> 14  9201       9201
#> 15  9202       9202
#> 16  9203       9203
#> 17 24818         NA
#> 18 25297         NA
#> 19 28060      28060
#> 20 30753      30753
dplyr::count(df, is.na(CONCEPT_ID))
#>   is.na(CONCEPT_ID)     n
#> 1             FALSE   442
#> 2              TRUE 39418
disconnect(con)

Created on 2023-06-29 with reprex v2.0.2

ablack3 commented 1 year ago

> Isn't the demo ATLAS a good tool for building a concept set? Or were you focusing on how to write SQL to perform this task?

Atlas is a great tool for building concept sets. But yes, our workshop was focused on doing the same thing from R using the Darwin CodelistGenerator package, which writes the SQL for you and has various options for searching the vocab. The nice thing about this is that you can fairly easily take that code list and convert it to a generated cohort on your data using Capr. Of course Atlas makes this easy as well, but I met several people working at organizations that have OMOP data but don't have Atlas, because it's actually not very easy to set up. I have personal experience with it taking more than two years to deploy Atlas, and it sounds like that experience isn't unique. Lots of hurdles, not just technical.
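
For context, the rough shape of that R workflow against one of the duckdb test CDMs looks something like the following. The function names and arguments are from memory and may not match the current CDMConnector/CodelistGenerator APIs exactly, so check their docs.

library(DBI)
library(duckdb)
library(CDMConnector)
library(CodelistGenerator)

# connect to the GiBleed example CDM shipped with CDMConnector
con <- dbConnect(duckdb(), eunomia_dir())
cdm <- cdm_from_con(con, cdm_schema = "main")

# search the vocab tables for candidate codes for a clinical idea
gi_bleed_codes <- getCandidateCodes(
  cdm      = cdm,
  keywords = "gastrointestinal hemorrhage",
  domains  = "Condition"
)

# the resulting code list can then be turned into a cohort definition with Capr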

fdefalco commented 1 year ago

Part of the limitation on the Eunomia vocabulary was based on size: we pruned until we got under the 5 MB limit enforced by CRAN. The other difference you might be seeing is based on the version of the vocabulary, as that was created years ago.

To be clear, I'm not opposed to providing an easier way to obtain a copy of the vocabulary. Somewhere along the way I understood we were not allowed to openly distribute the entire vocabulary. If @cgreich can approve OHDSI's ability to openly distribute all of the vocabulary, then having it available as parquet files as a dataset in a vocabulary repository would be great.

ablack3 commented 1 year ago

I'm going to experiment with Parquet files in CDMConnector and report back how it goes. Thanks for the discussion everyone 🚀 🌔