OHDSI / dbt-synthea

[Under development] A dbt ETL project to convert a Synthea synthetic data set into the OMOP CDM
https://ohdsi.github.io/dbt-synthea/
Apache License 2.0

Synthea to OMOP transformation using dbt #4

Closed: vvcb closed this 5 months ago

vvcb commented 9 months ago

This PR includes a working implementation of the Synthea-to-OMOP transformation using dbt, replicating the workflow in https://github.com/OHDSI/ETL-Synthea. (The one exception is drug era, which runs extremely slowly, most likely due to missing indices.)

dbt is a transformation tool and expects that the extract and load parts of the ELT have already been completed. While the repo includes the Synthea1k dataset as seed files, users are strongly discouraged from relying on these; instead, they should download their Synthea dataset of choice and the Athena vocabulary and load them into their data warehouse.

This implementation assumes that this data is already in the database and provides easily configurable sources.yml files to point dbt at it.
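As a rough sketch (the source, table, and column names here are illustrative, not the exact project layout), a staging model can point at such a source like this:

```sql
-- models/staging/stg_synthea__patients.sql (hypothetical path and names)
-- Reads from a source declared in sources.yml, e.g. a `synthea` source
-- pointing at the schema where the raw Synthea CSVs were loaded.
select
    id as patient_id,
    birthdate,
    gender
from {{ source('synthea', 'patients') }}
```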

In addition to the dbt models, this PR also includes the following 'features':

Development was made considerably easier by the vscode-dbt-power-user extension, which provides real-time lineage (including column-level lineage) as well as execution of models against the target warehouse.

ToDo:

@katy-sadowski, FYI.

katy-sadowski commented 9 months ago

@vvcb wow, thank you for the work you put in here!

As I recommended in my other comment, please do email me at sadowski@ohdsi.org if you'd like to get involved in our project group. We haven't even had our first meeting yet (but we're working on scheduling time!) so we are in the very early stages here.

I see that you've essentially replicated the ETL-Synthea queries verbatim in dbt models, which is certainly one approach to rebuilding ETL-Synthea in dbt; however, my vision for this project is to take a bit more of a "first principles" approach where we consider the optimal modelling strategy that also leverages the full power of dbt and its breadth of features. All with an eye to building something generalizable across sources and database systems.

A note regarding the seeds: these are absolutely not intended to be part of the final project. The intention is to use these files solely for collaborating on development of this project, so we can all be sure we're using the exact same dataset, and so that the dataset can be version-controlled if we need to change it to improve our development workflows. This approach was inspired by dbt's Jaffle Shop tutorial, which stores its raw data as seeds: https://github.com/dbt-labs/jaffle_shop?tab=readme-ov-file#whats-in-this-repo
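For anyone new to dbt: seeds are just CSV files checked into the repo that dbt seed loads into the warehouse, after which models reference them like any other relation. A minimal hypothetical sketch (the seed name is made up for illustration):

```sql
-- Hypothetical model reading from a seed; `synthea_patients` stands in
-- for a CSV under seeds/ that has been loaded via `dbt seed`.
select count(*) as n_patients
from {{ ref('synthea_patients') }}
```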

A note regarding cross-DB support: We are starting with Postgres because it's easy for all of us to download and work with the same database system as we come up with the initial implementation. Once we've got this up and running, we'll have a whole phase of the project dedicated to cross-DB support (leveraging macros, other packages, etc.). We have access to testing databases via OHDSI that will allow us to cover most, if not all, OHDSI-supported DBMSs.
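As a sketch of what that macro-based approach can look like: dbt's adapter.dispatch lets a model call one macro and get a database-specific implementation at compile time. The macro below is hypothetical, not part of this PR:

```sql
{# Hypothetical cross-DB macro using adapter.dispatch. The default__
   version covers engines with datediff(); Postgres gets its own. #}
{% macro days_between(start_col, end_col) %}
    {{ return(adapter.dispatch('days_between')(start_col, end_col)) }}
{% endmacro %}

{% macro default__days_between(start_col, end_col) %}
    datediff('day', {{ start_col }}, {{ end_col }})
{% endmacro %}

{% macro postgres__days_between(start_col, end_col) %}
    ({{ end_col }}::date - {{ start_col }}::date)
{% endmacro %}
```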

I really do appreciate what you've done here and would love if you could come share your experience with our group as we get started. I know we are going to learn a ton from you if you have the time and interest to join us πŸ˜„

vvcb commented 9 months ago

Thank you @katy-sadowski for taking a look at this. I agree with all your comments, especially about going back to first principles. I took the reverse-engineering route, which I found quicker (lazier), but it also let me quickly build the DAG and understand how it all works before I find another Saturday to start taking things apart and rebuilding: reusing models, parameterising better, etc.

This only took a day. I will drop you an email regarding future meetings.

katy-sadowski commented 5 months ago

I refactored the code to connect to the newly added stg models, renamed some models to "int", updated column references, and added vocabulary seeds for a minimal vocab.
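To illustrate the stg/int layering (the model and column names here are hypothetical, not necessarily the ones in this commit): stg models clean and rename raw columns, and int models build on them.

```sql
-- Hypothetical intermediate model built on a staging model.
with patients as (
    select * from {{ ref('stg_synthea__patients') }}
)
select
    patient_id,
    birthdate
from patients
```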

burrowse commented 5 months ago

@katy-sadowski @vvcb This is awesome work! I will try to review this week!

While I had it handy, I wanted to link something Martijn put together for the HADES vocabulary tutorial at the OHDSI EU symposium, which filters the vocabulary based on a converted Synthea dataset: https://github.com/OHDSI/Tutorial-Hades/blob/main/extras/FilterVocabulary.R. We could potentially refactor it to reference the concepts that are present in the Synthea source data.
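The linked script is R, but the core idea can be sketched in SQL. Table and column names below are standard OMOP CDM ones, and the query assumes a populated CDM; a real filter would union in all the other event tables:

```sql
-- Hypothetical sketch: keep only vocabulary rows for concepts that
-- actually appear in a converted Synthea CDM (condition_occurrence
-- shown here as one example source of used concept ids).
create table concept_filtered as
select c.*
from concept c
where c.concept_id in (
    select condition_concept_id from condition_occurrence
    union
    select condition_source_concept_id from condition_occurrence
);
```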

katy-sadowski commented 5 months ago

> @katy-sadowski @vvcb This is awesome work! I will try to review this week!

Thank you @burrowse !

> While I had it handy, I wanted to link something Martijn put together for the HADES vocabulary tutorial at the OHDSI EU symposium, which filters the vocabulary based on a converted Synthea dataset: https://github.com/OHDSI/Tutorial-Hades/blob/main/extras/FilterVocabulary.R. We could potentially refactor it to reference the concepts that are present in the Synthea source data.

This is fantastic; thanks for sharing. I will try regenerating the seeds using this (against the full copy of the vocab I have downloaded). I will also add the setup steps @vvcb has in sqlmesh_synthea for using the full vocab. I think it's useful to provide users with both options.

katy-sadowski commented 5 months ago

I just committed the new vocab shards generated with that script as seeds. Thanks so much for sharing that, @burrowse! dbt run now populates the CDM tables as expected 😄 I also included the Python scripts I ran to generate the seeds.

This change also adds support for duckdb. Thanks @vvcb for the suggestion, and for the inspiration from your SQLMesh repo. There's now a quickstart mode in duckdb that should make it super easy for people to bring their own Synthea dataset if they choose. I added create-table scripts for Postgres too, but for now the actual load step is left to the user.
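For reference, a bring-your-own-dataset load in duckdb can be as simple as the sketch below; the path and schema name are placeholders, not the project's actual quickstart commands:

```sql
-- Hypothetical duckdb load of a raw Synthea CSV; duckdb infers the
-- column types from the file via read_csv_auto.
create schema if not exists synthea;
create or replace table synthea.patients as
select * from read_csv_auto('path/to/synthea/csv/patients.csv', header = true);
```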