OHDSI / dbt-synthea

[Under development] A dbt ETL project to convert a Synthea synthetic data set into the OMOP CDM
https://ohdsi.github.io/dbt-synthea/
Apache License 2.0
15 stars 6 forks source link

[Under developement] dbt-synthea

The purpose of this project is to re-create the Synthea-->OMOP ETL implemented in https://github.com/OHDSI/ETL-Synthea using dbt.

The project is currently under development and is not yet ready for production use.

OHDSI Symposium 2024 Abstract

Who Is This Project For?

We built dbt-synthea to demonstrate the power of dbt for building OMOP ETLs. ETL developers will benefit from this project because they can use it as inspiration for their own dbt-based OMOP ETL. Analysts will appreciate the transparency of dbt's SQL-based ETL code and utilize this project to better understand how source data may be mapped into the OMOP CDM. Software developers are welcome to use this project to transform a Synthea data of their choosing into OMOP to use for testing.

...and this is just the beginning! We hope someday to grow the OHDSI dbt ecosystem to include a generic dbt project template and source-specific macros and models. Stay tuned and please reach out if you are interested in contributing.

Developer Setup

Currently this project is set up to run an OMOP ETL into either duckdb or Postgres. Setup instructions for each are provided below.

By default, the project will source the Synthea and OMOP vocabulary data from seed files stored in this repository. The seed Synthea dataset is a small dataset of 27 patients and their associated clinical data. The vocabulary seeds contain a subset of the OMOP vocabulary limited only to the concepts relevant to this specific Synthea dataset.

Users are welcomed, however, to utilize their own Synthea and/or OMOP vocabulary tables as sources. Instructions for the "BYO data" setup are provided below.

Prerequisites

Repo Setup

  1. Clone this repository to your machine
  2. cd into the repo directory and set up a virtual environment:
    python3 -m venv dbt-env
    • If you are using VS Code, create a .env file in the root of your repo workspace (touch .env) and add a PYTHONPATH entry for your virtual env (for example, if you cloned your repo in your computer's home directory, the entry will read as: PYTHONPATH="~/dbt-synthea/dbt-env/bin/python")
    • Now, in VS Code, once you set this virtualenv as your preferred interpreter for the project, the vscode config in the repo will automatically source this env each time you open a new terminal in the project. Otherwise, each time you open a new terminal to use dbt for this project, run:
      source dbt-env/bin/activate         # activate the environment for Mac and Linux OR
      dbt-env\Scripts\activate            # activate the environment for Windows

DuckDB Setup

  1. In your virtual environment install requirements for duckdb (see here for contents)

    pip3 install -r requirements/duckdb.txt
    pre-commit install
  2. Set up your profiles.yml file:

    • Create a directory .dbt in your root directory if one doesn't exist already, then create a profiles.yml file in .dbt
    • Add the following block to the file:
      synthea_omop_etl:
      outputs:
      dev:
      type: duckdb
      path: synthea_omop_etl.duckdb
      schema: dbt_synthea_dev
      target: dev
  3. Ensure your profile is setup correctly using dbt debug:

    dbt debug
  4. Load dbt dependencies:

    dbt deps
  5. If you'd like to run the default ETL using the pre-seeded Synthea dataset, run dbt seed to load the CSVs with the Synthea dataset and vocabulary data. This materializes the seed CSVs as tables in your target schema (vocab) and a _synthea schema (Synthea tables). Then, skip to step 9 below.

    dbt seed
  6. If you'd like to run the ETL on your own Synthea dataset, first toggle the seed_source variable in dbt_project.yml to false. This will tell dbt not to look for the source data in the seed schemas.

  7. [BYO DATA ONLY] Load your Synthea and Vocabulary data into the database by running the following commands (modify the commands as needed to specify the path to the folder storing the Synthea and vocabulary csv files, respectively). The vocabulary tables will be created in the target schema specified in your profiles.yml for the profile you are targeting. The Synthea tables will be created in a schema named "_synthea". NOTE only Synthea v3.0.0 is supported at this time.

    file_dict=$(python3 scripts/python/get_csv_filepaths.py path/to/synthea/csvs)
    dbt run-operation load_data_duckdb --args "{file_dict: $file_dict, vocab_tables: false}"
    file_dict=$(python3 scripts/python/get_csv_filepaths.py path/to/vocab/csvs)
    dbt run-operation load_data_duckdb --args "{file_dict: $file_dict, vocab_tables: true}"
  8. Seed the location mapper and currently unused empty OMOP tables:

    dbt seed --select states omop
  9. Build the OMOP tables:

    dbt build
    # or `dbt run`, `dbt test`

Postgres Setup

  1. In your virtual environment install requirements for Postgres (see here for contents)

    pip3 install -r requirements/postgres.txt
    pre-commit install
  2. Set up a local Postgres database with a dedicated schema for developing this project (e.g. dbt_synthea_dev)

  3. Set up your profiles.yml file:

    • Create a directory .dbt in your root directory if one doesn't exist already, then create a profiles.yml file in .dbt
    • Add the following block to the file:
      synthea_omop_etl:
      outputs:
      dev:
      dbname: <name of local postgres DB>
      host: <host, e.g. localhost>
      user: <username>
      password: <password>
      port: 5432
      schema: dbt_synthea_dev
      threads: 4 # See https://docs.getdbt.com/docs/running-a-dbt-project/using-threads for more details
      type: postgres
      target: dev
  4. Ensure your profile is setup correctly using dbt debug:

    dbt debug
  5. Load dbt dependencies:

    dbt deps
  6. If you'd like to run the default ETL using the pre-seeded Synthea dataset, run dbt seed to load the CSVs with the Synthea dataset and vocabulary data. This materializes the seed CSVs as tables in your target schema (vocab) and a _synthea schema (Synthea tables). Then, skip to step 10 below.

    dbt seed
  7. If you'd like to run the ETL on your own Synthea dataset, first toggle the seed_source variable in dbt_project.yml to false. This will tell dbt not to look for the source data in the seed schemas.

  8. [BYO DATA ONLY] Create the empty vocabulary and Synthea tables by running the following commands. The vocabulary tables will be created in the target schema specified in your profiles.yml for the profile you are targeting. The Synthea tables will be created in a schema named "_synthea".

    dbt run-operation create_vocab_tables
    dbt run-operation create_synthea_tables
  9. [BYO DATA ONLY] Use the technology/package of your choice to load the OMOP vocabulary and raw Synthea files into these newly-created tables. NOTE only Synthea v3.0.0 is supported at this time.

  10. Seed the location mapper and currently unused empty OMOP tables:

    dbt seed --select states omop
  11. Build the OMOP tables:

    dbt build
    # or `dbt run`, `dbt test`