OHDSI / dbt-synthea

[Under development] A dbt ETL project to convert a Synthea synthetic data set into the OMOP CDM
https://ohdsi.github.io/dbt-synthea/
Apache License 2.0

[NOTES] Experiences with dbt-synthea for Big Patient Datasets #91

Open TheCedarPrince opened 4 weeks ago

TheCedarPrince commented 4 weeks ago

Hey @katy-sadowski,

I ended up opening an issue as Discussions is not enabled for this repo.

Wanted to share some notes on my experience using this solution for 1 million patients, each with a 5-year retrospective lookback:

I am going to reproduce these errors tomorrow and actually generate 1 million patients from scratch using Synthea and re-run the pipeline to find all the tables that fail to be built.

Hope this helps folks!

~ tcp :deciduous_tree:

katy-sadowski commented 4 weeks ago

Fantastic, thanks @TheCedarPrince ! I think I'll try it on my end as well so we have 2x OS & laptop worth of data on performance :)

lawrenceadams commented 4 weeks ago

Nice work @TheCedarPrince ! I was playing around with this in Snowflake over the weekend and generated a million patients, and started running it on there to see how it would benchmark compared to duckdb - keen to hear how you get on!

TheCedarPrince commented 4 weeks ago

Question for folks:

Can I run dbt-synthea on "chunks" of data I have manually split up? (Meaning, I have folders like csvs_1, csvs_2, etc., and each folder holds about 200K patients.) Could I then run dbt-synthea against each of these folders without rebuilding the prior DuckDB database? I was wondering, since I could then chunk out the ETL simply this way.

TheCedarPrince commented 4 weeks ago

Also, @katy-sadowski , I was going to tinker a little later with getting hyperfine to work with this process for benchmarking purposes. It's considered a state-of-the-art benchmarking tool and is used a ton in numerical and HPC environments -- especially in areas like high-energy physics. I ran into some problems yesterday when I tried to naively run it, so there might be some system environment quirks.

lawrenceadams commented 4 weeks ago

Can I run dbt-synthea on "chunks" of data I have manually split up? (Meaning, I have folders like csvs_1, csvs_2, etc., and each folder holds about 200K patients.) Could I then run dbt-synthea against each of these folders without rebuilding the prior DuckDB database? I was wondering, since I could then chunk out the ETL simply this way.

Yeah you definitely can with duckdb with next to no effort - you'd have to change the sources to glob like this!
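For concreteness, here is a minimal sketch of what that globbing looks like in raw DuckDB SQL, assuming the chunked folder layout described above (the csvs_1, csvs_2, ... names are hypothetical; DuckDB's read_csv_auto accepts glob patterns):

```sql
-- Sketch only: DuckDB can scan CSVs spread across manually split folders in one go.
-- The csvs_*/ layout is the hypothetical chunking described above.
select count(*) as n_patients
from read_csv_auto('csvs_*/patients.csv', union_by_name = true);
```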

TheCedarPrince commented 4 weeks ago

Interesting -- what files would need to be changed to accommodate this in dbt-synthea? I can now load the vocab fine per #90, and the loading of data is mostly fine, but the problem I have, as mentioned, is:

[ETLing] this data in the dbt run step fails on large tables in the OMOP schema -- especially those tables which have a large lineage.

So I would want to chunk the dbt run step, rather than the loading. Or do you see the load step as helping with the ETL step, @lawrenceadams?

lawrenceadams commented 4 weeks ago

So there are two approaches:

  1. we could either modify the loader script to glob files, or...
  2. more simply - given duckdb doesn't need files loaded before it can use them (@katy-sadowski might correct me, but my understanding is that we actively load files into the pipeline only to keep the sources.yml files identical between database systems, e.g. Postgres/DuckDB/etc. can use the same code), you could modify a sources.yml file to incorporate something like the below:

I would modify the sources file below

https://github.com/OHDSI/dbt-synthea/blob/4c34141f301ad30601bd7c373bab3912da2c26f6/models/staging/synthea/_synthea__sources.yml#L1-L24

to:

```yaml
version: 2

sources:
  - name: synthea
    meta:
      external_location: "/path/to/synthea/output/{name}/*.csv"
    tables:
      - name: allergies
      - name: careplans
      - name: claims_transactions
      - name: claims
      - name: conditions
      - name: devices
      - name: encounters
      - name: imaging_studies
      - name: immunizations
      - name: medications
      - name: observations
      - name: organizations
      - name: patients
      - name: payer_transitions
      - name: payers
      - name: procedures
      - name: providers
      - name: supplies
```

and it should work without having to have an explicit load step πŸš€

Does that make sense? It blew my mind the first time I realised I could do this! Makes working with partitioned CSV/Parquet files so much easier πŸ˜„
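As a rough illustration of why no load step is needed (a sketch, not the exact SQL that dbt-duckdb renders): DuckDB can treat a glob of CSV files as a table directly, which is effectively what the external_location meta points the source at.

```sql
-- Sketch: DuckDB querying a folder of CSVs in place, no COPY/load required.
-- The path mirrors the illustrative external_location above.
select *
from '/path/to/synthea/output/patients/*.csv'
limit 10;
```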

lawrenceadams commented 3 weeks ago

I tried loading in the chonkiest of tables (claims transactions) with success - although this took a fair amount of time on a high end Mac...

TheCedarPrince commented 3 weeks ago

Latest updates @lawrenceadams and @katy-sadowski :

I generated about one million synthetic patients (with some dead patients included) each with 1 year lookback using Synthea (see my synthea.properties file below):

Click to expand to see `synthea.properties`

```
# Starting with a properties file because it requires no additional dependencies
exporter.baseDirectory = ./output/
exporter.use_uuid_filenames = false
exporter.subfolders_by_id_substring = false
# number of years of history to keep in exported records, anything older than this may be filtered out
# set years_of_history = 0 to skip filtering altogether and keep the entire history
exporter.years_of_history = 1
# split records allows patients to have one record per provider organization
exporter.split_records = false
exporter.split_records.duplicate_data = false
exporter.metadata.export = true
exporter.ccda.export = false
exporter.fhir.export = false
exporter.fhir_stu3.export = false
exporter.fhir_dstu2.export = false
exporter.fhir.use_shr_extensions = false
exporter.fhir.use_us_core_ig = true
exporter.fhir.transaction_bundle = true
exporter.fhir.bulk_data = false
exporter.groups.fhir.export = false
exporter.hospital.fhir.export = true
exporter.hospital.fhir_stu3.export = false
exporter.hospital.fhir_dstu2.export = false
exporter.practitioner.fhir.export = true
exporter.practitioner.fhir_stu3.export = false
exporter.practitioner.fhir_dstu2.export = false
exporter.encoding = UTF-8
exporter.csv.export = true
# if exporter.csv.append_mode = true, then each run will add new data to any existing CSVs. if false, each run will clear out the files and start fresh
exporter.csv.append_mode = false
# if exporter.csv.folder_per_run = true, then each run will have CSVs placed into a unique subfolder. if false, each run will only use the top-level csv folder
exporter.csv.folder_per_run = false
# included_files and excluded_files list out the files to include/exclude in the csv exporter
# only one of these may be set at a time, if both are set then both will be ignored
# if neither is set, then all files will be included
# see list of files at: https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary
# include filenames separated with a comma, ex: patients.csv,procedures.csv,medications.csv
# NOTE: the csv exporter does not actively delete files, so if Run 1 you included a file, then Run 2 you exclude that file, the version from Run 1 will still be present
exporter.csv.included_files =
exporter.csv.excluded_files =
exporter.cpcds.export = false
exporter.cpcds.append_mode = false
exporter.cpcds.folder_per_run = false
exporter.cpcds.single_payer = false
exporter.cdw.export = false
exporter.text.export = false
exporter.text.per_encounter_export = false
exporter.clinical_note.export = false
# parameters for symptoms export
exporter.symptoms.csv.export = false
# selection mode of conditions or symptom export: 0 = conditions according to exporter.years_of_history. other values = all conditions (entire history)
exporter.symptoms.mode = 0
# if exporter.symptoms.csv.append_mode = true, then each run will add new data to any existing CSVs. if false, each run will clear out the files and start fresh
exporter.symptoms.csv.append_mode = false
# if exporter.symptoms.csv.folder_per_run = true, then each run will have CSVs placed into a unique subfolder. if false, each run will only use the top-level csv folder
exporter.symptoms.csv.folder_per_run = false
exporter.symptoms.text.export = false
# the number of patients to generate, by default
# this can be overridden by passing a different value to the Generator constructor
generate.default_population = 1
generate.log_patients.detail = simple
# options are "none", "simple", or "detailed" (without quotes). defaults to simple if another value is used
# none = print nothing to the console during generation
# simple = print patient names once they are generated.
# detailed = print patient names, atributes, vital signs, etc.. May slow down processing
generate.timestep = 604800000
# time is in ms
# 1000 * 60 * 60 * 24 * 7 = 604800000
# default demographics is every city in the US
generate.demographics.default_file = geography/demographics.csv
generate.geography.zipcodes.default_file = geography/zipcodes.csv
generate.geography.country_code = US
generate.geography.timezones.default_file = geography/timezones.csv
generate.geography.foreign.birthplace.default_file = geography/foreign_birthplace.json
generate.geography.sdoh.default_file = geography/sdoh.csv
# Lookup Table Folder location
generate.lookup_tables = modules/lookup_tables/
# Set to true if you want every patient to be dead.
generate.only_dead_patients = false
# Set to true if you want every patient to be alive.
generate.only_alive_patients = false
# If both only_dead_patients and only_alive_patients are set to true,
# It they will both default back to false
# if criteria are provided, (for example, only_dead_patients, only_alive_patients, or a "patient keep module" with -k flag)
# this is the maximum number of times synthea will loop over a single slot attempting to produce a matching patient.
# after this many failed attempts, it will throw an exception.
# set this to 0 to allow for unlimited attempts (but be aware of the possibility that it will never complete!)
generate.max_attempts_to_keep_patient = 1000
# if true, tracks and prints out details of transition tables for each module upon completion
# note that this may significantly slow down processing, and is intended primarily for debugging
generate.track_detailed_transition_metrics = false
# If true, person names have numbers appended to them to make them more obviously fake
generate.append_numbers_to_person_names = true
# if true, the entire population will use veteran prevalence data
generate.veteran_population_override = false
# these should add up to 1.0
# weighting and categories are inspired by the following but there are no specific hard numbers to point to
# http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1694190/pdf/amjph00543-0042.pdf
# http://www.ncbi.nlm.nih.gov/pubmed/8122813
generate.demographics.socioeconomic.weights.income = 0.2
generate.demographics.socioeconomic.weights.education = 0.7
generate.demographics.socioeconomic.weights.occupation = 0.1
generate.demographics.socioeconomic.score.low = 0.0
generate.demographics.socioeconomic.score.middle = 0.25
generate.demographics.socioeconomic.score.high = 0.66
generate.demographics.socioeconomic.education.less_than_hs.min = 0.0
generate.demographics.socioeconomic.education.less_than_hs.max = 0.5
generate.demographics.socioeconomic.education.hs_degree.min = 0.1
generate.demographics.socioeconomic.education.hs_degree.max = 0.75
generate.demographics.socioeconomic.education.some_college.min = 0.3
generate.demographics.socioeconomic.education.some_college.max = 0.85
generate.demographics.socioeconomic.education.bs_degree.min = 0.5
generate.demographics.socioeconomic.education.bs_degree.max = 1.0
generate.demographics.socioeconomic.income.poverty = 11000
generate.demographics.socioeconomic.income.high = 75000
generate.birthweights.default_file = birthweights.csv
generate.birthweights.logging = false
# in Massachusetts, the individual insurance mandate became law in 2006
# in the US, the Affordable Care Act become law in 2010,
# and individual and employer mandates took effect in 2014.
# mandate.year will determine when individuals with an occupation score above mandate.occupation
# receive employer mandated insurance (aka "private" insurance).
# prior to mandate.year, anyone with income greater than the annual cost of an insurance plan
# will purchase the insurance.
generate.insurance.mandate.year = 2006
generate.insurance.mandate.occupation = 0.2
# Default Costs, to be used for pricing something that we don't have a specific price for
# -- $500 for procedures is completely invented
generate.costs.default_procedure_cost = 1.23
# -- $255 for medications - also invented
generate.costs.default_medication_cost = 1.23
# -- Encounters billed using avg prices from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3096340/
# -- Adjustments for initial or subsequent hospital visit and level/complexity/time of encounter
# -- not included. Assume initial, low complexity encounter (Tables 4 & 6)
generate.costs.default_encounter_cost = 125.00
# -- https://www.nytimes.com/2014/07/03/health/Vaccine-Costs-Soaring-Paying-Till-It-Hurts.html
# -- currently all vaccines cost $136.
generate.costs.default_immunization_cost = 136.00
# Providers
generate.providers.hospitals.default_file = providers/hospitals.csv
generate.providers.longterm.default_file = providers/longterm.csv
generate.providers.nursing.default_file = providers/nursing.csv
generate.providers.rehab.default_file = providers/rehab.csv
generate.providers.hospice.default_file = providers/hospice.csv
generate.providers.dialysis.default_file = providers/dialysis.csv
generate.providers.homehealth.default_file = providers/home_health_agencies.csv
generate.providers.veterans.default_file = providers/va_facilities.csv
generate.providers.urgentcare.default_file = providers/urgent_care_facilities.csv
generate.providers.primarycare.default_file = providers/primary_care_facilities.csv
generate.providers.ihs.hospitals.default_file = providers/ihs_facilities.csv
generate.providers.ihs.primarycare.default_file = providers/ihs_centers.csv
# Provider selection behavior
# How patients select a provider organization:
# nearest - select the closest provider. See generate.providers.maximum_search_distance
# quality - select the best provider if quality is known. Otherwise nearest.
# random - select randomly.
# network - select the nearest provider in your insurance network. same as random except it changes every time the patient switches insurance provider.
generate.providers.selection_behavior = nearest
# if a provider cannot be found for a certain type of service,
# this will default to the nearest hospital.
generate.providers.default_to_hospital_on_failure = true
# minimum number of providers linked per patient
# if this number is not met it re-runs the simulation
generate.providers.minimum = 1
# maximum distance to look for a provider for a given patient, in km
# set to 10 degrees lat/lon to support the model that veterans only seek care at VA facilities
generate.providers.maximum_search_distance = 32
# Payers
generate.payers.insurance_companies.default_file = payers/insurance_companies.csv
generate.payers.insurance_companies.medicare = Medicare
generate.payers.insurance_companies.medicaid = Medicaid
generate.payers.insurance_companies.dual_eligible = Dual Eligible
# Payer selection behavior
# How patients select a payer:
# best_rates - select insurance with best rates for person's existing conditions and medical needs
# random - select randomly.
generate.payers.selection_behavior = random
# Payer adjustment behavior
# How payers adjust claims:
# none - the payer reimburses each claim by the full amount.
# fixed - the payer adjusts each claim by a fixed rate (set by adjustment_rate)
# random - the payer adjusts each claim by a random rate (between zero and adjustment_rate).
generate.payers.adjustment_behavior = none
# Payer adjustment rate should be between zero and one (0.00 - 1.00), where 0.05 is 5%.
generate.payers.adjustment_rate = 0.10
# Experimental feature. Patients will miss care if true, but side-effects of missing that care
# are not handled. Additionally, the path the disease module might take may no longer make sense.
# It might assume things occurred that haven't actually happened it. Use with care.
generate.payers.loss_of_care = false
# Add a FHIR terminology service URL to enable the use of ValueSet URIs within code definitions.
# generate.terminology_service_url = https://r4.ontoserver.csiro.au/fhir
# Quit Smoking
lifecycle.quit_smoking.baseline = 0.01
lifecycle.quit_smoking.timestep_delta = -0.01
lifecycle.quit_smoking.smoking_duration_factor_per_year = 1.0
# Quit Alcoholism
lifecycle.quit_alcoholism.baseline = 0.001
lifecycle.quit_alcoholism.timestep_delta = -0.001
lifecycle.quit_alcoholism.alcoholism_duration_factor_per_year = 1.0
# Adherence
adherence.baseline = 0.05
# set this to true to enable randomized "death by natural causes"
# highly recommended if "only_dead_patients" is true
lifecycle.death_by_natural_causes = false
# set this to enable "death by loss of care" or missed care,
# e.g. not covered by insurance or otherwise unaffordable.
# only functional if "generate.payers.loss_of_care" is also true.
lifecycle.death_by_loss_of_care = false
# Use physiology simulations to generate some VitalSigns
physiology.generators.enabled = false
# Allow physiology module states to be executed
# If false, all Physiology state objects will immediately redirect to the state defined in
# the alt_direct_transition field
physiology.state.enabled = false
# set to true to introduce errors in height, weight and BMI observations for people
# under 20 years old
growtherrors = false
```

And then I did the following:

```sh
(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt deps
14:58:36 Running with dbt=1.8.7
14:58:37 Installing dbt-labs/dbt_utils
14:58:37 Installed from version 1.3.0
14:58:37 Up to date!
(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt run-operation load_data_duckdb --args "{file_dict: $file_dict, vocab_tables: false}"
14:59:23 Running with dbt=1.8.7
14:59:23 Registered adapter: duckdb=1.8.0
14:59:24 Found 85 models, 29 seeds, 425 data tests, 29 sources, 537 macros
```

Following this, I loaded the vocab files:

(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt run-operation load_data_duckdb --args "{file_dict: $file_dict, vocab_tables: true}"
15:12:50  Running with dbt=1.8.7
15:12:51  Registered adapter: duckdb=1.8.0
15:12:51  Found 85 models, 29 seeds, 425 data tests, 29 sources, 537 macros
(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt seed --select states omop
15:13:48  Running with dbt=1.8.7
15:13:48  Registered adapter: duckdb=1.8.0
15:13:49  Found 85 models, 29 seeds, 425 data tests, 29 sources, 537 macros
15:13:49
15:13:51  Concurrency: 1 threads (target='dev')
15:13:51
15:13:51  1 of 1 START seed file dbt_synthea_dev_map_seeds.states ........................ [RUN]
15:13:52  1 of 1 OK loaded seed file dbt_synthea_dev_map_seeds.states .................... [INSERT 51 in 1.71s]
15:13:52
15:13:52  Finished running 1 seed in 0 hours 0 minutes and 3.67 seconds (3.67s).
15:13:53
15:13:53  Completed successfully
15:13:53
15:13:53  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

And then I ran dbt run, but some tables failed -- I'm not sure what to make of the failures, or of the suggestions at the end of the run output:

(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt run
15:13:57  Running with dbt=1.8.7
15:13:57  Registered adapter: duckdb=1.8.0
15:13:58  Found 85 models, 29 seeds, 425 data tests, 29 sources, 537 macros
15:13:58
15:13:58  Concurrency: 1 threads (target='dev')
15:13:58
15:13:58  1 of 85 START sql table model dbt_synthea_dev.dose_era ......................... [RUN]
15:14:00  1 of 85 OK created sql table model dbt_synthea_dev.dose_era .................... [OK in 1.78s]
15:14:00  2 of 85 START sql table model dbt_synthea_dev.episode .......................... [RUN]
15:14:01  2 of 85 OK created sql table model dbt_synthea_dev.episode ..................... [OK in 1.66s]
15:14:01  3 of 85 START sql table model dbt_synthea_dev.episode_event .................... [RUN]
15:14:03  3 of 85 OK created sql table model dbt_synthea_dev.episode_event ............... [OK in 1.50s]
15:14:03  4 of 85 START sql table model dbt_synthea_dev.fact_relationship ................ [RUN]
15:14:04  4 of 85 OK created sql table model dbt_synthea_dev.fact_relationship ........... [OK in 1.23s]
15:14:04  5 of 85 START sql table model dbt_synthea_dev.metadata ......................... [RUN]
15:14:05  5 of 85 OK created sql table model dbt_synthea_dev.metadata .................... [OK in 1.21s]
15:14:05  6 of 85 START sql table model dbt_synthea_dev.note ............................. [RUN]
15:14:06  6 of 85 OK created sql table model dbt_synthea_dev.note ........................ [OK in 1.29s]
15:14:06  7 of 85 START sql table model dbt_synthea_dev.note_nlp ......................... [RUN]
15:14:08  7 of 85 OK created sql table model dbt_synthea_dev.note_nlp .................... [OK in 1.25s]
15:14:08  8 of 85 START sql table model dbt_synthea_dev.specimen ......................... [RUN]
15:14:09  8 of 85 OK created sql table model dbt_synthea_dev.specimen .................... [OK in 1.26s]
15:14:09  9 of 85 START sql table model dbt_synthea_dev.stg_map__states .................. [RUN]
15:14:10  9 of 85 OK created sql table model dbt_synthea_dev.stg_map__states ............. [OK in 1.23s]
15:14:10  10 of 85 START sql table model dbt_synthea_dev.stg_synthea__allergies .......... [RUN]
15:14:12  10 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__allergies ..... [OK in 1.62s]
15:14:12  11 of 85 START sql table model dbt_synthea_dev.stg_synthea__careplans .......... [RUN]
15:14:14  11 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__careplans ..... [OK in 2.10s]
15:14:14  12 of 85 START sql table model dbt_synthea_dev.stg_synthea__claims ............. [RUN]
15:15:24  12 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__claims ........ [OK in 69.77s]
15:15:24  13 of 85 START sql table model dbt_synthea_dev.stg_synthea__claims_transactions  [RUN]
15:18:07  13 of 85 ERROR creating sql table model dbt_synthea_dev.stg_synthea__claims_transactions  [ERROR in 163.53s]
15:18:07  14 of 85 START sql table model dbt_synthea_dev.stg_synthea__conditions ......... [RUN]
15:18:15  14 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__conditions .... [OK in 8.14s]
15:18:15  15 of 85 START sql table model dbt_synthea_dev.stg_synthea__devices ............ [RUN]
15:18:17  15 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__devices ....... [OK in 1.63s]
15:18:17  16 of 85 START sql table model dbt_synthea_dev.stg_synthea__encounters ......... [RUN]
15:18:33  16 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__encounters .... [OK in 16.23s]
15:18:33  17 of 85 START sql table model dbt_synthea_dev.stg_synthea__imaging_studies .... [RUN]
15:18:50  17 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__imaging_studies  [OK in 16.53s]
15:18:50  18 of 85 START sql table model dbt_synthea_dev.stg_synthea__immunizations ...... [RUN]
15:18:52  18 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__immunizations . [OK in 2.35s]
15:18:52  19 of 85 START sql table model dbt_synthea_dev.stg_synthea__medications ........ [RUN]
15:19:12  19 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__medications ... [OK in 19.96s]
15:19:12  20 of 85 START sql table model dbt_synthea_dev.stg_synthea__observations ....... [RUN]
15:19:33  20 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__observations .. [OK in 21.33s]
15:19:33  21 of 85 START sql table model dbt_synthea_dev.stg_synthea__organizations ...... [RUN]
15:19:35  21 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__organizations . [OK in 1.71s]
15:19:35  22 of 85 START sql table model dbt_synthea_dev.stg_synthea__patients ........... [RUN]
15:19:38  22 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__patients ...... [OK in 2.91s]
15:19:38  23 of 85 START sql table model dbt_synthea_dev.stg_synthea__payer_transitions .. [RUN]
15:19:54  23 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__payer_transitions  [OK in 15.85s]
15:19:54  24 of 85 START sql table model dbt_synthea_dev.stg_synthea__payers ............. [RUN]
15:19:56  24 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__payers ........ [OK in 1.95s]
15:19:56  25 of 85 START sql table model dbt_synthea_dev.stg_synthea__procedures ......... [RUN]
15:20:00  25 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__procedures .... [OK in 4.16s]
15:20:00  26 of 85 START sql table model dbt_synthea_dev.stg_synthea__providers .......... [RUN]
15:20:02  26 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__providers ..... [OK in 1.95s]
15:20:02  27 of 85 START sql table model dbt_synthea_dev.stg_synthea__supplies ........... [RUN]
15:20:04  27 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__supplies ...... [OK in 2.19s]
15:20:04  28 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept ......... [RUN]
15:20:08  28 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept .... [OK in 3.58s]
15:20:08  29 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor  [RUN]
15:20:11  29 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor  [OK in 3.40s]
15:20:11  30 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept_class ... [RUN]
15:20:13  30 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept_class  [OK in 1.85s]
15:20:13  31 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept_relationship  [RUN]
15:20:18  31 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept_relationship  [OK in 4.74s]
15:20:18  32 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept_synonym . [RUN]
15:20:20  32 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept_synonym  [OK in 2.36s]
15:20:20  33 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__domain .......... [RUN]
15:20:22  33 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__domain ..... [OK in 1.77s]
15:20:22  34 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__drug_strength ... [RUN]
15:20:24  34 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__drug_strength  [OK in 2.13s]
15:20:24  35 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__relationship .... [RUN]
15:20:26  35 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__relationship  [OK in 1.80s]
15:20:26  36 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__source_to_concept_map  [RUN]
15:20:26  36 of 85 ERROR creating sql table model dbt_synthea_dev.stg_vocabulary__source_to_concept_map  [ERROR in 0.06s]
15:20:26  37 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__vocabulary ...... [RUN]
15:20:28  37 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__vocabulary . [OK in 1.75s]
15:20:28  38 of 85 START sql table model dbt_synthea_dev.int__er_visits .................. [RUN]
15:20:31  38 of 85 OK created sql table model dbt_synthea_dev.int__er_visits ............. [OK in 3.11s]
15:20:31  39 of 85 START sql table model dbt_synthea_dev.int__ip_visits .................. [RUN]
15:20:35  39 of 85 OK created sql table model dbt_synthea_dev.int__ip_visits ............. [OK in 4.15s]
15:20:35  40 of 85 START sql table model dbt_synthea_dev.int__op_visits .................. [RUN]
15:20:57  40 of 85 OK created sql table model dbt_synthea_dev.int__op_visits ............. [OK in 22.01s]
15:20:57  41 of 85 START sql table model dbt_synthea_dev.care_site ....................... [RUN]
15:20:59  41 of 85 OK created sql table model dbt_synthea_dev.care_site .................. [OK in 1.77s]
15:20:59  42 of 85 START sql table model dbt_synthea_dev.int__person ..................... [RUN]
15:21:01  42 of 85 OK created sql table model dbt_synthea_dev.int__person ................ [OK in 2.49s]
15:21:01  43 of 85 START sql table model dbt_synthea_dev.location ........................ [RUN]
15:21:04  43 of 85 OK created sql table model dbt_synthea_dev.location ................... [OK in 2.64s]
15:21:04  44 of 85 START sql table model dbt_synthea_dev.provider ........................ [RUN]
15:21:06  44 of 85 OK created sql table model dbt_synthea_dev.provider ................... [OK in 1.96s]
15:21:06  45 of 85 START sql table model dbt_synthea_dev.concept ......................... [RUN]
15:21:11  45 of 85 OK created sql table model dbt_synthea_dev.concept .................... [OK in 4.84s]
15:21:11  46 of 85 START sql table model dbt_synthea_dev.int__source_to_source_vocab_map . [RUN]
15:21:16  46 of 85 OK created sql table model dbt_synthea_dev.int__source_to_source_vocab_map  [OK in 5.19s]
15:21:16  47 of 85 START sql table model dbt_synthea_dev.concept_ancestor ................ [RUN]
15:21:21  47 of 85 OK created sql table model dbt_synthea_dev.concept_ancestor ........... [OK in 4.67s]
15:21:21  48 of 85 START sql table model dbt_synthea_dev.concept_class ................... [RUN]
15:21:23  48 of 85 OK created sql table model dbt_synthea_dev.concept_class .............. [OK in 1.81s]
15:21:23  49 of 85 START sql table model dbt_synthea_dev.concept_relationship ............ [RUN]
15:21:28  49 of 85 OK created sql table model dbt_synthea_dev.concept_relationship ....... [OK in 5.81s]
15:21:28  50 of 85 START sql table model dbt_synthea_dev.int__source_to_standard_vocab_map  [RUN]
15:21:34  50 of 85 OK created sql table model dbt_synthea_dev.int__source_to_standard_vocab_map  [OK in 5.33s]
15:21:34  51 of 85 START sql table model dbt_synthea_dev.concept_synonym ................. [RUN]
15:21:37  51 of 85 OK created sql table model dbt_synthea_dev.concept_synonym ............ [OK in 3.53s]
15:21:37  52 of 85 START sql table model dbt_synthea_dev.domain .......................... [RUN]
15:21:39  52 of 85 OK created sql table model dbt_synthea_dev.domain ..................... [OK in 1.79s]
15:21:39  53 of 85 START sql table model dbt_synthea_dev.drug_strength ................... [RUN]
15:21:43  53 of 85 OK created sql table model dbt_synthea_dev.drug_strength .............. [OK in 3.32s]
15:21:43  54 of 85 START sql table model dbt_synthea_dev.relationship .................... [RUN]
15:21:45  54 of 85 OK created sql table model dbt_synthea_dev.relationship ............... [OK in 1.92s]
15:21:45  55 of 85 SKIP relation dbt_synthea_dev.source_to_concept_map ................... [SKIP]
15:21:45  56 of 85 START sql table model dbt_synthea_dev.cdm_source ...................... [RUN]
15:21:46  56 of 85 OK created sql table model dbt_synthea_dev.cdm_source ................. [OK in 1.84s]
15:21:46  57 of 85 START sql table model dbt_synthea_dev.vocabulary ...................... [RUN]
15:21:48  57 of 85 OK created sql table model dbt_synthea_dev.vocabulary ................. [OK in 1.79s]
15:21:48  58 of 85 START sql table model dbt_synthea_dev.int__all_visits ................. [RUN]
15:22:16  58 of 85 OK created sql table model dbt_synthea_dev.int__all_visits ............ [OK in 28.17s]
15:22:16  59 of 85 START sql table model dbt_synthea_dev.person .......................... [RUN]
15:22:19  59 of 85 OK created sql table model dbt_synthea_dev.person ..................... [OK in 2.42s]
15:22:19  60 of 85 START sql table model dbt_synthea_dev.int__encounter_provider ......... [RUN]
15:22:26  60 of 85 OK created sql table model dbt_synthea_dev.int__encounter_provider .... [OK in 6.78s]
15:22:26  61 of 85 START sql table model dbt_synthea_dev.int__drug_immunisations ......... [RUN]
15:22:27  61 of 85 OK created sql table model dbt_synthea_dev.int__drug_immunisations .... [OK in 1.86s]
15:22:27  62 of 85 START sql table model dbt_synthea_dev.int__drug_medications ........... [RUN]
15:22:43  62 of 85 OK created sql table model dbt_synthea_dev.int__drug_medications ...... [OK in 15.90s]
15:22:43  63 of 85 START sql table model dbt_synthea_dev.int__observation_allergies ...... [RUN]
15:22:46  63 of 85 OK created sql table model dbt_synthea_dev.int__observation_allergies . [OK in 2.41s]
15:22:46  64 of 85 START sql table model dbt_synthea_dev.int__observation_conditions ..... [RUN]
15:22:51  64 of 85 OK created sql table model dbt_synthea_dev.int__observation_conditions  [OK in 5.41s]
15:22:51  65 of 85 START sql table model dbt_synthea_dev.int__observation_observations ... [RUN]
15:22:57  65 of 85 OK created sql table model dbt_synthea_dev.int__observation_observations  [OK in 6.05s]
15:22:57  66 of 85 START sql table model dbt_synthea_dev.int__assign_all_visit_ids ....... [RUN]
15:23:14  66 of 85 OK created sql table model dbt_synthea_dev.int__assign_all_visit_ids .. [OK in 17.05s]
15:23:14  67 of 85 START sql table model dbt_synthea_dev.death ........................... [RUN]
15:23:17  67 of 85 OK created sql table model dbt_synthea_dev.death ...................... [OK in 2.77s]
15:23:17  68 of 85 START sql table model dbt_synthea_dev.observation_period .............. [RUN]
15:23:20  68 of 85 OK created sql table model dbt_synthea_dev.observation_period ......... [OK in 2.63s]
15:23:20  69 of 85 START sql table model dbt_synthea_dev.payer_plan_period ............... [RUN]
15:23:52  69 of 85 OK created sql table model dbt_synthea_dev.payer_plan_period .......... [OK in 32.26s]
15:23:52  70 of 85 START sql table model dbt_synthea_dev.int__final_visit_ids ............ [RUN]
15:24:03  70 of 85 OK created sql table model dbt_synthea_dev.int__final_visit_ids ....... [OK in 10.63s]
15:24:03  71 of 85 START sql table model dbt_synthea_dev.condition_occurrence ............ [RUN]
15:24:15  71 of 85 OK created sql table model dbt_synthea_dev.condition_occurrence ....... [OK in 12.92s]
15:24:15  72 of 85 START sql table model dbt_synthea_dev.device_exposure ................. [RUN]
15:24:21  72 of 85 OK created sql table model dbt_synthea_dev.device_exposure ............ [OK in 5.36s]
15:24:21  73 of 85 START sql table model dbt_synthea_dev.drug_exposure ................... [RUN]
15:25:27  73 of 85 OK created sql table model dbt_synthea_dev.drug_exposure .............. [OK in 66.43s]
15:25:27  74 of 85 START sql table model dbt_synthea_dev.measurement ..................... [RUN]
15:26:11  74 of 85 OK created sql table model dbt_synthea_dev.measurement ................ [OK in 43.41s]
15:26:11  75 of 85 START sql table model dbt_synthea_dev.observation ..................... [RUN]
15:26:38  75 of 85 OK created sql table model dbt_synthea_dev.observation ................ [OK in 27.04s]
15:26:38  76 of 85 START sql table model dbt_synthea_dev.procedure_occurrence ............ [RUN]
15:26:47  76 of 85 OK created sql table model dbt_synthea_dev.procedure_occurrence ....... [OK in 8.86s]
15:26:47  77 of 85 START sql table model dbt_synthea_dev.visit_detail .................... [RUN]
15:27:10  77 of 85 OK created sql table model dbt_synthea_dev.visit_detail ............... [OK in 23.57s]
15:27:10  78 of 85 START sql table model dbt_synthea_dev.visit_occurrence ................ [RUN]
15:27:33  78 of 85 OK created sql table model dbt_synthea_dev.visit_occurrence ........... [OK in 23.03s]
15:27:33  79 of 85 START sql table model dbt_synthea_dev.condition_era ................... [RUN]
15:27:40  79 of 85 OK created sql table model dbt_synthea_dev.condition_era .............. [OK in 7.18s]
15:27:40  80 of 85 START sql table model dbt_synthea_dev.drug_era ........................ [RUN]
15:28:43  80 of 85 OK created sql table model dbt_synthea_dev.drug_era ................... [OK in 62.78s]
15:28:43  81 of 85 SKIP relation dbt_synthea_dev.int__cost_condition ..................... [SKIP]
15:28:43  82 of 85 START sql table model dbt_synthea_dev.int__cost_drug_exposure_1 ....... [RUN]
15:28:48  82 of 85 OK created sql table model dbt_synthea_dev.int__cost_drug_exposure_1 .. [OK in 4.84s]
15:28:48  83 of 85 START sql table model dbt_synthea_dev.int__cost_drug_exposure_2 ....... [RUN]
15:29:21  83 of 85 OK created sql table model dbt_synthea_dev.int__cost_drug_exposure_2 .. [OK in 32.66s]
15:29:21  84 of 85 START sql table model dbt_synthea_dev.int__cost_procedure ............. [RUN]
15:29:29  84 of 85 OK created sql table model dbt_synthea_dev.int__cost_procedure ........ [OK in 7.95s]
15:29:29  85 of 85 SKIP relation dbt_synthea_dev.cost .................................... [SKIP]
15:29:29
15:29:29  Finished running 85 table models in 0 hours 15 minutes and 31.08 seconds (931.08s).
15:29:29
15:29:29  Completed with 2 errors and 0 warnings:
15:29:29
15:29:29    Runtime Error in model stg_synthea__claims_transactions (models/staging/synthea/stg_synthea__claims_transactions.sql)
  Out of Memory Error: failed to offload data block of size 256.0 KiB (16383.9 PiB/1.2 TiB used).
  This limit was set by the 'max_temp_directory_size' setting.
  By default, this setting utilizes the available disk space on the drive where the 'temp_directory' is located.
  You can adjust this setting, by using (for example) PRAGMA max_temp_directory_size='10GiB'
15:29:29
15:29:29    Runtime Error in model stg_vocabulary__source_to_concept_map (models/staging/vocabulary/stg_vocabulary__source_to_concept_map.sql)
  Parser Error: SELECT clause without selection list
15:29:29
15:29:29  Done. PASS=80 WARN=0 ERROR=2 SKIP=3 TOTAL=85
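The out-of-memory failure above is the one the error message itself hints at. A hedged sketch of the DuckDB settings one could try in a session against the target database before re-running (the values and temp path are illustrative, not project defaults):

```sql
-- Sketch only: knobs to try when a large model spills more temp data than
-- DuckDB allows. Size the values to your RAM and free disk space.
SET memory_limit = '48GB';                    -- cap in-memory usage explicitly
SET threads = 4;                              -- fewer threads, less concurrent memory pressure
SET temp_directory = '/data/duckdb_tmp';      -- illustrative path on a drive with plenty of space
PRAGMA max_temp_directory_size = '500GiB';    -- the setting named in the error message
```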

For reference, here are the sizes of the files I used:

146M allergies.csv
354M careplans.csv
 28G claims.csv
136G claims_transactions.csv
3.4G conditions.csv
 19M devices.csv
9.8G encounters.csv
 11G imaging_studies.csv
225M immunizations.csv
 11G medications.csv
 11G observations.csv
1.3M organizations.csv
317M patients.csv
2.5K payers.csv
8.3G payer_transitions.csv
1.7G procedures.csv
 11M providers.csv
183M supplies.csv

---

total 218G
TheCedarPrince commented 3 weeks ago

@lawrenceadams

So there are two approaches:

  1. we could either modify the loader script to glob files, or...
  2. more simply - given duckdb doesn't need files loaded before it can use them (@katy-sadowski might correct me, but my understanding is that we actively load files into the pipeline only to keep the sources.yml files identical between database systems, e.g. Postgres/DuckDB/etc. can use the same code), you could modify a sources.yml file to incorporate something like the below:

[...]

Does that make sense? It blew my mind the first time I realised I could do this! Makes working with partitioned CSV/Parquet files so much easier πŸ˜„

It does, because looking at my comment above I now realize the problem is not so much the loading of the data but the ETL step -- I was unclear where the error was coming from, but I have narrowed it down above, where you can see the ETL fails.

So, what I would like is to load multiple patient files chunked however I want (say, 100K patients per data generation) and then ETL those patients, per data generation, into a final duckdb file that continues to grow as I keep adding in new patients. As it stands right now, my understanding is that it is one shot only: I cannot add new patients to an old duckdb database made with dbt-synthea after I run dbt run initially.

Does that help frame my problem better?

lawrenceadams commented 3 weeks ago

Yeah this makes sense - I think a problem is that in this case the duckdb query planner fails and cannot figure out what is going on anymore:

Out of Memory Error: failed to offload data block of size 256.0 KiB (16383.9 PiB/1.2 TiB used).

Looks like the query planner has gone berserk there - unsure what is going on... Could you share what platform you're on and how much RAM you have?

A fun problem πŸ˜†

I'm going to try tomorrow - let me know if you have any breakthroughs!

TheCedarPrince commented 3 weeks ago

Could you share what platform you're on and how much RAM you have?

Fedora 39 (Desktop)
RAM: 64GB available

Looks like the query planner has gone berserk there - unsure what is going on...

Out of curiosity, is my intuition that chunking up how many patients I ETL each time is a good idea? Or is this problem unrelated to the solution I sketched out above?

lawrenceadams commented 3 weeks ago

I'd have thought that would be enough! Interesting... πŸ€”

Yes that's definitely a way of doing it! Two ways of doing this spring to my mind:

  1. Export each OMOP table (final models) at the end of each run to file, e.g. parquet; and then stitch them together at the end / leave as partitioned files per se. [This is very janky - but I suppose one approach]
  2. The more logical approach would be to use incremental models and dbt run 10 times with the sources being different each time (see the sketch after this list)

Not sure if either is super ideal but it is possible!
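As a rough illustration of option 2, a dbt incremental model might look something like the following; the filter logic is only a sketch and is not how dbt-synthea's models are currently written:

```sql
-- Hypothetical sketch: an OMOP-layer model configured as incremental so that
-- repeated `dbt run`s over different source chunks append rather than rebuild.
{{ config(materialized='incremental', unique_key='person_id') }}

select *
from {{ ref('int__person') }}

{% if is_incremental() %}
-- only bring in people not already loaded by a previous chunk
where person_id not in (select person_id from {{ this }})
{% endif %}
```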

lawrenceadams commented 3 weeks ago

One thing I hadn't appreciated is that every model gets materialised as a table - in reality - especially when using huge source sizes like we are now - it might be best to set the staging models (which do nothing but project) to materialize as views:

https://github.com/OHDSI/dbt-synthea/blob/8ff89117b55e1186caeb9e451461aa0f5dcc4877/dbt_project.yml#L20-L22

Otherwise we will have 4 copies of the data (loaded / staged / intermediate / omop)!
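For example (a sketch; the project-wide way would be a `+materialized: view` entry under the staging path in dbt_project.yml), a single staging model can be switched with a config block at the top of its SQL file:

```sql
-- Sketch: per-model override, e.g. at the top of a staging model such as
-- models/staging/synthea/stg_synthea__claims_transactions.sql.
{{ config(materialized='view') }}
```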

lawrenceadams commented 3 weeks ago

@TheCedarPrince I got a bit further and then realised I had forgotten to seed in the vocabulary πŸ™ƒ

It did manage to get stg_synthea__claims_transactions to work though - unsure how best to fix!

dbt run --fail-fast
17:53:10  Running with dbt=1.8.7
17:53:11  Registered adapter: duckdb=1.8.0
17:53:11  Found 86 models, 29 seeds, 425 data tests, 29 sources, 538 macros
17:53:11  
17:53:11  Concurrency: 1 threads (target='dev')
17:53:11  
17:53:11  1 of 86 START sql table model dbt_synthea_dev.dose_era ......................... [RUN]
17:53:13  1 of 86 OK created sql table model dbt_synthea_dev.dose_era .................... [OK in 1.73s]
17:53:13  2 of 86 START sql table model dbt_synthea_dev.episode .......................... [RUN]
17:53:15  2 of 86 OK created sql table model dbt_synthea_dev.episode ..................... [OK in 1.80s]
17:53:15  3 of 86 START sql table model dbt_synthea_dev.episode_event .................... [RUN]
17:53:16  3 of 86 OK created sql table model dbt_synthea_dev.episode_event ............... [OK in 1.63s]
17:53:16  4 of 86 START sql table model dbt_synthea_dev.fact_relationship ................ [RUN]
17:53:18  4 of 86 OK created sql table model dbt_synthea_dev.fact_relationship ........... [OK in 1.61s]
17:53:18  5 of 86 START sql table model dbt_synthea_dev.metadata ......................... [RUN]
17:53:20  5 of 86 OK created sql table model dbt_synthea_dev.metadata .................... [OK in 1.81s]
17:53:20  6 of 86 START sql table model dbt_synthea_dev.note ............................. [RUN]
17:53:21  6 of 86 OK created sql table model dbt_synthea_dev.note ........................ [OK in 1.63s]
17:53:21  7 of 86 START sql table model dbt_synthea_dev.note_nlp ......................... [RUN]
17:53:23  7 of 86 OK created sql table model dbt_synthea_dev.note_nlp .................... [OK in 1.64s]
17:53:23  8 of 86 START sql table model dbt_synthea_dev.specimen ......................... [RUN]
17:53:25  8 of 86 OK created sql table model dbt_synthea_dev.specimen .................... [OK in 1.78s]
17:53:25  9 of 86 START sql table model dbt_synthea_dev.stg_map__states .................. [RUN]
17:53:26  9 of 86 OK created sql table model dbt_synthea_dev.stg_map__states ............. [OK in 1.64s]
17:53:26  10 of 86 START sql table model dbt_synthea_dev.stg_synthea__allergies .......... [RUN]
17:53:30  10 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__allergies ..... [OK in 3.08s]
17:53:30  11 of 86 START sql table model dbt_synthea_dev.stg_synthea__careplans .......... [RUN]
17:53:33  11 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__careplans ..... [OK in 3.35s]
17:53:33  12 of 86 START sql table model dbt_synthea_dev.stg_synthea__claims ............. [RUN]
17:54:25  12 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__claims ........ [OK in 52.08s]
17:54:25  13 of 86 START sql table model dbt_synthea_dev.stg_synthea__claims_transactions  [RUN]
17:58:35  13 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__claims_transactions  [OK in 249.68s]
17:58:35  14 of 86 START sql table model dbt_synthea_dev.stg_synthea__conditions ......... [RUN]
17:58:40  14 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__conditions .... [OK in 5.35s]
17:58:40  15 of 86 START sql table model dbt_synthea_dev.stg_synthea__devices ............ [RUN]
17:58:43  15 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__devices ....... [OK in 2.92s]
17:58:43  16 of 86 START sql table model dbt_synthea_dev.stg_synthea__encounters ......... [RUN]
17:58:52  16 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__encounters .... [OK in 8.92s]
17:58:52  17 of 86 START sql table model dbt_synthea_dev.stg_synthea__imaging_studies .... [RUN]
17:59:12  17 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__imaging_studies  [OK in 19.78s]
17:59:12  18 of 86 START sql table model dbt_synthea_dev.stg_synthea__immunizations ...... [RUN]
17:59:15  18 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__immunizations . [OK in 3.68s]
17:59:15  19 of 86 START sql table model dbt_synthea_dev.stg_synthea__medications ........ [RUN]
17:59:23  19 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__medications ... [OK in 7.71s]
17:59:23  20 of 86 START sql table model dbt_synthea_dev.stg_synthea__observations ....... [RUN]
18:00:08  20 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__observations .. [OK in 44.97s]
18:00:08  21 of 86 START sql table model dbt_synthea_dev.stg_synthea__organizations ...... [RUN]
18:00:11  21 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__organizations . [OK in 3.00s]
18:00:11  22 of 86 START sql table model dbt_synthea_dev.stg_synthea__patients ........... [RUN]
18:00:15  22 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__patients ...... [OK in 3.84s]
18:00:15  23 of 86 START sql table model dbt_synthea_dev.stg_synthea__payer_transitions .. [RUN]
18:00:23  23 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__payer_transitions  [OK in 7.10s]
18:00:23  24 of 86 START sql table model dbt_synthea_dev.stg_synthea__payers ............. [RUN]
18:00:26  24 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__payers ........ [OK in 2.98s]
18:00:26  25 of 86 START sql table model dbt_synthea_dev.stg_synthea__procedures ......... [RUN]
18:00:34  25 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__procedures .... [OK in 8.31s]
18:00:34  26 of 86 START sql table model dbt_synthea_dev.stg_synthea__providers .......... [RUN]
18:00:37  26 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__providers ..... [OK in 3.17s]
18:00:37  27 of 86 START sql table model dbt_synthea_dev.stg_synthea__supplies ........... [RUN]
18:00:40  27 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__supplies ...... [OK in 3.35s]
18:00:40  28 of 86 START sql table model dbt_synthea_dev.stg_vocabulary__concept ......... [RUN]
18:00:40  28 of 86 ERROR creating sql table model dbt_synthea_dev.stg_vocabulary__concept  [ERROR in 0.06s]
18:00:40  29 of 86 START sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor  [RUN]
18:00:41  CANCEL query model.synthea_omop_etl.stg_vocabulary__concept_ancestor ........... [CANCEL]
18:00:41  29 of 86 ERROR creating sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor  [ERROR in 0.04s]
18:00:41  
18:00:41    Runtime Error in model stg_vocabulary__concept (models/staging/vocabulary/stg_vocabulary__concept.sql)
  Parser Error: SELECT clause without selection list
18:00:41  
18:00:41  Finished running 86 table models in 0 hours 7 minutes and 29.65 seconds (449.65s).
18:00:41  
18:00:41  Completed with 59 errors and 0 warnings:
18:00:41  
18:00:41    Runtime Error in model stg_vocabulary__concept (models/staging/vocabulary/stg_vocabulary__concept.sql)
  Parser Error: SELECT clause without selection list
18:00:41  
TheCedarPrince commented 3 weeks ago
  1. Export each OMOP table (final models) at the end of each run to file, e.g. parquet; and then stitch them together at the end / leave as partitioned files per se. [This is very janky - but I suppose one approach]

This actually is really quite reasonable for a naive and quick approach. I'll try that at some point probably tomorrow. As you know, it would have drawbacks of duplicating lots of information -- I immediately think of the vocab tables.
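If it helps, the per-run export in that approach is cheap to sketch with DuckDB's COPY (the table and paths below are illustrative, not part of dbt-synthea):

```sql
-- Sketch: dump a finished OMOP table from this run's DuckDB file to Parquet,
-- then union the per-run files later with a glob.
COPY dbt_synthea_dev.condition_occurrence
TO 'exports/run_01/condition_occurrence.parquet' (FORMAT PARQUET);

-- Later: stitch the chunks back together (or just query them in place).
-- select * from read_parquet('exports/run_*/condition_occurrence.parquet');
```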

The more logical approach would be to use incremental models and dbt run 10 times with the sources being different each time

Hunh! Not sure what is involved in that approach but seems interesting!

Otherwise we will have 4 copies of the data (loaded / staged / intermediate / omop)!

AH! That's why my persistent DuckDB database file keeps getting so large! Makes sense.

18:00:41  CANCEL query model.synthea_omop_etl.stg_vocabulary__concept_ancestor ........... [CANCEL]
18:00:41  29 of 86 ERROR creating sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor  [ERROR in 0.04s]
18:00:41
18:00:41    Runtime Error in model stg_vocabulary__concept (models/staging/vocabulary/stg_vocabulary__concept.sql)
  Parser Error: SELECT clause without selection list
18:00:41
18:00:41  Finished running 86 table models in 0 hours 7 minutes and 29.65 seconds (449.65s).
18:00:41
18:00:41  Completed with 59 errors and 0 warnings:
18:00:41
18:00:41    Runtime Error in model stg_vocabulary__concept (models/staging/vocabulary/stg_vocabulary__concept.sql)
  Parser Error: SELECT clause without selection list
18:00:41

Yea, I am unsure what this means as well...

katy-sadowski commented 3 weeks ago

Runtime Error in model stg_vocabulary__source_to_concept_map (models/staging/vocabulary/stg_vocabulary__source_to_concept_map.sql) Parser Error: SELECT clause without selection list

@TheCedarPrince can you check the compiled SQL for this?

Regarding running dbt in batches, we need to be careful because certain tables are "shared" across patients like care_site, location, and provider. I'm guessing dbt's incremental mode would handle this sort of thing? But I haven't looked into it much. My overall take is I'd prefer to avoid these sorts of approaches if at all possible and rather understand:

As I understand it, incremental mode was designed for huge datasets with the need for very frequent updates, which is not the typical use case for OMOP CDMs.

Finally, regarding the duplication of data across stage/int/marts - that's what I hope to address in #38 . Keeping all those tables around is only useful for debugging purposes while developing the ETL.

lawrenceadams commented 3 weeks ago

Re: incremental, you're right: it's really made for updates every n hours/days etc.

If duckdb is unable to cope then we'd probably need a different database engine in this instance, but this does seem like an extreme stress test/case.

I'd test on Postgres but it'd almost definitely run out of space

lawrenceadams commented 3 weeks ago

I have (with some tuning) eventually managed to get every table to build... except observations, which sadly runs out of memory for me.

observations.sql has a lot of complex joins over big tables, and looking through the docs this is what can upset duckdb.
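For anyone wanting to reproduce this, a plan like the one below can be generated by prefixing the model's compiled SQL with EXPLAIN in a DuckDB session; the join here is only a stand-in echoing the plan's patient_id = person_source_value condition, not the actual model SQL:

```sql
-- Sketch: EXPLAIN prints the physical plan (EXPLAIN ANALYZE also runs the
-- query and reports timings). Substitute the compiled observation model SQL.
EXPLAIN
select p.person_id, o.*
from dbt_synthea_dev.stg_synthea__observations as o
join dbt_synthea_dev.person as p
  on o.patient_id = p.person_source_value;
```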

Here is the query plan for it for reference:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚β”‚       Physical Plan       β”‚β”‚
β”‚β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         PROJECTION        β”‚
β”‚    ────────────────────   β”‚
β”‚       observation_id      β”‚
β”‚         person_id         β”‚
β”‚   observation_concept_id  β”‚
β”‚      observation_date     β”‚
β”‚    observation_datetime   β”‚
β”‚observation_type_concept_idβ”‚
β”‚            NULL           β”‚
β”‚            NULL           β”‚
β”‚    value_as_concept_id    β”‚
β”‚    qualifier_concept_id   β”‚
β”‚      unit_concept_id      β”‚
β”‚        provider_id        β”‚
β”‚    visit_occurrence_id    β”‚
β”‚      visit_detail_id      β”‚
β”‚  observation_source_value β”‚
β”‚observation_source_concept_β”‚
β”‚             id            β”‚
β”‚            NULL           β”‚
β”‚            NULL           β”‚
β”‚            NULL           β”‚
β”‚            NULL           β”‚
β”‚            NULL           β”‚
β”‚                           β”‚
β”‚      ~319014617 Rows      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         PROJECTION        β”‚
β”‚    ────────────────────   β”‚
β”‚             #0            β”‚
β”‚             #1            β”‚
β”‚             #2            β”‚
β”‚             #3            β”‚
β”‚             #4            β”‚
β”‚             #5            β”‚
β”‚             #6            β”‚
β”‚             #7            β”‚
β”‚             #8            β”‚
β”‚             #9            β”‚
β”‚            #10            β”‚
β”‚            #11            β”‚
β”‚            #12            β”‚
β”‚            #13            β”‚
β”‚            #14            β”‚
β”‚            #15            β”‚
β”‚                           β”‚
β”‚      ~319014617 Rows      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           WINDOW          β”‚
β”‚    ────────────────────   β”‚
β”‚        Projections:       β”‚
β”‚  ROW_NUMBER() OVER (ORDER β”‚
β”‚   BY person_id ASC NULLS  β”‚
β”‚            LAST)          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         HASH_JOIN         β”‚
β”‚    ────────────────────   β”‚
β”‚      Join Type: INNER     β”‚
β”‚                           β”‚
β”‚        Conditions:        β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚        patient_id =       β”‚                                                                                                                                  β”‚
β”‚     person_source_value   β”‚                                                                                                                                  β”‚
β”‚                           β”‚                                                                                                                                  β”‚
β”‚      ~319014617 Rows      β”‚                                                                                                                                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                                                                                                                  β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                                                                                                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         PROJECTION        β”‚                                                                                                                    β”‚         SEQ_SCAN          β”‚
β”‚    ────────────────────   β”‚                                                                                                                    β”‚    ────────────────────   β”‚
β”‚             #0            β”‚                                                                                                                    β”‚           person          β”‚
β”‚             #1            β”‚                                                                                                                    β”‚                           β”‚
β”‚             #2            β”‚                                                                                                                    β”‚        Projections:       β”‚
β”‚             #3            β”‚                                                                                                                    β”‚    person_source_value    β”‚
β”‚             #4            β”‚                                                                                                                    β”‚         person_id         β”‚
β”‚             #5            β”‚                                                                                                                    β”‚                           β”‚
β”‚             #6            β”‚                                                                                                                    β”‚                           β”‚
β”‚             #7            β”‚                                                                                                                    β”‚                           β”‚
β”‚             #8            β”‚                                                                                                                    β”‚                           β”‚
β”‚             #9            β”‚                                                                                                                    β”‚                           β”‚
β”‚            #10            β”‚                                                                                                                    β”‚                           β”‚
β”‚            #11            β”‚                                                                                                                    β”‚                           β”‚
β”‚__internal_decompress_integβ”‚                                                                                                                    β”‚                           β”‚
β”‚     ral_bigint(#12, 1)    β”‚                                                                                                                    β”‚                           β”‚
β”‚                           β”‚                                                                                                                    β”‚                           β”‚
β”‚      ~253445155 Rows      β”‚                                                                                                                    β”‚       ~1144346 Rows       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                                                                                                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         HASH_JOIN         β”‚
β”‚    ────────────────────   β”‚
β”‚      Join Type: LEFT      β”‚
β”‚                           β”‚
β”‚        Conditions:        β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚encounter_id = encounter_idβ”‚                                                                                                     β”‚
β”‚  patient_id = patient_id  β”‚                                                                                                     β”‚
β”‚                           β”‚                                                                                                     β”‚
β”‚      ~253445155 Rows      β”‚                                                                                                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                                                                                     β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                                                                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         PROJECTION        β”‚                                                                                       β”‚         PROJECTION        β”‚
β”‚    ────────────────────   β”‚                                                                                       β”‚    ────────────────────   β”‚
β”‚             #0            β”‚                                                                                       β”‚             #0            β”‚
β”‚             #1            β”‚                                                                                       β”‚             #1            β”‚
β”‚             #2            β”‚                                                                                       β”‚__internal_compress_integraβ”‚
β”‚             #3            β”‚                                                                                       β”‚     l_usmallint(#2, 1)    β”‚
β”‚             #4            β”‚                                                                                       β”‚                           β”‚
β”‚             #5            β”‚                                                                                       β”‚                           β”‚
β”‚             #6            β”‚                                                                                       β”‚                           β”‚
β”‚             #7            β”‚                                                                                       β”‚                           β”‚
β”‚             #8            β”‚                                                                                       β”‚                           β”‚
β”‚__internal_decompress_integβ”‚                                                                                       β”‚                           β”‚
β”‚     ral_bigint(#9, 1)     β”‚                                                                                       β”‚                           β”‚
β”‚                           β”‚                                                                                       β”‚                           β”‚
β”‚      ~253445155 Rows      β”‚                                                                                       β”‚       ~58246069 Rows      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                                                                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                                                                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         HASH_JOIN         β”‚                                                                                       β”‚         SEQ_SCAN          β”‚
β”‚    ────────────────────   β”‚                                                                                       β”‚    ────────────────────   β”‚
β”‚      Join Type: LEFT      β”‚                                                                                       β”‚  int__encounter_provider  β”‚
β”‚                           β”‚                                                                                       β”‚                           β”‚
β”‚        Conditions:        β”‚                                                                                       β”‚        Projections:       β”‚
β”‚encounter_id = encounter_idβ”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”‚        encounter_id       β”‚
β”‚                           β”‚                                                                        β”‚              β”‚         patient_id        β”‚
β”‚                           β”‚                                                                        β”‚              β”‚        provider_id        β”‚
β”‚                           β”‚                                                                        β”‚              β”‚                           β”‚
β”‚      ~253445155 Rows      β”‚                                                                        β”‚              β”‚       ~58246069 Rows      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                                                        β”‚              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                                          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           UNION           β”‚                                                          β”‚         PROJECTION        β”‚
β”‚                           β”‚                                                          β”‚    ────────────────────   β”‚
β”‚                           β”‚                                                          β”‚             #0            β”‚
β”‚                           β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”‚__internal_compress_integraβ”‚
β”‚                           β”‚                                           β”‚              β”‚     l_uinteger(#1, 1)     β”‚
β”‚                           β”‚                                           β”‚              β”‚                           β”‚
β”‚                           β”‚                                           β”‚              β”‚       ~56156413 Rows      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                           β”‚              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         PROJECTION        β”‚                             β”‚         SEQ_SCAN          β”‚β”‚         SEQ_SCAN          β”‚
β”‚    ────────────────────   β”‚                             β”‚    ────────────────────   β”‚β”‚    ────────────────────   β”‚
β”‚             #0            β”‚                             β”‚int__observation_observatioβ”‚β”‚    int__final_visit_ids   β”‚
β”‚             #1            β”‚                             β”‚             ns            β”‚β”‚                           β”‚
β”‚             #2            β”‚                             β”‚                           β”‚β”‚        Projections:       β”‚
β”‚             #3            β”‚                             β”‚        Projections:       β”‚β”‚        encounter_id       β”‚
β”‚   CAST(#4 AS TIMESTAMP)   β”‚                             β”‚         patient_id        β”‚β”‚  visit_occurrence_id_new  β”‚
β”‚             #5            β”‚                             β”‚        encounter_id       β”‚β”‚                           β”‚
β”‚             #6            β”‚                             β”‚   observation_concept_id  β”‚β”‚                           β”‚
β”‚             #7            β”‚                             β”‚      observation_date     β”‚β”‚                           β”‚
β”‚                           β”‚                             β”‚    observation_datetime   β”‚β”‚                           β”‚
β”‚                           β”‚                             β”‚observation_type_concept_idβ”‚β”‚                           β”‚
β”‚                           β”‚                             β”‚  observation_source_value β”‚β”‚                           β”‚
β”‚                           β”‚                             β”‚observation_source_concept_β”‚β”‚                           β”‚
β”‚                           β”‚                             β”‚             id            β”‚β”‚                           β”‚
β”‚                           β”‚                             β”‚                           β”‚β”‚                           β”‚
β”‚       ~29567613 Rows      β”‚                             β”‚      ~223877542 Rows      β”‚β”‚       ~56156413 Rows      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           UNION           β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         SEQ_SCAN          β”‚β”‚         SEQ_SCAN          β”‚
β”‚    ────────────────────   β”‚β”‚    ────────────────────   β”‚
β”‚ int__observation_allergiesβ”‚β”‚int__observation_conditionsβ”‚
β”‚                           β”‚β”‚                           β”‚
β”‚        Projections:       β”‚β”‚        Projections:       β”‚
β”‚         patient_id        β”‚β”‚         patient_id        β”‚
β”‚        encounter_id       β”‚β”‚        encounter_id       β”‚
β”‚   observation_concept_id  β”‚β”‚   observation_concept_id  β”‚
β”‚      observation_date     β”‚β”‚      observation_date     β”‚
β”‚    observation_datetime   β”‚β”‚    observation_datetime   β”‚
β”‚observation_type_concept_idβ”‚β”‚observation_type_concept_idβ”‚
β”‚  observation_source_value β”‚β”‚  observation_source_value β”‚
β”‚observation_source_concept_β”‚β”‚observation_source_concept_β”‚
β”‚             id            β”‚β”‚             id            β”‚
β”‚                           β”‚β”‚                           β”‚
β”‚        ~719717 Rows       β”‚β”‚       ~28847896 Rows      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

It seems like DuckDB will try to spill to disk where it can for operations such as joins, ordering, and windowing - but the docs note that if a query contains many of these operators it may not manage to offload them all to disk, which is what I imagine is happening here!
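For anyone hitting this locally, a few DuckDB settings can make spilling more likely to succeed before the run is attempted. A minimal sketch, assuming you can run statements against the target DuckDB database (the values are illustrative, not recommendations, and the temp path is a placeholder):

```sql
-- Minimal sketch; limits are illustrative and the temp path is a placeholder.
-- Cap in-memory usage so operators start spilling instead of running out of RAM.
SET memory_limit = '8GB';
-- Send spill files to a disk with plenty of free space.
SET temp_directory = '/path/with/space/duckdb_tmp';
-- Relaxing insertion-order preservation lets more operators offload to disk.
SET preserve_insertion_order = false;
-- Fewer threads means fewer concurrent hash tables competing for the memory limit.
SET threads = 2;
```

If the dbt profile in use supports passing connection settings through to DuckDB, the same values could live there instead of an ad-hoc session.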

lawrenceadams commented 3 weeks ago

I had some success running in Snowflake (XS Warehouse) - the same queries that cause it to fail locally produce a large spill to local warehouse storage there. In total it took about 20 minutes to run the whole project (without running tests) (~0.3 credits, ~$0.80 pre-tax 💸)

These are the queries with their metrics, in case they're of interest: Metrics (CSV)

I will try to dig in and see what the worst offenders are, but at a glance it's measurements/observations, both of which have large joins with few predicates as far as I can see!

katy-sadowski commented 3 weeks ago

Runtime Error in model stg_vocabulary__source_to_concept_map (models/staging/vocabulary/stg_vocabulary__source_to_concept_map.sql) Parser Error: SELECT clause without selection list

I realized what this is. The vocab download doesn't come with a source_to_concept_map (duh). I'll file a ticket to handle this table in BYO vocab mode.

Also, I'm looking at the observation SQL. I'm surprised to see there's nothing crazy going on in there. This AND might not be needed, and we could potentially fold that encounters/providers mapper into the visit int table to skip a join.
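To sketch the folding idea (model names are taken from the plan above; exact columns are assumed), a combined visit-level lookup that also carries provider_id would let event models like observation join once instead of hitting int__encounter_provider separately:

```sql
-- Rough sketch only: model names come from the query plan above, columns are assumed.
-- One encounter-keyed lookup combining the final visit id and its provider.
select
    v.encounter_id,
    v.visit_occurrence_id_new,
    ep.provider_id
from {{ ref('int__final_visit_ids') }} as v
left join {{ ref('int__encounter_provider') }} as ep
    on v.encounter_id = ep.encounter_id
```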

@lawrenceadams , what were the runtimes for the other models that succeeded / which were the slowest?

lawrenceadams commented 3 weeks ago

Apologies @katy-sadowski I ran out of time to dig in to the results over the weekend!

These are the 5 slowest models, and generally the ones that needed the most out-of-memory processing (arguably a moot point, as we could reduce the number of threads [threads=8] or use a bigger warehouse - but still interesting).

| Query ID | Model | Total Elapsed Time (ms) | Bytes Scanned (GB) | % Scanned from Cache | Bytes Written to Result (GB) | Rows Produced | Rows Written to Result | Bytes Spilled to Local Storage (GB) | Execution Time (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 01b7f2aa-0000-673b-0000-f5a10005ea16 | observation | 754,447 (~12.5 min) | 4.1 | 92.30% | 3.3 | 253,447,438 | 1 | 35.693396 | 753,980 (~12.5 min) |
| 01b7f2aa-0000-674d-0000-f5a100063052 | visit_detail | 370,289 (~6.2 min) | 3.4 | 82% | 2.0 | 54,933,090 | 1 | 16.240853 | 369,871 (~6.2 min) |
| 01b7f2aa-0000-6713-0000-f5a100062362 | visit_occurrence | 368,723 (~6.1 min) | 3.4 | 83.60% | 1.9 | 54,933,090 | 1 | 15.9126733 | 368,252 (~6.1 min) |
| 01b7f2aa-0000-673b-0000-f5a10005ea1e | procedure_occurrence | 345,095 (~5.8 min) | 3.9 | 57.70% | 1.3 | 45,071,224 | 1 | 20.5998039 | 344,142 (~5.8 min) |
| 01b7f2b0-0000-6714-0000-f5a1000614a6 | int__cost_drug_exposure_2 | 319,436 (~5.3 min) | 3.8 | 79.20% | 0.8 | 53,469,529 | 1 | 10.4926577 | 318,674 (~5.3 min) |
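For reference, metrics like the ones above can usually be pulled straight from Snowflake's query history rather than collected by hand. A sketch, assuming access to the SNOWFLAKE.ACCOUNT_USAGE share (the text filter is only an illustration of how to narrow to a given dbt model build):

```sql
-- Sketch only: assumes ACCOUNT_USAGE access; the filter and ordering are illustrative.
select
    query_id,
    total_elapsed_time,                  -- milliseconds
    bytes_scanned,
    percentage_scanned_from_cache,
    bytes_written_to_result,
    rows_produced,
    rows_written_to_result,
    bytes_spilled_to_local_storage,
    execution_time                       -- milliseconds
from snowflake.account_usage.query_history
where query_text ilike '%observation%'
order by total_elapsed_time desc
limit 10;
```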

Looking at the most expensive nodes in observation:

image

This join is the most expensive, followed by the one you linked to, followed by the window function


I agree @katy-sadowski, that second join predicate looks redundant - although interestingly, running that model by itself with and without the join on patient ID returns 79 fewer rows when both predicates are used (253,447,517 [join on encounter_id only] - 253,447,438 [join on both encounter and patient ID] = 79)... I'll need to inspect why!
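One way to chase those 79 rows could be to look for event rows whose patient disagrees with the patient on the matching encounter. A rough sketch, assuming staging-style models named stg_synthea__observations and stg_synthea__encounters with patient_id/encounter_id columns (swap in the project's real model names):

```sql
-- Rough sketch only: model and column names are assumptions, not the project's actual names.
select
    o.encounter_id,
    o.patient_id as observation_patient_id,
    e.patient_id as encounter_patient_id
from {{ ref('stg_synthea__observations') }} as o
inner join {{ ref('stg_synthea__encounters') }} as e
    on o.encounter_id = e.encounter_id
where o.patient_id <> e.patient_id
```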

When run by themselves the models are much faster (I've only tried observation, but it took 2 minutes instead of 12) - at some point I'll run on Snowflake with one thread to see what happens; that will be more useful!

katy-sadowski commented 3 weeks ago

Interesting thanks @lawrenceadams !

This join is the most expensive, followed by the one you linked to, followed by the window function

In this case I wonder if it would be faster to move the joins down into the individual intermediate tables.
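As a rough illustration of that idea (model names from the plan above; columns assumed), each branch would resolve its visit id before the UNION, so the final model no longer needs one giant post-union join:

```sql
-- Rough sketch only: pushes the visit id lookup into each source branch instead of
-- joining once across ~250M unioned rows. Model names from the plan; columns assumed.
with conditions as (
    select c.*, v.visit_occurrence_id_new
    from {{ ref('int__observation_conditions') }} as c
    left join {{ ref('int__final_visit_ids') }} as v
        on c.encounter_id = v.encounter_id
),

allergies as (
    select a.*, v.visit_occurrence_id_new
    from {{ ref('int__observation_allergies') }} as a
    left join {{ ref('int__final_visit_ids') }} as v
        on a.encounter_id = v.encounter_id
)

select * from conditions
union all
select * from allergies
```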

interestingly, running that model by itself with and without the join on patient ID returns 79 fewer rows when both predicates are used (253,447,517 [join on encounter_id only] - 253,447,438 [join on both encounter and patient ID] = 79)

weird - agree that should be checked! are there encounters missing a patient ID in your source data? (there aren't any in the seed dataset). if so we probably want to handle that scenario explicitly in the ETL.

When run by themselves the models are much faster (I've only tried observation, but it took 2 minutes instead of 12) - at some point I'll run on Snowflake with one thread to see what happens; that will be more useful!

you mean dbt run --model observation?

lawrenceadams commented 2 weeks ago

In this case I wonder if it would be faster to move the joins down into the individual intermediate tables.

Quite possibly - from memory there's no filtering at any point, so we'd have to pay the price somewhere. Maybe worth trying!

When I've done this on real EHR data I tend to have these joins consolidated before the end, and use the final models to do things like global transformations (e.g. join vocabulary where needed) - but we only do that once. I wonder how others in this space handle it!

weird - agree that should be checked! are there encounters missing a patient ID in your source data? (there aren't any in the seed dataset). if so we probably want to handle that scenario explicitly in the ETL.

Great shout - I did this and there are 79 cases which violate this assumption! No idea how this happens as I can't imagine it's a feature of Synthea?

image

It might be worth re-building the dataset and seeing if it happens for others. Can others replicate?

you mean dbt run --model observation?

In effect, yep - I re-ran it with dbt run --threads 1 just now and it was similar (measurement, observation, int__assign_all_visit_ids, int__cost_drug_exposure_2, and int__cost_procedure were the top 5 offenders!)

katy-sadowski commented 2 weeks ago

When I've done this on real EHR data I tend to have these joins consolidated before the end, and use the final models to do things like global transformations (e.g. join vocabulary where needed) - but we only do that once. I wonder how others in this space handle it!

Maybe a good discussion topic for our next meeting.

It might be worth re-building the dataset and seeing if it happens for others. Can others replicate?

I will try. I'm guessing it's a bug in Synthea (or maybe a weird feature like this one). Once I've got my mega dataset I can start experimenting with some of these optimization ideas too :)

TheCedarPrince commented 2 weeks ago

I can try replicating! What do you want me to try doing @lawrenceadams ?

lawrenceadams commented 2 weeks ago

Maybe a good discussion topic for our next meeting.

Great idea!

lawrenceadams commented 2 weeks ago

I can try replicating! What do you want me to try doing @lawrenceadams ?

Amazing!! Did you manage to make a huge synthea dataset? Is it possible to check if encounters have a different patient ID attached to them, as above?

TheCedarPrince commented 2 weeks ago

I can try replicating! What do you want me to try doing @lawrenceadams ?

Amazing!! Did you manage to make a huge synthea dataset? Is it possible to check if encounters have a different patient ID attached to them, as above?

So you want me to generate a huge Synthea dataset? Would 1 million patients with a 3- or 5-year retrospective look back be enough to count as "huge" for you @lawrenceadams?

I can generate that and then check out the different patient ID. :)