TheCedarPrince opened this issue 4 weeks ago (status: Open)
Fantastic, thanks @TheCedarPrince ! I think I'll try it on my end as well so we have 2x OS & laptop worth of data on performance :)
Nice work @TheCedarPrince ! I was playing around with this in Snowflake over the weekend and generated a million patients and started running it on there to see how it would benchmark compared to duckdb - keen to hear how you get on!
Question for folks:
Can I run dbt-synthea on "chunks" of data I have manually split up? (Meaning, I have folders like csvs_1, csvs_2, etc., and each folder has about 200K patients.) And then run dbt-synthea against each of these folders without rebuilding the prior DuckDB database? Was wondering, as then I can chunk out the ETL simply this way.
Also, @katy-sadowski , I was going to tinker a little later with also getting hyperfine to work with this process for benchmarking purposes. It's considered the state of the art for benchmarking and is used a ton in numerical or HPC environments -- especially like high energy physics areas. Ran into some problems yesterday when I tried to naively run it so there might be some system environment quirks.
Can I run dbt-synthea on "chunks" of data I have manually split up? (Meaning, I have folders like csvs_1, csvs_2, etc., and each folder has about 200K patients.) And then run dbt-synthea against each of these folders without rebuilding the prior DuckDB database? Was wondering, as then I can chunk out the ETL simply this way.
Yeah you definitely can with duckdb with next to no effort - you'd have to change the sources to glob like this!
Interesting -- what files would need to be changed to accommodate this in dbt-synthea? Because I can load vocab now fine in #90 and mostly the loading of data is fine, but the problem I have, as mentioned:
[ETLing] this data in the step dbt run fails on large tables in the OMOP Schema -- especially those tables which have a large lineage.
So I would want to dbt run multiple times versus loading. Or do you see the load step as helping with the ETL step, @lawrenceadams ?
So there are two approaches:
- we could either modify the loader script to glob files, or...
- more simply - given duckdb doesn't need files loaded before it can use them (@katy-sadowski might correct me - but my understanding as to why we have a step where we actively load files into the pipeline is to keep the source.yml files identical between database systems, so that e.g. Postgres/DuckDB/etc. can use the same code), you could modify a sources.yml file to incorporate something like the below.

I would modify the sources file below to:
version: 2

sources:
  - name: synthea
    meta:
      external_location: "/path/to/synthea/output/{name}/*.csv"
    tables:
      - name: allergies
      - name: careplans
      - name: claims_transactions
      - name: claims
      - name: conditions
      - name: devices
      - name: encounters
      - name: imaging_studies
      - name: immunizations
      - name: medications
      - name: observations
      - name: organizations
      - name: patients
      - name: payer_transitions
      - name: payers
      - name: procedures
      - name: providers
      - name: supplies
and it should work without having to have an explicit load step!
Does that make sense? It blew my mind the first time I realised I could do this! Makes working with partitioned CSV/Parquet files so much easier!
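For anyone curious what that glob actually does: the dbt-duckdb adapter just points DuckDB at the files, so the source resolves to a scan roughly like the one below. The path and options here are illustrative assumptions, not the project's actual compiled SQL.

-- DuckDB reads the partitioned CSVs in place; no copy/load step is required
select *
from read_csv_auto('/path/to/synthea/output/patients/*.csv', header = true);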
I tried loading in the chonkiest of tables (claims transactions) with success - although this took a fair amount of time on a high end Mac...
Latest updates @lawrenceadams and @katy-sadowski :
I generated about one million synthetic patients (with some dead patients included), each with a 1 year lookback, using Synthea (see my synthea.properties file below):
Following this, I loaded the vocab files:
(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt run-operation load_data_duckdb --args "{file_dict: $file_dict, vocab_tables: true}"
15:12:50 Running with dbt=1.8.7
15:12:51 Registered adapter: duckdb=1.8.0
15:12:51 Found 85 models, 29 seeds, 425 data tests, 29 sources, 537 macros
(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt seed --select states omop
15:13:48 Running with dbt=1.8.7
15:13:48 Registered adapter: duckdb=1.8.0
15:13:49 Found 85 models, 29 seeds, 425 data tests, 29 sources, 537 macros
15:13:49
15:13:51 Concurrency: 1 threads (target='dev')
15:13:51
15:13:51 1 of 1 START seed file dbt_synthea_dev_map_seeds.states ........................ [RUN]
15:13:52 1 of 1 OK loaded seed file dbt_synthea_dev_map_seeds.states .................... [INSERT 51 in 1.71s]
15:13:52
15:13:52 Finished running 1 seed in 0 hours 0 minutes and 3.67 seconds (3.67s).
15:13:53
15:13:53 Completed successfully
15:13:53
15:13:53 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
And then I ran dbt run but had some tables fail -- not sure what to make of them as well as the suggestions at the end of the run process:
(test) thecedarprince@thecedarledge:~/FOSS/dbt-synthea$ dbt run
15:13:57 Running with dbt=1.8.7
15:13:57 Registered adapter: duckdb=1.8.0
15:13:58 Found 85 models, 29 seeds, 425 data tests, 29 sources, 537 macros
15:13:58
15:13:58 Concurrency: 1 threads (target='dev')
15:13:58
15:13:58 1 of 85 START sql table model dbt_synthea_dev.dose_era ......................... [RUN]
15:14:00 1 of 85 OK created sql table model dbt_synthea_dev.dose_era .................... [OK in 1.78s]
15:14:00 2 of 85 START sql table model dbt_synthea_dev.episode .......................... [RUN]
15:14:01 2 of 85 OK created sql table model dbt_synthea_dev.episode ..................... [OK in 1.66s]
15:14:01 3 of 85 START sql table model dbt_synthea_dev.episode_event .................... [RUN]
15:14:03 3 of 85 OK created sql table model dbt_synthea_dev.episode_event ............... [OK in 1.50s]
15:14:03 4 of 85 START sql table model dbt_synthea_dev.fact_relationship ................ [RUN]
15:14:04 4 of 85 OK created sql table model dbt_synthea_dev.fact_relationship ........... [OK in 1.23s]
15:14:04 5 of 85 START sql table model dbt_synthea_dev.metadata ......................... [RUN]
15:14:05 5 of 85 OK created sql table model dbt_synthea_dev.metadata .................... [OK in 1.21s]
15:14:05 6 of 85 START sql table model dbt_synthea_dev.note ............................. [RUN]
15:14:06 6 of 85 OK created sql table model dbt_synthea_dev.note ........................ [OK in 1.29s]
15:14:06 7 of 85 START sql table model dbt_synthea_dev.note_nlp ......................... [RUN]
15:14:08 7 of 85 OK created sql table model dbt_synthea_dev.note_nlp .................... [OK in 1.25s]
15:14:08 8 of 85 START sql table model dbt_synthea_dev.specimen ......................... [RUN]
15:14:09 8 of 85 OK created sql table model dbt_synthea_dev.specimen .................... [OK in 1.26s]
15:14:09 9 of 85 START sql table model dbt_synthea_dev.stg_map__states .................. [RUN]
15:14:10 9 of 85 OK created sql table model dbt_synthea_dev.stg_map__states ............. [OK in 1.23s]
15:14:10 10 of 85 START sql table model dbt_synthea_dev.stg_synthea__allergies .......... [RUN]
15:14:12 10 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__allergies ..... [OK in 1.62s]
15:14:12 11 of 85 START sql table model dbt_synthea_dev.stg_synthea__careplans .......... [RUN]
15:14:14 11 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__careplans ..... [OK in 2.10s]
15:14:14 12 of 85 START sql table model dbt_synthea_dev.stg_synthea__claims ............. [RUN]
15:15:24 12 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__claims ........ [OK in 69.77s]
15:15:24 13 of 85 START sql table model dbt_synthea_dev.stg_synthea__claims_transactions [RUN]
15:18:07 13 of 85 ERROR creating sql table model dbt_synthea_dev.stg_synthea__claims_transactions [ERROR in 163.53s]
15:18:07 14 of 85 START sql table model dbt_synthea_dev.stg_synthea__conditions ......... [RUN]
15:18:15 14 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__conditions .... [OK in 8.14s]
15:18:15 15 of 85 START sql table model dbt_synthea_dev.stg_synthea__devices ............ [RUN]
15:18:17 15 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__devices ....... [OK in 1.63s]
15:18:17 16 of 85 START sql table model dbt_synthea_dev.stg_synthea__encounters ......... [RUN]
15:18:33 16 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__encounters .... [OK in 16.23s]
15:18:33 17 of 85 START sql table model dbt_synthea_dev.stg_synthea__imaging_studies .... [RUN]
15:18:50 17 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__imaging_studies [OK in 16.53s]
15:18:50 18 of 85 START sql table model dbt_synthea_dev.stg_synthea__immunizations ...... [RUN]
15:18:52 18 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__immunizations . [OK in 2.35s]
15:18:52 19 of 85 START sql table model dbt_synthea_dev.stg_synthea__medications ........ [RUN]
15:19:12 19 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__medications ... [OK in 19.96s]
15:19:12 20 of 85 START sql table model dbt_synthea_dev.stg_synthea__observations ....... [RUN]
15:19:33 20 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__observations .. [OK in 21.33s]
15:19:33 21 of 85 START sql table model dbt_synthea_dev.stg_synthea__organizations ...... [RUN]
15:19:35 21 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__organizations . [OK in 1.71s]
15:19:35 22 of 85 START sql table model dbt_synthea_dev.stg_synthea__patients ........... [RUN]
15:19:38 22 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__patients ...... [OK in 2.91s]
15:19:38 23 of 85 START sql table model dbt_synthea_dev.stg_synthea__payer_transitions .. [RUN]
15:19:54 23 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__payer_transitions [OK in 15.85s]
15:19:54 24 of 85 START sql table model dbt_synthea_dev.stg_synthea__payers ............. [RUN]
15:19:56 24 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__payers ........ [OK in 1.95s]
15:19:56 25 of 85 START sql table model dbt_synthea_dev.stg_synthea__procedures ......... [RUN]
15:20:00 25 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__procedures .... [OK in 4.16s]
15:20:00 26 of 85 START sql table model dbt_synthea_dev.stg_synthea__providers .......... [RUN]
15:20:02 26 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__providers ..... [OK in 1.95s]
15:20:02 27 of 85 START sql table model dbt_synthea_dev.stg_synthea__supplies ........... [RUN]
15:20:04 27 of 85 OK created sql table model dbt_synthea_dev.stg_synthea__supplies ...... [OK in 2.19s]
15:20:04 28 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept ......... [RUN]
15:20:08 28 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept .... [OK in 3.58s]
15:20:08 29 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor [RUN]
15:20:11 29 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor [OK in 3.40s]
15:20:11 30 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept_class ... [RUN]
15:20:13 30 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept_class [OK in 1.85s]
15:20:13 31 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept_relationship [RUN]
15:20:18 31 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept_relationship [OK in 4.74s]
15:20:18 32 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__concept_synonym . [RUN]
15:20:20 32 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__concept_synonym [OK in 2.36s]
15:20:20 33 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__domain .......... [RUN]
15:20:22 33 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__domain ..... [OK in 1.77s]
15:20:22 34 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__drug_strength ... [RUN]
15:20:24 34 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__drug_strength [OK in 2.13s]
15:20:24 35 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__relationship .... [RUN]
15:20:26 35 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__relationship [OK in 1.80s]
15:20:26 36 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__source_to_concept_map [RUN]
15:20:26 36 of 85 ERROR creating sql table model dbt_synthea_dev.stg_vocabulary__source_to_concept_map [ERROR in 0.06s]
15:20:26 37 of 85 START sql table model dbt_synthea_dev.stg_vocabulary__vocabulary ...... [RUN]
15:20:28 37 of 85 OK created sql table model dbt_synthea_dev.stg_vocabulary__vocabulary . [OK in 1.75s]
15:20:28 38 of 85 START sql table model dbt_synthea_dev.int__er_visits .................. [RUN]
15:20:31 38 of 85 OK created sql table model dbt_synthea_dev.int__er_visits ............. [OK in 3.11s]
15:20:31 39 of 85 START sql table model dbt_synthea_dev.int__ip_visits .................. [RUN]
15:20:35 39 of 85 OK created sql table model dbt_synthea_dev.int__ip_visits ............. [OK in 4.15s]
15:20:35 40 of 85 START sql table model dbt_synthea_dev.int__op_visits .................. [RUN]
15:20:57 40 of 85 OK created sql table model dbt_synthea_dev.int__op_visits ............. [OK in 22.01s]
15:20:57 41 of 85 START sql table model dbt_synthea_dev.care_site ....................... [RUN]
15:20:59 41 of 85 OK created sql table model dbt_synthea_dev.care_site .................. [OK in 1.77s]
15:20:59 42 of 85 START sql table model dbt_synthea_dev.int__person ..................... [RUN]
15:21:01 42 of 85 OK created sql table model dbt_synthea_dev.int__person ................ [OK in 2.49s]
15:21:01 43 of 85 START sql table model dbt_synthea_dev.location ........................ [RUN]
15:21:04 43 of 85 OK created sql table model dbt_synthea_dev.location ................... [OK in 2.64s]
15:21:04 44 of 85 START sql table model dbt_synthea_dev.provider ........................ [RUN]
15:21:06 44 of 85 OK created sql table model dbt_synthea_dev.provider ................... [OK in 1.96s]
15:21:06 45 of 85 START sql table model dbt_synthea_dev.concept ......................... [RUN]
15:21:11 45 of 85 OK created sql table model dbt_synthea_dev.concept .................... [OK in 4.84s]
15:21:11 46 of 85 START sql table model dbt_synthea_dev.int__source_to_source_vocab_map . [RUN]
15:21:16 46 of 85 OK created sql table model dbt_synthea_dev.int__source_to_source_vocab_map [OK in 5.19s]
15:21:16 47 of 85 START sql table model dbt_synthea_dev.concept_ancestor ................ [RUN]
15:21:21 47 of 85 OK created sql table model dbt_synthea_dev.concept_ancestor ........... [OK in 4.67s]
15:21:21 48 of 85 START sql table model dbt_synthea_dev.concept_class ................... [RUN]
15:21:23 48 of 85 OK created sql table model dbt_synthea_dev.concept_class .............. [OK in 1.81s]
15:21:23 49 of 85 START sql table model dbt_synthea_dev.concept_relationship ............ [RUN]
15:21:28 49 of 85 OK created sql table model dbt_synthea_dev.concept_relationship ....... [OK in 5.81s]
15:21:28 50 of 85 START sql table model dbt_synthea_dev.int__source_to_standard_vocab_map [RUN]
15:21:34 50 of 85 OK created sql table model dbt_synthea_dev.int__source_to_standard_vocab_map [OK in 5.33s]
15:21:34 51 of 85 START sql table model dbt_synthea_dev.concept_synonym ................. [RUN]
15:21:37 51 of 85 OK created sql table model dbt_synthea_dev.concept_synonym ............ [OK in 3.53s]
15:21:37 52 of 85 START sql table model dbt_synthea_dev.domain .......................... [RUN]
15:21:39 52 of 85 OK created sql table model dbt_synthea_dev.domain ..................... [OK in 1.79s]
15:21:39 53 of 85 START sql table model dbt_synthea_dev.drug_strength ................... [RUN]
15:21:43 53 of 85 OK created sql table model dbt_synthea_dev.drug_strength .............. [OK in 3.32s]
15:21:43 54 of 85 START sql table model dbt_synthea_dev.relationship .................... [RUN]
15:21:45 54 of 85 OK created sql table model dbt_synthea_dev.relationship ............... [OK in 1.92s]
15:21:45 55 of 85 SKIP relation dbt_synthea_dev.source_to_concept_map ................... [SKIP]
15:21:45 56 of 85 START sql table model dbt_synthea_dev.cdm_source ...................... [RUN]
15:21:46 56 of 85 OK created sql table model dbt_synthea_dev.cdm_source ................. [OK in 1.84s]
15:21:46 57 of 85 START sql table model dbt_synthea_dev.vocabulary ...................... [RUN]
15:21:48 57 of 85 OK created sql table model dbt_synthea_dev.vocabulary ................. [OK in 1.79s]
15:21:48 58 of 85 START sql table model dbt_synthea_dev.int__all_visits ................. [RUN]
15:22:16 58 of 85 OK created sql table model dbt_synthea_dev.int__all_visits ............ [OK in 28.17s]
15:22:16 59 of 85 START sql table model dbt_synthea_dev.person .......................... [RUN]
15:22:19 59 of 85 OK created sql table model dbt_synthea_dev.person ..................... [OK in 2.42s]
15:22:19 60 of 85 START sql table model dbt_synthea_dev.int__encounter_provider ......... [RUN]
15:22:26 60 of 85 OK created sql table model dbt_synthea_dev.int__encounter_provider .... [OK in 6.78s]
15:22:26 61 of 85 START sql table model dbt_synthea_dev.int__drug_immunisations ......... [RUN]
15:22:27 61 of 85 OK created sql table model dbt_synthea_dev.int__drug_immunisations .... [OK in 1.86s]
15:22:27 62 of 85 START sql table model dbt_synthea_dev.int__drug_medications ........... [RUN]
15:22:43 62 of 85 OK created sql table model dbt_synthea_dev.int__drug_medications ...... [OK in 15.90s]
15:22:43 63 of 85 START sql table model dbt_synthea_dev.int__observation_allergies ...... [RUN]
15:22:46 63 of 85 OK created sql table model dbt_synthea_dev.int__observation_allergies . [OK in 2.41s]
15:22:46 64 of 85 START sql table model dbt_synthea_dev.int__observation_conditions ..... [RUN]
15:22:51 64 of 85 OK created sql table model dbt_synthea_dev.int__observation_conditions [OK in 5.41s]
15:22:51 65 of 85 START sql table model dbt_synthea_dev.int__observation_observations ... [RUN]
15:22:57 65 of 85 OK created sql table model dbt_synthea_dev.int__observation_observations [OK in 6.05s]
15:22:57 66 of 85 START sql table model dbt_synthea_dev.int__assign_all_visit_ids ....... [RUN]
15:23:14 66 of 85 OK created sql table model dbt_synthea_dev.int__assign_all_visit_ids .. [OK in 17.05s]
15:23:14 67 of 85 START sql table model dbt_synthea_dev.death ........................... [RUN]
15:23:17 67 of 85 OK created sql table model dbt_synthea_dev.death ...................... [OK in 2.77s]
15:23:17 68 of 85 START sql table model dbt_synthea_dev.observation_period .............. [RUN]
15:23:20 68 of 85 OK created sql table model dbt_synthea_dev.observation_period ......... [OK in 2.63s]
15:23:20 69 of 85 START sql table model dbt_synthea_dev.payer_plan_period ............... [RUN]
15:23:52 69 of 85 OK created sql table model dbt_synthea_dev.payer_plan_period .......... [OK in 32.26s]
15:23:52 70 of 85 START sql table model dbt_synthea_dev.int__final_visit_ids ............ [RUN]
15:24:03 70 of 85 OK created sql table model dbt_synthea_dev.int__final_visit_ids ....... [OK in 10.63s]
15:24:03 71 of 85 START sql table model dbt_synthea_dev.condition_occurrence ............ [RUN]
15:24:15 71 of 85 OK created sql table model dbt_synthea_dev.condition_occurrence ....... [OK in 12.92s]
15:24:15 72 of 85 START sql table model dbt_synthea_dev.device_exposure ................. [RUN]
15:24:21 72 of 85 OK created sql table model dbt_synthea_dev.device_exposure ............ [OK in 5.36s]
15:24:21 73 of 85 START sql table model dbt_synthea_dev.drug_exposure ................... [RUN]
15:25:27 73 of 85 OK created sql table model dbt_synthea_dev.drug_exposure .............. [OK in 66.43s]
15:25:27 74 of 85 START sql table model dbt_synthea_dev.measurement ..................... [RUN]
15:26:11 74 of 85 OK created sql table model dbt_synthea_dev.measurement ................ [OK in 43.41s]
15:26:11 75 of 85 START sql table model dbt_synthea_dev.observation ..................... [RUN]
15:26:38 75 of 85 OK created sql table model dbt_synthea_dev.observation ................ [OK in 27.04s]
15:26:38 76 of 85 START sql table model dbt_synthea_dev.procedure_occurrence ............ [RUN]
15:26:47 76 of 85 OK created sql table model dbt_synthea_dev.procedure_occurrence ....... [OK in 8.86s]
15:26:47 77 of 85 START sql table model dbt_synthea_dev.visit_detail .................... [RUN]
15:27:10 77 of 85 OK created sql table model dbt_synthea_dev.visit_detail ............... [OK in 23.57s]
15:27:10 78 of 85 START sql table model dbt_synthea_dev.visit_occurrence ................ [RUN]
15:27:33 78 of 85 OK created sql table model dbt_synthea_dev.visit_occurrence ........... [OK in 23.03s]
15:27:33 79 of 85 START sql table model dbt_synthea_dev.condition_era ................... [RUN]
15:27:40 79 of 85 OK created sql table model dbt_synthea_dev.condition_era .............. [OK in 7.18s]
15:27:40 80 of 85 START sql table model dbt_synthea_dev.drug_era ........................ [RUN]
15:28:43 80 of 85 OK created sql table model dbt_synthea_dev.drug_era ................... [OK in 62.78s]
15:28:43 81 of 85 SKIP relation dbt_synthea_dev.int__cost_condition ..................... [SKIP]
15:28:43 82 of 85 START sql table model dbt_synthea_dev.int__cost_drug_exposure_1 ....... [RUN]
15:28:48 82 of 85 OK created sql table model dbt_synthea_dev.int__cost_drug_exposure_1 .. [OK in 4.84s]
15:28:48 83 of 85 START sql table model dbt_synthea_dev.int__cost_drug_exposure_2 ....... [RUN]
15:29:21 83 of 85 OK created sql table model dbt_synthea_dev.int__cost_drug_exposure_2 .. [OK in 32.66s]
15:29:21 84 of 85 START sql table model dbt_synthea_dev.int__cost_procedure ............. [RUN]
15:29:29 84 of 85 OK created sql table model dbt_synthea_dev.int__cost_procedure ........ [OK in 7.95s]
15:29:29 85 of 85 SKIP relation dbt_synthea_dev.cost .................................... [SKIP]
15:29:29
15:29:29 Finished running 85 table models in 0 hours 15 minutes and 31.08 seconds (931.08s).
15:29:29
15:29:29 Completed with 2 errors and 0 warnings:
15:29:29
15:29:29 Runtime Error in model stg_synthea__claims_transactions (models/staging/synthea/stg_synthea__claims_transactions.sql)
Out of Memory Error: failed to offload data block of size 256.0 KiB (16383.9 PiB/1.2 TiB used).
This limit was set by the 'max_temp_directory_size' setting.
By default, this setting utilizes the available disk space on the drive where the 'temp_directory' is located.
You can adjust this setting, by using (for example) PRAGMA max_temp_directory_size='10GiB'
15:29:29
15:29:29 Runtime Error in model stg_vocabulary__source_to_concept_map (models/staging/vocabulary/stg_vocabulary__source_to_concept_map.sql)
Parser Error: SELECT clause without selection list
15:29:29
15:29:29 Done. PASS=80 WARN=0 ERROR=2 SKIP=3 TOTAL=85
For reference, here are the sizes of the files I used:
146M allergies.csv
354M careplans.csv
28G claims.csv
136G claims_transactions.csv
3.4G conditions.csv
19M devices.csv
9.8G encounters.csv
11G imaging_studies.csv
225M immunizations.csv
11G medications.csv
11G observations.csv
1.3M organizations.csv
317M patients.csv
2.5K payers.csv
8.3G payer_transitions.csv
1.7G procedures.csv
11M providers.csv
183M supplies.csv
---
total 218G
@lawrenceadams
So there are two approaches:
- we could either modify the loader script to glob files, or...
- more simply - given duckdb doesn't need files loaded before it can use them (@katy-sadowski might correct me - but my understanding as to why we have a step where we actively load files into the pipeline is to keep the source.yml files identical between database systems, so that e.g. Postgres/DuckDB/etc. can use the same code), you could modify a sources.yml file to incorporate something like the below: [...]
Does that make sense? It blew my mind the first time I realised I could do this! Makes working with partitioned CSV/Parquet files so much easier!
It does because, looking at my above comment, I now realize the problem is not so much the loading of data as the ETL step -- I was unclear where the error was coming from, but I have narrowed it down above, where you can see the ETL'ing fail.
So, what I would like is to load multiple patient files, chunked however I want (say, 100K patients per data generation), and then ETL those patients per data generation into a final duckdb file that continues to grow as I keep adding new patients. As it stands right now, my understanding is that it is one shot only: I cannot add new patients to an old duckdb database made with dbt-synthea after I run dbt run initially.
Does that help frame my problem better?
Yeah this makes sense - I think a problem is that in this case the duckdb query planner fails and cannot figure out what is going on anymore:
Out of Memory Error: failed to offload data block of size 256.0 KiB (16383.9 PiB/1.2 TiB used).
Looks like the query planner has gone berserk there - unsure what is going on... Could you share what platform you're on and how much RAM you have?
A fun problem!
I'm going to try tomorrow - let me know if you have any breakthroughs!
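For reference, the settings the error message points at can be adjusted in the DuckDB session that holds the project database; a minimal sketch is below, and the values are placeholders to adapt, not tested recommendations.

-- cap DuckDB's in-memory usage and give it a roomy scratch location for spilling
SET memory_limit = '48GB';
SET temp_directory = '/path/with/space/duckdb_tmp';
-- raise the cap named in the error (it defaults to the free space on the temp drive)
PRAGMA max_temp_directory_size = '200GiB';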
Could you share what platform you're on and how much RAM you have?
Fedora 39 (Desktop) RAM: 64GB Available
Looks like the query planner has gone berserk there - unsure what is going on...
Out of curiosity, is my intuition here -- that chunking up how many patients I ETL each time would help -- a good thought? Or is this problem unrelated to the solution I sketched out above?
I'd have thought that would be enough! Interesting...
Yes that's definitely a way of doing it! Two ways of doing this spring to my mind:
- Export each OMOP table (final models) at the end of each run to file e.g. parquet, and then stitch them together at the end / leave as partitioned files.
- The more logical approach would be to use incremental models and dbt run 10 times with the sources being different each time.
Not sure if either is super ideal but it is possible!
One thing I hadn't appreciated is that every model gets materialised as a table - in reality, especially when using huge source sizes like we are now, it might be best to set the staging models (which do nothing but project) to materialize as views:
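Something like this at the top of each staging model (or the equivalent folder-level setting in dbt_project.yml) - a sketch of the idea only, since the real staging models rename and cast columns rather than select *:

{{ config(materialized='view') }}

-- staging models only project/rename columns, so materializing them as views
-- avoids keeping a second on-disk copy of the raw data
select *
from {{ source('synthea', 'observations') }}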
Otherwise we will have 4 copies of the data (loaded / staged / intermediate / omop)!
@TheCedarPrince I got a bit further and then realised I had forgotten to seed in the vocabulary!
It did manage to get stg_synthea__claims_transactions to work though - unsure how best to fix!
dbt run --fail-fast
17:53:10 Running with dbt=1.8.7
17:53:11 Registered adapter: duckdb=1.8.0
17:53:11 Found 86 models, 29 seeds, 425 data tests, 29 sources, 538 macros
17:53:11
17:53:11 Concurrency: 1 threads (target='dev')
17:53:11
17:53:11 1 of 86 START sql table model dbt_synthea_dev.dose_era ......................... [RUN]
17:53:13 1 of 86 OK created sql table model dbt_synthea_dev.dose_era .................... [OK in 1.73s]
17:53:13 2 of 86 START sql table model dbt_synthea_dev.episode .......................... [RUN]
17:53:15 2 of 86 OK created sql table model dbt_synthea_dev.episode ..................... [OK in 1.80s]
17:53:15 3 of 86 START sql table model dbt_synthea_dev.episode_event .................... [RUN]
17:53:16 3 of 86 OK created sql table model dbt_synthea_dev.episode_event ............... [OK in 1.63s]
17:53:16 4 of 86 START sql table model dbt_synthea_dev.fact_relationship ................ [RUN]
17:53:18 4 of 86 OK created sql table model dbt_synthea_dev.fact_relationship ........... [OK in 1.61s]
17:53:18 5 of 86 START sql table model dbt_synthea_dev.metadata ......................... [RUN]
17:53:20 5 of 86 OK created sql table model dbt_synthea_dev.metadata .................... [OK in 1.81s]
17:53:20 6 of 86 START sql table model dbt_synthea_dev.note ............................. [RUN]
17:53:21 6 of 86 OK created sql table model dbt_synthea_dev.note ........................ [OK in 1.63s]
17:53:21 7 of 86 START sql table model dbt_synthea_dev.note_nlp ......................... [RUN]
17:53:23 7 of 86 OK created sql table model dbt_synthea_dev.note_nlp .................... [OK in 1.64s]
17:53:23 8 of 86 START sql table model dbt_synthea_dev.specimen ......................... [RUN]
17:53:25 8 of 86 OK created sql table model dbt_synthea_dev.specimen .................... [OK in 1.78s]
17:53:25 9 of 86 START sql table model dbt_synthea_dev.stg_map__states .................. [RUN]
17:53:26 9 of 86 OK created sql table model dbt_synthea_dev.stg_map__states ............. [OK in 1.64s]
17:53:26 10 of 86 START sql table model dbt_synthea_dev.stg_synthea__allergies .......... [RUN]
17:53:30 10 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__allergies ..... [OK in 3.08s]
17:53:30 11 of 86 START sql table model dbt_synthea_dev.stg_synthea__careplans .......... [RUN]
17:53:33 11 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__careplans ..... [OK in 3.35s]
17:53:33 12 of 86 START sql table model dbt_synthea_dev.stg_synthea__claims ............. [RUN]
17:54:25 12 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__claims ........ [OK in 52.08s]
17:54:25 13 of 86 START sql table model dbt_synthea_dev.stg_synthea__claims_transactions [RUN]
17:58:35 13 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__claims_transactions [OK in 249.68s]
17:58:35 14 of 86 START sql table model dbt_synthea_dev.stg_synthea__conditions ......... [RUN]
17:58:40 14 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__conditions .... [OK in 5.35s]
17:58:40 15 of 86 START sql table model dbt_synthea_dev.stg_synthea__devices ............ [RUN]
17:58:43 15 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__devices ....... [OK in 2.92s]
17:58:43 16 of 86 START sql table model dbt_synthea_dev.stg_synthea__encounters ......... [RUN]
17:58:52 16 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__encounters .... [OK in 8.92s]
17:58:52 17 of 86 START sql table model dbt_synthea_dev.stg_synthea__imaging_studies .... [RUN]
17:59:12 17 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__imaging_studies [OK in 19.78s]
17:59:12 18 of 86 START sql table model dbt_synthea_dev.stg_synthea__immunizations ...... [RUN]
17:59:15 18 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__immunizations . [OK in 3.68s]
17:59:15 19 of 86 START sql table model dbt_synthea_dev.stg_synthea__medications ........ [RUN]
17:59:23 19 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__medications ... [OK in 7.71s]
17:59:23 20 of 86 START sql table model dbt_synthea_dev.stg_synthea__observations ....... [RUN]
18:00:08 20 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__observations .. [OK in 44.97s]
18:00:08 21 of 86 START sql table model dbt_synthea_dev.stg_synthea__organizations ...... [RUN]
18:00:11 21 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__organizations . [OK in 3.00s]
18:00:11 22 of 86 START sql table model dbt_synthea_dev.stg_synthea__patients ........... [RUN]
18:00:15 22 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__patients ...... [OK in 3.84s]
18:00:15 23 of 86 START sql table model dbt_synthea_dev.stg_synthea__payer_transitions .. [RUN]
18:00:23 23 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__payer_transitions [OK in 7.10s]
18:00:23 24 of 86 START sql table model dbt_synthea_dev.stg_synthea__payers ............. [RUN]
18:00:26 24 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__payers ........ [OK in 2.98s]
18:00:26 25 of 86 START sql table model dbt_synthea_dev.stg_synthea__procedures ......... [RUN]
18:00:34 25 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__procedures .... [OK in 8.31s]
18:00:34 26 of 86 START sql table model dbt_synthea_dev.stg_synthea__providers .......... [RUN]
18:00:37 26 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__providers ..... [OK in 3.17s]
18:00:37 27 of 86 START sql table model dbt_synthea_dev.stg_synthea__supplies ........... [RUN]
18:00:40 27 of 86 OK created sql table model dbt_synthea_dev.stg_synthea__supplies ...... [OK in 3.35s]
18:00:40 28 of 86 START sql table model dbt_synthea_dev.stg_vocabulary__concept ......... [RUN]
18:00:40 28 of 86 ERROR creating sql table model dbt_synthea_dev.stg_vocabulary__concept [ERROR in 0.06s]
18:00:40 29 of 86 START sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor [RUN]
18:00:41 CANCEL query model.synthea_omop_etl.stg_vocabulary__concept_ancestor ........... [CANCEL]
18:00:41 29 of 86 ERROR creating sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor [ERROR in 0.04s]
18:00:41
18:00:41 Runtime Error in model stg_vocabulary__concept (models/staging/vocabulary/stg_vocabulary__concept.sql)
  Parser Error: SELECT clause without selection list
18:00:41
18:00:41 Finished running 86 table models in 0 hours 7 minutes and 29.65 seconds (449.65s).
18:00:41
18:00:41 Completed with 59 errors and 0 warnings:
18:00:41
18:00:41 Runtime Error in model stg_vocabulary__concept (models/staging/vocabulary/stg_vocabulary__concept.sql)
  Parser Error: SELECT clause without selection list
18:00:41
- Export each OMOP (final models) at the end of each run to file e.g. parquet; and then stitch them together at the end / leave as partitioned files per se. [This is very janky - but I suppose one approach]
This actually is really quite reasonable for a naive and quick approach. I'll try that at some point probably tomorrow. As you know, it would have drawbacks of duplicating lots of information -- I immediately think of the vocab tables.
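For reference, a rough sketch of what that export-and-stitch step could look like in DuckDB (the table and file names here are illustrative, not part of the project):

-- after running dbt against one chunk, export a final model to parquet
copy (select * from dbt_synthea_dev.observation)
  to 'chunks/observation_chunk_01.parquet' (format parquet);

-- later, all chunks can be read back as a single relation
select * from read_parquet('chunks/observation_chunk_*.parquet');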
The more logical approach would be to use incremental models and dbt run 10 times with the sources being different each time
Hunh! Not sure what is involved in that approach but seems interesting!
Otherwise we will have 4 copies of the data (loaded / staged / intermediate / omop)!
AH! That's why my persistent DuckDB database file keeps getting so large! Makes sense.
18:00:41 CANCEL query model.synthea_omop_etl.stg_vocabulary__concept_ancestor ........... [CANCEL]
18:00:41 29 of 86 ERROR creating sql table model dbt_synthea_dev.stg_vocabulary__concept_ancestor [ERROR in 0.04s]
18:00:41
18:00:41 Runtime Error in model stg_vocabulary__concept (models/staging/vocabulary/stg_vocabulary__concept.sql)
  Parser Error: SELECT clause without selection list
18:00:41
18:00:41 Finished running 86 table models in 0 hours 7 minutes and 29.65 seconds (449.65s).
18:00:41
18:00:41 Completed with 59 errors and 0 warnings:
18:00:41
18:00:41 Runtime Error in model stg_vocabulary__concept (models/staging/vocabulary/stg_vocabulary__concept.sql)
  Parser Error: SELECT clause without selection list
18:00:41
Yea, I am unsure what this means as well...
Runtime Error in model stg_vocabulary__source_to_concept_map (models/staging/vocabulary/stg_vocabulary__source_to_concept_map.sql) Parser Error: SELECT clause without selection list
@TheCedarPrince can you check the compiled SQL for this?
Regarding running dbt in batches, we need to be careful because certain tables are "shared" across patients like care_site, location, and provider. I'm guessing dbt's incremental mode would handle this sort of thing? But I haven't looked into it much. My overall take is I'd prefer to avoid these sorts of approaches if at all possible and rather understand:
As I understand it, incremental mode was designed for huge datasets with the need for very frequent updates, which is not the typical use case for OMOP CDMs.
Finally, regarding the duplication of data across stage/int/marts - that's what I hope to address in #38 . Keeping all those tables around is only useful for debugging purposes while developing the ETL.
Re: incremental, you're right: it's really made for updates every n hours/days etc.
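For anyone wondering what the incremental-model approach would roughly look like, here is a minimal sketch; the model, key column, and filter are assumptions for illustration and not part of dbt-synthea today:

{{ config(materialized='incremental', unique_key='observation_id') }}

select *
from {{ ref('int__observation_observations') }}  -- hypothetical upstream model
{% if is_incremental() %}
-- on re-runs (e.g. after pointing the sources at the next chunk of CSVs),
-- only insert rows whose key is not already in the existing table
where observation_id not in (select observation_id from {{ this }})
{% endif %}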
If duckdb is unable to cope then we'd probably need a different database engine in this instance, but this does seem like an extreme stress test/case.
I'd test on Postgres but it'd almost definitely run out of space
I have (with some tuning) eventually managed to get every table to build... except observations, which sadly runs out of memory for me.
observations.sql has a lot of complex joins using big tables, and looking through the docs, this is the kind of thing that can upset duckdb.
Here is the query plan for it for reference:
(Summary of the DuckDB physical plan: the final PROJECTION of the observation columns (~319M estimated rows) sits on top of a ROW_NUMBER() window ordered by person_id, fed by an INNER HASH_JOIN against person on patient_id = person_source_value (~1.1M rows); that input is itself built from LEFT HASH_JOINs against int__encounter_provider (~58M rows) and int__final_visit_ids (~56M rows) over a UNION of int__observation_observations (~224M rows), int__observation_conditions (~29M rows), and int__observation_allergies (~0.7M rows).)
It seems like DuckDB will try to spill to disk where it can, including for operations such as joins/orders/windowing - but the docs note that if you have many of these it may fail to use the disk - which is what I imagine is happening here!
I had some success running in Snowflake (XS Warehouse) - the same queries causing it to fail locally are causing large spills to local warehouse storage. In total it took about 20 minutes to run the whole project (without running tests) (~0.3 credits, ~$0.80 pre-tax).
These are the queries with their metrics if of interest! Metrics (CSV)
I will try to dig in and see what the worst offenders are, but it is measurements/observations at a glance; both of which have large joins with few predicates as far as I can see!
Runtime Error in model stg_vocabulary__source_to_concept_map (models/staging/vocabulary/stg_vocabulary__source_to_concept_map.sql) Parser Error: SELECT clause without selection list
I realized what this is. The vocab download doesn't come with a source_to_concept_map (duh). I'll file a ticket to handle this table in BYO vocab mode.
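One possible way to handle it in BYO-vocab mode (a sketch of the idea only, not necessarily the fix that will land) would be to build the staging model as an empty table with the expected OMOP columns:

select
    cast(null as varchar) as source_code,
    cast(null as integer) as source_concept_id,
    cast(null as varchar) as source_vocabulary_id,
    cast(null as varchar) as source_code_description,
    cast(null as integer) as target_concept_id,
    cast(null as varchar) as target_vocabulary_id,
    cast(null as date)    as valid_start_date,
    cast(null as date)    as valid_end_date,
    cast(null as varchar) as invalid_reason
from (select 1) as placeholder
where 1 = 0  -- zero rows; keeps downstream references compiling when no file is provided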
Also, I'm looking at the observation SQL. I'm surprised to see there's nothing crazy going on in there. This AND might not be needed, and we could potentially fold that encounters/providers mapper into the visit int table to skip a join.
@lawrenceadams , what were the runtimes for the other models that succeeded / which were the slowest?
Apologies @katy-sadowski I ran out of time to dig in to the results over the weekend!
These are the 5 slowest models, and generally the ones that needed the most out-of-memory processing (arguably a moot point as we could reduce the number of threads [threads=8] / use a bigger warehouse - but still interesting).
Query ID | Model | Total Elapsed Time (ms) | Bytes Scanned (GB) | % Scanned from Cache | Bytes Written to Result (GB) | Rows Produced | Rows Written to Result | Bytes (GB) Spilled to Local Storage | Execution Time (ms)
---|---|---|---|---|---|---|---|---|---
01b7f2aa-0000-673b-0000-f5a10005ea16 | observation | 754,447 (~12.5 min) | 4.1 GB | 92.30% | 3.3 GB | 253,447,438 | 1 | 35.693396 | 753,980 ms (~12.5 min)
01b7f2aa-0000-674d-0000-f5a100063052 | visit_detail | 370,289 (~6.2 min) | 3.4 GB | 82% | 2.0 GB | 54,933,090 | 1 | 16.240853 | 369,871 ms (~6.2 min)
01b7f2aa-0000-6713-0000-f5a100062362 | visit_occurrence | 368,723 (~6.1 min) | 3.4 GB | 83.60% | 1.9 GB | 54,933,090 | 1 | 15.9126733 | 368,252 ms (~6.1 min)
01b7f2aa-0000-673b-0000-f5a10005ea1e | procedure_occurrence | 345,095 (~5.8 min) | 3.9 GB | 57.70% | 1.3 GB | 45,071,224 | 1 | 20.5998039 | 344,142 ms (~5.8 min)
01b7f2b0-0000-6714-0000-f5a1000614a6 | int__cost_drug_exposure_2 | 319,436 (~5.3 min) | 3.8 GB | 79.20% | 0.8 GB | 53,469,529 | 1 | 10.4926577 | 318,674 ms (~5.3 min)
Looking at the most expensive nodes in observation:
This join is the most expensive, followed by the one you linked to, followed by the window function
I agree @katy-sadowski, that second join predicate looks redundant - although interestingly, running that model by itself with and without the join on the patient ID results in 79 fewer rows being returned (253,447,517 [join on encounter_id only] - 253,447,438 [join on both encounter and patient ID] = 79). I'll need to inspect why!
When running the models by themselves they're much faster (only tried observation, but it took 2 minutes instead of 12) - at some point I'll run on Snowflake with one thread to see what happens; it will be more useful!
Interesting thanks @lawrenceadams !
This join is the most expensive, followed by the one you linked to, followed by the window function
In this case I wonder if it would be faster to move the joins down into the individual intermediate tables.
interestingly when running that model by itself with and without the join on the patient ID results in 79 fewer rows being returned (253447517 [join on encounter_id] - 253447438 [join on both encounter and patient ID] = 79)
weird - agree that should be checked! are there encounters missing a patient ID in your source data? (there aren't any in the seed dataset). if so we probably want to handle that scenario explicitly in the ETL.
When running the model by itself they're much faster (Only done on observation but took 2 minutes instead of 12) - at some point I'll run on snowflake with one thread to see what happens; it will be more useful!
you mean dbt run --model observation?
In this case I wonder if it would be faster to move the joins down into the individual intermediate tables.
Quite possibly, from memory there is no filtering that happens at any point so we'd have to pay the price at some point. Maybe worth trying!
When I've done this on real EHR data I tend to have these joins consolidated before the end, and use the final models to do things like global transformations (e.g. join vocabulary where needed) - but we only do that once. I wonder how others in this space handle it!
weird - agree that should be checked! are there encounters missing a patient ID in your source data? (there aren't any in the seed dataset). if so we probably want to handle that scenario explicitly in the ETL.
Great shout - I did this and there are 79 cases which violate this assumption! No idea how this happens as I can't imagine it's a feature of Synthea?
It might be worth re-building the dataset and seeing if it happens for others. Can others replicate?
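For anyone replicating, the check is roughly the following (column names follow the raw Synthea CSV layout and are assumptions on my part; adjust paths and casing as needed):

-- find observation rows whose patient doesn't match the patient recorded on their encounter
select o.*
from read_csv_auto('/path/to/synthea/output/observations/*.csv', header = true) as o
join read_csv_auto('/path/to/synthea/output/encounters/*.csv', header = true) as e
  on o.encounter = e.id
where o.patient <> e.patient;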
you mean dbt run --model observation?
in effect yep, I re-ran it with dbt run --threads 1 just now and it was similar (measurement, observation, int__assign_all_visit_ids, int__cost_drug_exposure_2, and int__cost_procedure were the top 5 offenders!)
When I've done this on real EHR data I tend to have these joins consolidated before the end, and use the final models to do things like global transformations (e.g. join vocabulary where needed) - but we only do that once. I wonder how others in this space handle it!
Maybe a good discussion topic for our next meeting.
It might be worth re-building the dataset and seeing if it happens for others. Can others replicate?
I will try. I'm guessing it's a bug in Synthea (or maybe a weird feature like this one). Once I've got my mega dataset I can start experimenting with some of these optimization ideas too :)
I can try replicating! What do you want me to try doing @lawrenceadams ?
Maybe a good discussion topic for our next meeting.
Great idea!
I can try replicating! What do you want me to try doing @lawrenceadams ?
Amazing!! Did you manage to make a huge synthea dataset? Is it possible to check if encounters have a different patient ID attached to them, as above?
I can try replicating! What do you want me to try doing @lawrenceadams ?
Amazing!! Did you manage to make a huge synthea dataset? Is it possible to check if encounters have a different patient ID attached to them, as above?
So you want me to generate a huge synthea dataset? Would 1 million patients with 3 or 5 year retrospective be enough to be "huge" for you @lawrenceadams?
I can generate that and then check out the different patient ID. :)
Hey @katy-sadowski,
I ended up opening an issue as Discussions is not enabled for this repo.
Wanted to share some notes on my experiences with you on using this solution for 1 million patients, each with a 5 year retrospective look back:
- Ran into cffi being needed -- it might be missing from the dependencies somehow (https://github.com/OHDSI/dbt-synthea/issues/88).
- Where the dbt file should go was a little vague.
- dbt run fails on large tables in the OMOP Schema -- especially those tables which have a large lineage.
I am going to reproduce these errors tomorrow and actually generate 1 million patients from scratch using Synthea and re-run the pipeline to find all the tables that fail to be built.
Hope this helps folks!
~ tcp :deciduous_tree: