OHDSI / dbt-synthea

[Under development] A dbt ETL project to convert a Synthea synthetic data set into the OMOP CDM
https://ohdsi.github.io/dbt-synthea/
Apache License 2.0
16 stars 6 forks source link

Add cross-database macros where needed to ensure SQL queries work on all OHDSI-supported database platforms #39

Open katy-sadowski opened 5 months ago

katy-sadowski commented 5 months ago

Cross-database macros are offered by dbt to compile sql depending on the target DBMS: https://docs.getdbt.com/reference/dbt-jinja-functions/cross-database-macros

The goal of this issue is to assess what SQL statements in the project will not translate across all DBMS, and replace them with a cross-DB macro (if possible).

OHDSI supported platforms: https://ohdsi.github.io/Hades/supportedPlatforms.html

lawrenceadams commented 1 month ago

I've done a bit of playing to discover issues we may run into when not using duckdb/postgres. Here are some of my findings:

System Finding
SQLite Does not support multiple schemas (treats as separate physical databases) Differences in how views are required caused issues
Snowflake Unable to get very far as START is a protected keyword (used in a lot of synthea tables) and so issues thrown when using seed. → will probably need a custom data loader to get around this. More work needed.
SQL Server Found fiddly to get setup locally. Not yet successful - from memory due to keyword utilization but may be wrong.

Probably all easily overcome with a few hours - but not as straightforward as I would have thought! DuckDb is almost too friendly in comparison! 😆

vvcb commented 1 month ago

Getting dbt to work across database engines is not worth the hassle and you will quickly hit a tech debt ceiling.

Cross-database Synthea-OMOP transformation is already mostly solved here - https://github.com/lsc-sde/sqlmesh_synthea

@lawrenceadams , please take a look at https://sqlmesh.com/ and https://sqlglot.com/sqlglot.html

We are using SQLMesh for all new OMOP projects for this reason and many others. We build locally on DuckDb, and deploy to on-prem SQL Server and Azure Databricks because we are 'migrating' and are likely to be migrating for the next decade :-D. So, important for us to be able to manage a single codebase for multiple database engines and for several people to work on the same data warehouse/ lakehouse at the same time.

lawrenceadams commented 1 month ago

Yeah came to the same conclusion! Was an interesting bit of discovery!

Yep well aware of SQLMesh/Glot, is a nice tool!

vvcb commented 1 month ago

And, if we want to continue to use dbt with some cross-platform support, then targeting the engines that are as close to being ANSI-compliant as possible and refactoring any existing SQL to ANSI-SQL will be a good start.

One way to do this would be to take all the compiled code that dbt generates currently and transpiling it to ANSI using SQLGlot and looking at the diffs.

katy-sadowski commented 1 month ago

One way to do this would be to take all the compiled code that dbt generates currently and transpiling it to ANSI using SQLGlot and looking at the diffs.

I like the idea, would be a cool part of this exploration, and a great learning opportunity!

@lawrenceadams regarding Snowflake and the START keyword, would this help: https://docs.getdbt.com/reference/resource-configs/quote_columns

lawrenceadams commented 1 month ago

@lawrenceadams regarding Snowflake and the START keyword, would this help: https://docs.getdbt.com/reference/resource-configs/quote_columns

Excellent shout - this resolved that issue! I'll have more of a play now that's unblocked!