databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

Set unique table suffix to allow parallel incremental execution #810

Closed: huangxingyi-git closed this 1 month ago

huangxingyi-git commented 1 month ago

Describe the feature

For some specific cases (e.g. backfilling a very large amount of data), we need to execute multiple dbt runs of a specific incremental (replace_where) model in parallel, passing the date (or country) as a var argument. For example, we have a model we run every day using Airflow, for which we pass a date relative to the Airflow schedule. FYI https://github.com/dbt-labs/dbt-athena/pull/650/files
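As an illustration, here is a minimal sketch of such a model, assuming a replace_where incremental model filtered by a date passed via `--vars` (the model, source, and `run_date` names are hypothetical, not taken from this issue):

```sql
-- models/my_events.sql (hypothetical model; `run_date` is an illustrative var)
{{
    config(
        materialized='incremental',
        incremental_strategy='replace_where',
        incremental_predicates=["event_date = '" ~ var('run_date') ~ "'"]
    )
}}

select *
from {{ source('raw', 'events') }}
where event_date = '{{ var("run_date") }}'
```

Each Airflow task would then invoke something like `dbt run --select my_events --vars '{run_date: 2024-01-01}'` for its own date.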

If we want to process batches of N days in parallel using Airflow concurrency, we need the tmp table created by each dbt run to be unique. Otherwise, you end up with N inserts attempting to run against the same __dbt_tmp relation, creating conflicts and ultimately failures.
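One possible workaround today, sketched under the assumption that the incremental materialization resolves its temp relation through the dispatched make_temp_relation macro (as in the spark adapter this one builds on): a project-level override could append the run's invocation_id so each concurrent run gets its own suffix. This is illustrative only, not the adapter's current behavior nor necessarily the change in the linked PR:

```sql
-- macros/make_temp_relation.sql (illustrative override, assuming dispatch picks up
-- a root-project databricks__make_temp_relation implementation)
{% macro databricks__make_temp_relation(base_relation, suffix='__dbt_tmp') %}
    {# Append the invocation_id so concurrent runs of the same model do not
       collide on a single <model>__dbt_tmp relation. #}
    {% set unique_suffix = suffix ~ '_' ~ (invocation_id | replace('-', '')) %}
    {% set tmp_identifier = base_relation.identifier ~ unique_suffix %}
    {% set tmp_relation = base_relation.incorporate(path={
        "identifier": tmp_identifier,
        "schema": None
    }) %}
    {% do return(tmp_relation) %}
{% endmacro %}
```

With something like this, N parallel runs would each build their own my_events__dbt_tmp_<invocation_id> relation instead of fighting over one name.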

Who will this benefit?

For those who use replace_where as the incremental strategy. Example use case: run the same incremental model concurrently with different --vars in order to insert multiple data partitions in parallel.

Are you interested in contributing this feature?

I am interested in contributing to this feature if needed.

benc-db commented 1 month ago

This issue is going to be solved more comprehensively with https://github.com/dbt-labs/dbt-core/discussions/10672; however, I'll take a look at your PR, and assuming it doesn't hurt any other use case, I'm not against taking this in the short term.