Open philippefutureboy opened 1 year ago
Thanks for opening this @philippefutureboy and providing such a nice write-up 👍
Preventing parallelism with your preferred strategy (Airflow Pools, etc.) is what we'd recommend in general, since dbt is designed for serial execution.
While it wouldn't be a high priority from our end, we'd still be amenable to considering either option you presented. One tricky thing to handle is staying within the maximum identifier length per dbt adapter.
I'm going to mark this as `help_wanted`.
Hey @dbeatty10!
Thanks for your answer! I totally understand that dbt is designed to be run serially. In my context, however, short of using the workflow manager's semaphore/lock implementation, serial execution isn't possible.
If there is sufficient interest on dbt Labs' side for this feature, I could give it a look.
In terms of maximum identifier length, which of the following is true?
Once that is settled, creating a macro for a unique suffix is not super complex, especially with Jinja2/Python available.
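To illustrate the identifier-length concern, here is a minimal sketch of the suffixing logic in Python. The helper name and the 63-character limit (Postgres's default) are assumptions for illustration, not dbt internals; each adapter would supply its own maximum.

```python
import uuid

# Assumed limit for illustration: Postgres's default identifier length.
# The real value would come from the active adapter.
MAX_IDENTIFIER_LENGTH = 63

def tmp_relation_name(base_name: str) -> str:
    """Build a collision-safe temp-table name that fits the identifier limit."""
    suffix = "__dbt_tmp_" + uuid.uuid4().hex  # 32 hex chars, unique per run
    full = base_name + suffix
    if len(full) > MAX_IDENTIFIER_LENGTH:
        # Trim the base name, never the suffix, so uniqueness is preserved
        base_name = base_name[: MAX_IDENTIFIER_LENGTH - len(suffix)]
        full = base_name + suffix
    return full
```

Two concurrent runs against the same model would then get distinct temp tables, so neither can drop or overwrite the other's.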
Is this your first time submitting a feature request?
Describe the feature
Feature
We would like the temporary tables dbt uses to merge into incremental models to either:
Context
We have a multi-tenant architecture where each tenant has its own separate materialization of each model (separated into different datasets in BigQuery). For our latest feature, we want to provide industry-wide analytics that compare our tenants to each other. To achieve this, we created an industry-wide incremental model with the `insert_overwrite` strategy, partitioned by a unique integer per tenant. Exact config block:

Conceptually, this is great, as it allows all our tenants to write to the same centralized table even though each company's data lives in its own tenant-scoped dataset. In practice, however, an issue arises when >=2 tenants attempt to write to the industry-scoped incremental model: the incremental macro implementation creates a temporary table, merges it into the incremental model, and then deletes the temporary table. The >=2 tenants end up conflicting with each other, either by deleting the temporary table before the other's operations are done or by overwriting the first tenant's data with the next one's.
The exact error produced by dbt is the following:
Hence the suggested solution is to change the suffix `__dbt_tmp` to `__dbt_tmp_{GUID/user-provided suffix}`. I think the GUID version would be way simpler to implement, and it would not break the Dependency Inversion Principle (SOLID).

Describe alternatives you've considered
For the time being, we'll be using Airflow's Pools to prevent parallelism in execution of the incremental insertions. That way there won't be any conflict between the various tenants, they will run their incremental insertions sequentially.
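For reference, a single-slot pool like the one described above can be set up from the CLI. This is a sketch assuming Airflow 2.x; the pool name is illustrative:

```shell
# Create a pool with exactly one slot so tasks assigned to it run serially
airflow pools set dbt_incremental 1 "serialize dbt incremental runs"
```

Each dbt task then opts into the pool via the `pool="dbt_incremental"` argument on its operator, which forces the incremental insertions to queue rather than run concurrently.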
EDIT: After implementing the alternative, we can confirm that this approach solves the issue.
Who will this benefit?
Anyone who may have multiple actors/processes appending to a model concurrently.
Are you interested in contributing this feature?
Yes, I think that would be doable if we go with the GUID solution. Every adapter's platform should have some form of GUID generation available, so the feature could be simple to implement.
Anything else?
:)