dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[CT-3211] [Feature] Globally unique temporary tables for incremental models #8837

Open philippefutureboy opened 1 year ago

philippefutureboy commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

Feature

We would like the temporary tables that dbt uses to merge into incremental models to either be suffixed with a globally unique identifier (GUID) or accept a user-provided suffix (see the suggested solution below).

Context

We have a multitenant architecture where each tenant has its own separate materialization of each model (separated into different datasets in BigQuery). For our latest feature, we want to provide industry-wide analytics that compare our different tenants to each other. To achieve this, we created an industry-wide incremental model with the insert_overwrite strategy, partitioned by a unique integer per tenant. Exact config block:

{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={
            'field': 'client_int_partition_key',
            'data_type': 'int64',
            'range': {
                'start': 1,
                'end': 4000,
                'interval': 1
            }
        },
    )
}}
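For illustration only, a minimal sketch of what the model body might look like under this setup. The var name, source name, and column list are hypothetical, not taken from our actual project; each tenant's run would pass its own partition key.

{# Hypothetical model body for the industry-wide incremental model.
   Assumes each tenant's dbt invocation passes its own key, e.g.
   dbt run --vars '{client_int_partition_key: 42}' #}
select
    {{ var('client_int_partition_key') }} as client_int_partition_key,
    t.*
from {{ source('tenant_analytics', 'some_source_table') }} as t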

Conceptually, this is great, as it allows all our tenants to write to the same centralized table even though each company's data lives in its own tenant-scoped dataset. In practice, however, an issue arises when two or more tenants attempt to write to the industry-scoped incremental model at the same time: the incremental macro implementation creates a temporary table, performs the merge into the incremental model, and then deletes the temporary table. The concurrent tenants end up conflicting with each other, either by deleting the temporary table before the other's operations are done or by overwriting the first tenant's staged data with the next tenant's.

The exact error produced by dbt is the following:

INFO - 23:50:46  Database Error in model {model_name} (path/to/model_name.sql)
INFO - 23:50:46    Destination deleted/expired during execution: {project}:{dataset}.{model_name}__dbt_tmp. at [20:3]
INFO - 23:50:46    compiled SQL at {target-path}/run/path/to/{model_name}.sql

Hence, the suggested solution is to change the suffix from __dbt_tmp to __dbt_tmp_{GUID/user-provided suffix}. I think the GUID version would be much simpler to implement, and it would not break the Dependency Inversion Principle (SOLID).
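As a rough illustration of the GUID variant, a project-level override of the macro that builds the temp relation could append dbt's invocation_id (a per-run UUID). This is only a sketch: it assumes dispatch resolves this override for the adapter in use, and it does not handle per-adapter identifier length limits.

{# Sketch only: a project-level override of dbt's default__make_temp_relation
   that appends the per-run invocation_id so concurrent runs use distinct
   temp tables. #}
{% macro default__make_temp_relation(base_relation, suffix='__dbt_tmp') %}
    {%- set unique_suffix = suffix ~ '_' ~ (invocation_id | replace('-', '')) -%}
    {%- set tmp_identifier = base_relation.identifier ~ unique_suffix -%}
    {%- set tmp_relation = base_relation.incorporate(path={"identifier": tmp_identifier}) -%}
    {{ return(tmp_relation) }}
{% endmacro %}

A user-provided suffix could work similarly (for example, read from a var), at the cost of each caller having to guarantee uniqueness itself.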

Describe alternatives you've considered

For the time being, we'll be using Airflow's Pools to prevent parallel execution of the incremental insertions. That way there won't be any conflict between the various tenants; they will run their incremental insertions sequentially.

EDIT: After implementing the alternative, I can confirm that this approach solves the issue.

Who will this benefit?

Anyone who may have multiple actors/processes appending to a model concurrently.

Are you interested in contributing this feature?

Yes, I think that would be doable if we settle on the GUID solution. All adapters must have some form of GUID I can use, so the feature could be simple to implement?

Anything else?

:)

dbeatty10 commented 1 year ago

Thanks for opening this @philippefutureboy and providing such a nice write-up 👍

Preventing parallelism with your preferred strategy (Airflow's Pools, etc.) is what we'd recommend in general, since dbt is designed for serial execution.

While it wouldn't be a high priority from our end, we'd still be amenable to considering either option you presented. One tricky thing to handle is staying within the maximum identifier length per dbt adapter.

I'm going to mark this as help_wanted.

philippefutureboy commented 1 year ago

Hey @dbeatty10!

Thanks for your answer! I totally understand that dbt is made to be run serially. In my context, though, barring the use of the workflow manager's semaphore/lock implementation, that's not possible.

If there is sufficient interest on dbt Labs' side for this feature, I could give it a look.

In terms of maximum identifier length, which of the following is true?

  1. The maximum identifier length is already exposed by each adapter;
  2. The maximum identifier length is static for each supported database engine and could be written as a constant per adapter;
  3. The maximum identifier length can be dynamic in one or more database engines, and needs to be retrieved via a custom macro for each adapter?

Once that is solved, creating a macro for a suffix is not super complex, especially with Jinja2/Python available.
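For example, a small helper along these lines could generate a shortened unique suffix and cap the final identifier at a given maximum length. The macro name and the max_length default below are illustrative assumptions, not any adapter's actual limit.

{# Sketch of a suffix helper; the name and max_length default are hypothetical. #}
{% macro unique_tmp_identifier(base_identifier, max_length=63) %}
    {%- set unique = (invocation_id | replace('-', ''))[:8] -%}
    {%- set suffix = '__dbt_tmp_' ~ unique -%}
    {%- set keep = max_length - (suffix | length) -%}
    {{ return(base_identifier[:keep] ~ suffix) }}
{% endmacro %}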