dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.68k stars 1.61k forks source link

[Feature] Create a flexible delete+insert incremental strategy without relying on primary/unique keys #10655

Open dlord opened 3 weeks ago

dlord commented 3 weeks ago

Is this your first time submitting a feature request?

Describe the feature

The delete+insert incremental strategy creates a temporary table with the new dataset, then it executes a DELETE statement using the temporary table as the filter condition. Below is an example of the generated statement:

delete
from
  "source_db"."dbt"."source_table" using "source_table__dbt_tmp173111887980" 
where
  (
    "source_table__dbt_tmp173111887980".event_time = "source_db"."dbt"."source_table".event_time 
  );

The current approach uses the unique_key specified in the model, and it probably works fine for most relational databases. For data warehouses like Redshift, this approach is very slow due to lack of primary and unique keys. As a result, the delete operation becomes very slow due to sequential scanning.

In order to work around this, I created a custom incremental strategy for a customer using the approach described here. The goal was to be able to provide a flexible way to specify custom predicates that works well with the underlying data and storage. The custom incremental strategy I wrote uses the existing incremental_predicates to build the delete statement. The custom incremental strategy ignores the unique_key to keep the implementation simple, and easy to switch to other strategies.

CC: @goodjira and @schmiman

Describe alternatives you've considered

Currently, if the merge strategy is set to delete+insert but no unique_key was specified, it is as if you used an append strategy. This behavior can be seen in the macro default__get_delete_insert_merge_sql. I am not sure if this was a deliberate decision.

As such, an alternative approach to creating a separate incremental strategy is to modify the existing default__get_delete_insert_merge_sql to allow users to not specify a unique_key, and only rely on the incremental_predicates.

Who will this benefit?

Users who are working with data warehouses like Redshift. Date columns that are used as sort keys can significantly benefit from this.

Here is an example: consider a model in Redshift that was created with the following config:

{{ config(materialized="incremental",
            incremental_strategy='delete_insert_no_unique_key',
            incremental_predicates = [
                "DBT_INTERNAL_DEST.event_time >= (select min(event_time) from DBT_INTERNAL_SOURCE)"
            ],
) }}

The resulting delete SQL will be this:

delete
from
  "source_db"."dbt"."source_table"
where
  (
    "source_db"."dbt"."source_table".event_time >= (select min(event_time) from "source_table__dbt_tmp173111887980")
  );

The performance difference in Redshift compared to a delete join is huge. I had one model where deleting 10M records took 30 minutes using the join approach. With the custom delete+insert strategy, this was reduced to 10 seconds due to the optimizations we can do in incremental_predicates

Are you interested in contributing this feature?

Yes

Anything else?

No response