dbt-labs / dbt-adapters

Apache License 2.0
17 stars 20 forks source link

[Bug] Delete+insert incremental strategy doesn't support incremental predicates when a list of unique keys is used #230

Open rlh1994 opened 1 month ago

rlh1994 commented 1 month ago

Is this a new bug?

Current Behavior

This is probably halfway between a bug and a feature, I can't find documented anywhere that this isn't supported, and the docs suggest it should be possible, but it's also not something that I guess has ever worked.

The delete+insert strategy when a list of unique keys is provided uses the delete from X using Y syntax and then attaches the predicates further down in the condition. However, because the predicate may reference a column that exists in both tables (which in most cases it would), this fails due to an ambiguous column reference (at least on redshift+postgres, i haven't yet tested other warehouses).

https://github.com/dbt-labs/dbt-adapters/blob/a2292c8ec76e4c9f03869ad95817a2ad82dfb34b/dbt/include/global_project/macros/materializations/models/incremental/merge.sql#L65:L77

The docs suggest that you can use DBT_INTERNAL_DEST and DBT_INTERNAL_SOURCE values for the merge strategy, but there is no equivalent for the delete+insert. This means it is not possible to use incremental predicates with the delete+insert strategy with a list-based unique key.

I do acknowledge that this is largely due to the fact that aliasing tables in a delete statement is in many cases not actually supported, however I believe there are potential solutions to this (none of which are perfect) discussed below.

Expected Behavior

I would expect the delete+insert strategy to support some way to use incremental predicates while using a list-based unique key field.

One way this could be achieved would be to still support the use of DBT_INTERNAL_DEST and DBT_INTERNAL_SOURCE in the incremental predicates, but to replace these with the absolute target and source table names at compile time in the macro before returning the SQL. This isn't ideal as you're actively editing the predicates that were provided, but would ensure a consistent behaviour between strategies.

Steps To Reproduce

  1. New project, add a seed file called my_seed.csv with the following data
    run,id,id2,start_tstamp
    1,1,1,2021-01-01 00:00:00
    1,2,2,2021-03-01 00:00:00
    1,2,2,2021-03-03 00:00:00
    1,3,3,2021-03-03 00:00:00
    1,4,4,2021-03-04 00:00:00
    2,1,1,2021-03-05 00:00:00
    2,2,2,2021-03-02 00:00:00
    2,3,3,2021-03-07 00:00:00
    2,5,5,2021-03-08 00:00:00
  2. Create a model called my_modelsql` with the following content

    {{
      config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        unique_key=['id', 'id2'],
        upsert_date_key='start_tstamp',
        incremental_predicates= ["start_tstamp > date '2021-01-01'"]
      )
    }}
    
    with data as (
      select * from {{ ref('my_seed') }}
    )
    
    {% if is_incremental() %}
    
      select
        id,
        id2,
        start_tstamp
    
      from data
      where run = 2
    
    {% else %}
    
      select
        id,
        id2,
        start_tstamp
    
      from data
      where run = 1
    
    {% endif %}
  3. dbt seed, dbt run, dbt run again. On the second run you should get the following error
    column reference "start_tstamp" is ambiguous
      LINE 17:                         and start_tstamp > date '2021-01-01'
                                           ^
      compiled Code at target/run/dbt_demo/models/my_models.sql

Relevant log output

No response

Environment

- OS: MacOSx
- Python: 3.9.14
- dbt-adapter: postgres 1.7.11 (although this issue likely applied in other warehouses)

Additional Context

No response