dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[CT-1356] [Feature] incremental models' merge conditional update #6077

Closed Yotamho closed 1 year ago

Yotamho commented 2 years ago

Is this your first time submitting a feature request?

Describe the feature

In some databases (namely Snowflake and Postgres), it is possible to add a condition to a merge update clause: when matched and <condition> then ... I want to allow adding this kind of condition to incremental models, by making it an incremental model configuration.
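
A minimal sketch of what such a configuration might look like in a model file (merge_update_condition is a hypothetical config name, not an existing dbt setting):

    {{ config(
        materialized='incremental',
        unique_key='name',
        merge_update_condition='DBT_INTERNAL_SOURCE.timestamp > DBT_INTERNAL_DEST.timestamp'
    ) }}

    select * from {{ ref('upstream_table') }}

The materialization would then render the condition into the when matched clause of the generated merge statement.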

Describe alternatives you've considered

An alternative I've considered would be to join the records in the incremental run with their destination records, check which rows fail the condition, and omit those rows. Such a solution would be less performant and would make the model more complicated to understand.
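
For reference, a rough sketch of that alternative, assuming a name unique key and a timestamp ordering column (the columns used in the example below):

    select src.*
    from {{ ref('upstream_table') }} as src
    {% if is_incremental() %}
    left join {{ this }} as dst
        on src.name = dst.name
    -- keep brand-new records, and existing records only when the incoming row is newer
    where dst.name is null
       or src.timestamp > dst.timestamp
    {% endif %}

The extra join against {{ this }} is exactly the cost the proposed merge condition would avoid.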

Who will this benefit?

For example, in a scenario where out-of-order data is arbitrarily inserted into a model's source table, this feature allows us to omit that data by using a condition like: DBT_INTERNAL_SOURCE.timestamp > DBT_INTERNAL_DEST.timestamp.

Are you interested in contributing this feature?

yes

Anything else?

I have forks with a suggested implementation: https://github.com/Yotamho/dbt-core https://github.com/Yotamho/dbt-snowflake

jtcohen6 commented 1 year ago

@Yotamho Thanks for opening!

This ask sounds to me a lot like the ask to pass additional custom "predicates" to the merge performed by dbt incremental models. There's been longstanding discussion and work in progress on that. See: https://github.com/dbt-labs/dbt-core/issues/3293, https://github.com/dbt-labs/dbt-core/issues/5680, https://github.com/dbt-labs/dbt-snowflake/issues/231, ...

There's a subtle distinction between:

merge into ...
    using ...
    on <unique_key_match> and <custom_condition> ...
    when matched then update ...
    when not matched then insert ...

and:

merge into ...
    using ...
    on <unique_key_match> ...
    when matched and <custom_condition> then update ...
    when not matched then insert ...

In the first case, if the unique key matches but the custom condition isn't met, the row will be inserted as a new record instead of updating an existing record. (Could lead to duplicates in the target table.)

In the second case, if the unique key matches, but the custom condition isn't met, that new row goes ... nowhere. Is that preferable behavior? Or risk greater confusion?


It may help to get concrete about specific use cases. Looking at yours:

For example, in a scenario where out-of-order data is arbitrarily inserted into a model's source table, this feature allows us to omit that data by using a condition like: DBT_INTERNAL_SOURCE.timestamp > DBT_INTERNAL_DEST.timestamp.

Is this different from the common logic that we ask users to stick in incremental models? Yes, it puts an implicit join within the incremental model logic instead, but I'd rather have it live in the model logic, where it can be more easily seen and debugged, than tucked away in the materialization logic / merge statement:

select * from {{ ref('upstream_table') }}
{% if is_incremental() %}
where timestamp_col > (select max(timestamp_col) as max_ts from {{ this }})
{% endif %}

Yotamho commented 1 year ago

Hi @jtcohen6, thanks for responding! I see the benefit and readability of sticking to the incremental model pattern, but in certain use cases I think the discussed enhancement is still needed. I'd like to demonstrate: let's say the model runs twice, on two batches of records:

# run 1:
# name, timestamp
john, 01:00
alex, 03:00

# run 2:
# name, timestamp
john, 01:00
alex, 03:00
john, 02:00

If name is the unique key, and we wish to update records when they arrive with a newer timestamp: if we simply base the incremental filter on the timestamp, we will ignore the updated "john" record (02:00 is older than the 03:00 already loaded). If we instead add the condition src.timestamp > dst.timestamp to the merge join condition, name will no longer act as a unique key and we will end up with two records for "john".
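
To make the failure mode concrete, here is roughly what run 2 does if the condition is placed in the on clause (table names are illustrative):

    merge into my_model as DBT_INTERNAL_DEST
        using batch_2 as DBT_INTERNAL_SOURCE
        on DBT_INTERNAL_SOURCE.name = DBT_INTERNAL_DEST.name
            and DBT_INTERNAL_SOURCE.timestamp > DBT_INTERNAL_DEST.timestamp
        when matched then update ...
        when not matched then insert ...

    -- (john, 02:00): key matches and 02:00 > 01:00 -> updates the existing row
    -- (john, 01:00): 01:00 > 01:00 is false -> treated as not matched -> inserted again
    -- (alex, 03:00): 03:00 > 03:00 is false -> treated as not matched -> inserted again

With the condition in the when matched clause instead, the two re-delivered rows would simply be skipped, which is the behavior I'm after.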

When data might come out of order, I can think of 2 implementations of incremental models (I'm sure there are more approaches):

  1. Using a column that doesn't guarantee order but does guarantee you don't miss data (e.g. a sequential id atomically generated upon first insertion into the database), and in addition using the ordering column (the timestamp column in the example above) as a "merge condition" that throws away an incoming record when it is older than the destination record; a sketch follows this list.
  2. Adding overlap between model runs: in the example above, also running on the hour before max_ts, and filtering out records using the same "merge condition" discussed (I think this is more applicable when you know the window within which out-of-order records should arrive).
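
A rough sketch of approach 1, reusing the hypothetical merge_update_condition config from my original request (seq_id is an assumed, atomically generated sequential column):

    {{ config(
        materialized='incremental',
        unique_key='name',
        merge_update_condition='DBT_INTERNAL_SOURCE.timestamp > DBT_INTERNAL_DEST.timestamp'
    ) }}

    select * from {{ ref('upstream_table') }}
    {% if is_incremental() %}
    where seq_id > (select max(seq_id) from {{ this }})
    {% endif %}

The seq_id filter guarantees no record is missed, while the merge condition guarantees that an older record never overwrites a newer one.
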
jtcohen6 commented 1 year ago

(Jumping in quickly because @dbeatty10 has pointed out to me that, where I said "subtle distinction" above, I had initially pasted the exact same code twice. Sorry about that!!)

jtcohen6 commented 1 year ago

(Doug - assigning both of us, just so that one of us writes a quick follow-up to the conversation we had on Monday!)

Yotamho commented 1 year ago

Hey @jtcohen6 @dbeatty10, just commenting to bring this back to your attention because I feel it was forgotten. Thanks!

tphillip33 commented 1 year ago

I am very interested in this.

My use case is that I only want to update when a column "hashdiff" has changed: when matched and old.hashdiff <> new.hashdiff then update

Without that ability, a self-join is required to identify the rows that actually changed. That costs extra processing, whereas adding this simple comparison would avoid it.
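
For illustration, the desired merge would look roughly like this (the table names and id key are assumptions, following the old/new aliases above):

    merge into my_table as old
        using my_table__dbt_tmp as new
        on old.id = new.id
        -- update only when the row content actually changed
        when matched and old.hashdiff <> new.hashdiff then update set ...
        when not matched then insert ...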

I am considering a custom materialization to work around this issue.

jtcohen6 commented 1 year ago

After giving this another read through, I see significant overlap between this issue, and one I just responded to yesterday: https://github.com/dbt-labs/dbt-core/issues/6415#issuecomment-1348827516

Two points that carry over in particular:

tphillip33 commented 1 year ago

Interesting. I will need to update to 1.3 and see if that will allow me to fix my problem.

tphillip33 commented 1 year ago

In reviewing this implementation: for my purposes, I think I would rather simply override default__get_merge_sql to accept a new config setting of "merge_update_only_when_columns_different" (or something similar) and append it to the when matched {merge_update_only_when_columns_different code here} then update set ... clause.
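
A simplified sketch of that override (the real default__get_merge_sql in dbt-core also handles composite unique keys and merge predicates, and its exact signature varies by dbt version; this only shows the shape of the change):

    {% macro default__get_merge_sql(target, source, unique_key, dest_columns, incremental_predicates=none) -%}
        {#- hypothetical config; this setting does not exist in dbt today -#}
        {%- set update_condition = config.get('merge_update_only_when_columns_different') -%}

        merge into {{ target }} as DBT_INTERNAL_DEST
        using {{ source }} as DBT_INTERNAL_SOURCE
        on DBT_INTERNAL_SOURCE.{{ unique_key }} = DBT_INTERNAL_DEST.{{ unique_key }}

        when matched {% if update_condition %}and {{ update_condition }} {% endif %}then update set
            {%- for column in dest_columns %}
            {{ column.quoted }} = DBT_INTERNAL_SOURCE.{{ column.quoted }}{{ "," if not loop.last }}
            {%- endfor %}

        when not matched then insert ({{ dest_columns | map(attribute='quoted') | join(', ') }})
        values (
            {%- for column in dest_columns %}
            DBT_INTERNAL_SOURCE.{{ column.quoted }}{{ "," if not loop.last }}
            {%- endfor %}
        )
    {%- endmacro %}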

I will try that and see how that goes.

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 1 year ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.