Closed Yotamho closed 1 year ago
@Yotamho Thanks for opening!
This sounds to me a lot like the ask to pass additional custom "predicates" to the merge
performed by dbt incremental models. There's been longstanding discussion and work in progress on that. See: https://github.com/dbt-labs/dbt-core/issues/3293, https://github.com/dbt-labs/dbt-core/issues/5680, https://github.com/dbt-labs/dbt-snowflake/issues/231, ...
There's a subtle distinction between:
merge into ...
using ...
on <unique_key_match> and <custom_condition> ...
when matched then update ...
when not matched then insert ...
and:
merge into ...
using ...
on <unique_key_match> ...
when matched and <custom_condition> then update ...
when not matched then insert ...
In the first case, if the unique key matches but the custom condition isn't met, the row will be inserted as a new record instead of updating an existing record. (Could lead to duplicates in the target table.)
In the second case, if the unique key matches, but the custom condition isn't met, that new row goes ... nowhere. Is that preferable behavior? Or risk greater confusion?
It may help to get concrete about specific use cases. Looking at yours:
For example when there is a scenario of out of order data that is arbitrarily inserted into the source table of a model, this feature allows us to omit this data by using a condition like: DBT_INTERNAL_SOURCE.timestamp > DBT_INTERNAL_DEST.timestamp.
Is this different from the common logic that we ask users to stick in incremental models? Yes, this is an implicit join within the incremental model logic instead, but I'd rather have it live in the model logic, where it can be more easily seen / debugged, than tucked away in the materialization logic / merge statement:
select * from {{ ref('upstream_table') }}
{% if is_incremental() %}
where timestamp_col > (select max(timestamp_col) as max_ts from {{ this }})
{% endif %}
Hi @jtcohen6, thanks for responding! I see the benefit and readability of sticking to the incremental model pattern, but in certain use cases I think the discussed enhancement is still needed. I would like to demonstrate. Let's say the model runs twice, on two batches of records:
# run 1:
# name, timestamp
john, 01:00
alex, 03:00
# run 2:
# name, timestamp
john, 01:00
alex, 03:00
john, 02:00
If the name is the unique key, and we wish to update records when they arrive with a newer timestamp, then simply basing the incremental filter on the timestamp will ignore the updated "john" record (its 02:00 timestamp is older than the 03:00 maximum already in the target).
If we instead add the condition src.timestamp > dst.timestamp
to the merge join condition, the name will no longer act as a unique key and we will have 2 records of "john".
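Per the distinction drawn earlier in the thread, the condition can go in the when matched clause instead of the on clause, so that name keeps acting as the unique key. A hedged sketch for this example (table names are illustrative, not from the thread):

```sql
-- Illustrative merge for the john/alex example above.
merge into target_table as dst
using new_batch as src
    on dst.name = src.name                          -- unique key match stays alone in ON
when matched and src.timestamp > dst.timestamp      -- john@02:00 updates john@01:00
    then update set timestamp = src.timestamp
when not matched
    then insert (name, timestamp) values (src.name, src.timestamp);
```

With the predicate placed here, a matched row that fails the condition is simply skipped, rather than inserted as a duplicate.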
When data might come out of order, I can think of 2 implementations of incremental models (I'm sure there are more approaches): one based on "max_ts", and one filtering out records using the same "merge_condition" discussed (I think this is more applicable when you know a certain time that out of order records should arrive in).

(Jumping in quickly because @dbeatty10 has pointed out to me that, where I said "subtle distinction" above, I had initially pasted the exact same code twice. Sorry about that!!)
(Doug - assigning both of us, just so that one of us writes a quick follow-up to the conversation we had on Monday!)
Hey @jtcohen6 @dbeatty10, just commenting for your attention because I feel this was forgotten. Thanks!
I am very interested in this.
My use case is that I only want to update when a column "hashdiff" has changed:
when matched and old.hashdiff <> new.hashdiff then update
Without that ability, identifying the rows that actually changed requires a self-join. That adds extra processing, whereas this simple comparison would avoid it.
I am considering a custom materialization to work around this issue.
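As a hedged sketch of that use case (table and column names are illustrative, not from the thread), the desired merge would look like:

```sql
-- Hypothetical merge: update only when the row content actually changed.
merge into analytics.dim_customer as dst
using analytics.stg_customer as src
    on dst.customer_id = src.customer_id          -- unique key match
when matched and dst.hashdiff <> src.hashdiff     -- extra predicate: skip no-op updates
    then update set
        name     = src.name,
        email    = src.email,
        hashdiff = src.hashdiff
when not matched
    then insert (customer_id, name, email, hashdiff)
    values (src.customer_id, src.name, src.email, src.hashdiff);
```

Rows whose hashdiff is unchanged match on the key but fail the predicate, so the warehouse never rewrites them.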
After giving this another read through, I see significant overlap between this issue, and one I just responded to yesterday: https://github.com/dbt-labs/dbt-core/issues/6415#issuecomment-1348827516
Two points that carry over in particular:
putting this logic in the model's select statement, and wanting the merge statement to work differently for certain models. There's a new & better pattern for doing this, starting in v1.3 (but we still need to document it! https://github.com/dbt-labs/docs.getdbt.com/issues/1761). You could define (e.g.) a macro named get_incremental_merge_conditional_update_sql, and set incremental_strategy: 'merge_conditional_update' in your model config, along with any other configs that would be relevant to templating your custom merge statement. All that, without having to copy-paste-edit the incremental materialization, or defining one of your own.

Interesting. I will need to update to 1.3 and see if that will allow me to fix my problem.
In reviewing this implementation: for my purpose, I think I would rather simply override default__get_merge_sql to accept a new config setting of "merge_update_only_when_columns_different" (or something) and append it to the when matched clause: "when matched {merge_update_only_when_columns_different code here} then update set ..."
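A minimal sketch of that override, assuming recent dbt-core macro signatures and a single-column unique_key; the config name comes from the comment above, and the body is heavily abbreviated compared with the real default__get_merge_sql (which also handles missing unique keys, column exclusions, etc.):

```sql
-- Project-level override: a macro named default__get_merge_sql in your own
-- project takes precedence over the adapter's built-in version.
{% macro default__get_merge_sql(target, source, unique_key, dest_columns, incremental_predicates=none) %}
    {%- set update_condition = config.get('merge_update_only_when_columns_different') -%}
    {%- set csv = dest_columns | map(attribute='name') | join(', ') -%}
    merge into {{ target }} as DBT_INTERNAL_DEST
    using {{ source }} as DBT_INTERNAL_SOURCE
        on DBT_INTERNAL_DEST.{{ unique_key }} = DBT_INTERNAL_SOURCE.{{ unique_key }}
    when matched {% if update_condition %}and {{ update_condition }} {% endif %}then update set
        {%- for column in dest_columns %}
        {{ column.name }} = DBT_INTERNAL_SOURCE.{{ column.name }}{{ "," if not loop.last }}
        {%- endfor %}
    when not matched then insert ({{ csv }})
        values ({% for column in dest_columns %}DBT_INTERNAL_SOURCE.{{ column.name }}{{ ", " if not loop.last }}{% endfor %});
{% endmacro %}
```

With merge_update_only_when_columns_different set to e.g. DBT_INTERNAL_DEST.hashdiff <> DBT_INTERNAL_SOURCE.hashdiff, unchanged rows would be skipped entirely.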
I will try that and see how that goes.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Is this your first time submitting a feature request?
Describe the feature
In some databases (namely Snowflake and Postgres), it is possible to add a condition to a merge's update clause:
when matched and <condition> then ...
I want to allow adding this kind of condition to incremental models (by making it an incremental model configuration).

Describe alternatives you've considered
One alternative would be to join the records in the incremental run with their destination records, check for which rows the condition is not satisfied, and omit those rows. Such a solution would be less performant and would make the model more complicated to understand.
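That self-join alternative could look roughly like this inside the model SQL (a hedged sketch; table and column names are illustrative):

```sql
-- Filter out stale rows by joining against the existing target ({{ this }}).
select src.*
from {{ ref('upstream_table') }} as src
{% if is_incremental() %}
left join {{ this }} as dst
    on src.name = dst.name
where dst.name is null                  -- brand-new key, or
   or src.timestamp > dst.timestamp    -- a genuinely newer record
{% endif %}
```

This achieves the same effect as the proposed merge condition, but at the cost of an extra scan of the target table on every incremental run.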
Who will this benefit?
For example, when there is a scenario of out-of-order data that is arbitrarily inserted into the source table of a model, this feature allows us to omit this data by using a condition like:
DBT_INTERNAL_SOURCE.timestamp > DBT_INTERNAL_DEST.timestamp

Are you interested in contributing this feature?
yes
Anything else?
I have forks with a suggested implementation: https://github.com/Yotamho/dbt-core https://github.com/Yotamho/dbt-snowflake