Add on change to merge incremental strategy

IvanDerdicDer commented 6 months ago

Describe the feature

Currently the merge strategy updates every row whose key matches. This produces duplicate entries in Change Data Feed if the matched row did not actually change. To fix this, when update columns are defined, the merge should also build a condition on those columns and add it to the matched clause, for example "when matched and (s.count <> d.count or s.max <> d.max) then update set d.count = s.count, d.max = s.max".
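To make the proposal concrete, here is a minimal sketch of how such a statement could be assembled. It is not the dbt-databricks implementation: `build_merge_sql` and its parameters are hypothetical names used only for illustration. The `null_safe` option is an assumption beyond the original request: a plain `<>` comparison evaluates to NULL when either side is NULL, so a change from NULL to a value would be skipped, and Spark SQL's null-safe equality operator `<=>` is one way around that.

```python
def build_merge_sql(target, source, key, update_columns, null_safe=True):
    """Build a MERGE statement that only updates rows whose values changed.

    Hypothetical helper for illustration; names and signature are assumptions.
    """
    if null_safe:
        # `<=>` is Spark SQL's null-safe equality; negate it to detect a change.
        diffs = [f"not (s.{c} <=> d.{c})" for c in update_columns]
    else:
        # Plain inequality, as written in the issue; misses NULL transitions.
        diffs = [f"s.{c} <> d.{c}" for c in update_columns]

    changed = " or ".join(diffs)
    assignments = ", ".join(f"d.{c} = s.{c}" for c in update_columns)

    return (
        f"merge into {target} as d\n"
        f"using {source} as s\n"
        f"on s.{key} = d.{key}\n"
        f"when matched and ({changed}) then update set {assignments}\n"
        f"when not matched then insert *"
    )


if __name__ == "__main__":
    print(build_merge_sql("dest", "src", "id", ["count", "max"], null_safe=False))
```

With `null_safe=False` the matched clause has the same shape as the example in the description; with the default, each comparison becomes `not (s.col <=> d.col)`.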

Describe alternatives you've considered

There are two alternatives, but both require an extra join that would not be necessary if this feature were implemented. The first option is, when the dbt run is incremental, to semi join against the destination table and keep only new or updated rows. The second option is to filter out non-updated rows after reading the Change Data Feed by semi joining back on the same dataset.
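As a rough illustration of the first alternative, the sketch below does the pre-filtering in plain Python on in-memory rows rather than in SQL. The function name `new_or_changed` and the row layout (a list of dicts) are assumptions made for the example. It shows the extra lookup against the destination that the proposed merge condition would make unnecessary.

```python
def new_or_changed(source_rows, dest_rows, key, columns):
    """Keep only source rows that are new or differ from the destination.

    Hypothetical helper: rows are dicts, `key` is the merge key column,
    and `columns` are the update columns to compare.
    """
    # Index the destination by key; this stands in for the extra join.
    dest_by_key = {row[key]: row for row in dest_rows}

    kept = []
    for row in source_rows:
        existing = dest_by_key.get(row[key])
        # New key, or at least one compared column has a different value.
        if existing is None or any(row[c] != existing[c] for c in columns):
            kept.append(row)
    return kept


if __name__ == "__main__":
    dest = [{"id": 1, "count": 5}, {"id": 2, "count": 6}]
    src = [{"id": 1, "count": 5}, {"id": 2, "count": 7}, {"id": 3, "count": 1}]
    # id 1 is unchanged and dropped; id 2 changed and id 3 is new.
    print(new_or_changed(src, dest, "id", ["count"]))
```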

Additional context

Please include any other relevant context here.

Who will this benefit?

This benefits anyone who uses the merge strategy together with Change Data Feed, since no duplicate entries would appear in the feed. It also benefits anyone using the merge strategy on its own, since the merge should execute faster when it updates only the matched rows that actually changed rather than every row that matches on the key.

Are you interested in contributing this feature?

I would like to implement this feature. I would just need help setting up the development environment and testing.

benc-db commented 6 months ago

Let me know how I can help you get started. I agree that it would be useful to keep the CDC feed clean, but do not have capacity to work on this at this time.

IvanDerdicDer commented 5 months ago

Hi,

Thank you for your offer to help. I found the contribution guide, so I'll start from there. If I get stuck, I'll get in touch.

mi-volodin commented 2 months ago

Fixed in #739

databricks / dbt-databricks