Datavault-UK / automate-dv

A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0
511 stars 131 forks source link

[BUG] BUG: Duplicates loaded into existing satellite #221

Closed khalidhussein138 closed 8 months ago

khalidhussein138 commented 10 months ago

Describe the bug When loading into a Satellite table where Stage contains two or more duplicate records (the same hash_key, hash_diff and load_timestamp) and the records have not been previously loaded into the satellite, both records are loaded into the satellite rather than just one.

Environment

dbt version: 1.5.2 automate_dv version: 0.10.1 Database/Platform: Snowflake

To Reproduce

Scenario: Duplicates loaded into Satellite
Given:  Stage contains two exact duplicate records.
And: A satellite already exists
And: The satellite does not contain the records.
And: apply_source_filter is set to TRUE

Stage:
| hash_key_column | arbitrary_context_columns | hash_diff                 | load_ts                   |
| 6cf8            | arbitrary data            | md5('arbitrary data')     | 2022-05-22 00:13:48.901   |
| 6cf8            | arbitrary data            | md5('arbitrary data')     | 2022-05-22 00:13:48.901   |

When: I load into the existing satellite.
Then: The satellite is loaded with both records.

Satellite:
| hash_key_column | arbitrary_context_columns | hash_diff                 | load_ts                   |
| 6cf8            | arbitrary data            | md5('arbitrary data')     | 2022-05-22 00:13:48.901   |
| 6cf8            | arbitrary data            | md5('arbitrary data')     | 2022-05-22 00:13:48.901   |

Expected behavior

Scenario: Duplicates not loaded into Satellite
Given:  Stage contains two exact duplicate records.
And: A satellite already exists
And: The satellite does not contain the records.
And: apply_source_filter is set to TRUE

Stage:
| hash_key_column | arbitrary_context_columns | hash_diff                 | load_ts                   |
| 6cf8            | arbitrary data            | md5('arbitrary data')     | 2022-05-22 00:13:48.901   |
| 6cf8            | arbitrary data            | md5('arbitrary data')     | 2022-05-22 00:13:48.901   |

When: I load into the existing satellite.
Then: The satellite should be loaded with a single record.

Satellite:
| hash_key_column | arbitrary_context_columns | hash_diff                 | load_ts                   |
| 6cf8            | arbitrary data            | md5('arbitrary data')     | 2022-05-22 00:13:48.901   |

The root cause When debugging the Satellite script, we noticed that in this CTE the rank() statement is ranking all duplicate records as 1 which causes the records to be loaded. As a suggestion, changing this to a row_number() would solve the issue.

DVAlexHiggs commented 10 months ago

Hi,

Thanks for this report!

This is something we've noticed internally on our own projects and have recently developed a fix for and hope to release in due course.

We've also decided to change RANK() for ROW_NUMBER() everywhere for the reason you describe as well as performance enhancements.

drewsonne commented 9 months ago

Hi @DVAlexHiggs that's good to know, any idea when you say "due course", what sort of time frame do you think you're looking at? Are we talking a month, 6 months, 2 years?

DVAlexHiggs commented 9 months ago

Hi @DVAlexHiggs that's good to know, any idea when you say "due course", what sort of time frame do you think you're looking at? Are we talking a month, 6 months, 2 years?

Apologies for the vagueness! The target is this month; we've got a few QoL, performance improvements and various other things coming down the line 😄

drewsonne commented 9 months ago

Amazing, thank you!

DVAlexHiggs commented 8 months ago

Fixed in v0.10.2 😄 Thanks for your patience for release of this! Please let us know if you experience any issues by responding here or opening a new issue.