Datavault-UK / automate-dv

A free-to-use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open-source data engineering tool; dbt is a registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0

[BUG] incremental loading method for satellites missing data #233

Open crystalgtaylor91 opened 1 month ago

crystalgtaylor91 commented 1 month ago

Describe the bug Incremental loading for a satellite with batch-loaded data (multiple rows for the same PK with different hashdiffs but the same load_datetime) does not work as it should. Because of the way the CTEs are written, rows of unique data that should be included are skipped: the window functions in the separate CTEs do not maintain a consistent row ordering.

Environment

dbt version: 1.7.13
automate_dv version: 0.10.2
Database/Platform: Snowflake

To Reproduce Steps to reproduce the behavior:

  1. Create a raw staging table with multiple entries that share the same PK and load_datetime but have different payloads
  2. Create a stg table using the staging table as source, hashing the PK and the payload columns
  3. Create a sat table with incremental load and source filtering
  4. Check the row counts for the stg and satellite tables; they will differ from what is expected.

Example I used to replicate this error, plus a potential solution: example.zip

Expected behavior All unique rows should be selected. Unique rows are currently being filtered out because applying the window functions in two separate CTEs means the row ordering can differ between them. This can be fixed by applying the row_number() function in the source_data CTE and then using that row number in the ORDER BY clause of the LAG function in the unique_source_records CTE. See the attached files for an example.
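A minimal sketch of the proposed fix, runnable with SQLite standing in for Snowflake. The table and column names (`raw_stage`, `pk`, `hashdiff`, `load_datetime`) are hypothetical and simplified; the actual automate-dv macro SQL differs. The idea is the one described above: assign ROW_NUMBER() once in the source_data CTE, then reuse it as a deterministic tie-breaker in the LAG ordering so rows sharing a load_datetime are always compared in the same order.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical raw staging table: one PK, one load_datetime,
# three rows with distinct hashdiffs (the batch-load scenario).
con.executescript("""
CREATE TABLE raw_stage (pk TEXT, hashdiff TEXT, load_datetime TEXT);
INSERT INTO raw_stage VALUES
  ('k1', 'h1', '2024-01-01'),
  ('k1', 'h2', '2024-01-01'),
  ('k1', 'h3', '2024-01-01');
""")

# Sketch of the corrected pattern: the row number computed in
# source_data is carried into unique_source_records and used as a
# tie-breaker in the LAG ordering, so ties on load_datetime no longer
# produce an arbitrary (and potentially inconsistent) ordering.
# Tie-breaking by hashdiff here is an assumption for determinism.
rows = con.execute("""
WITH source_data AS (
    SELECT pk, hashdiff, load_datetime,
           ROW_NUMBER() OVER (
               PARTITION BY pk
               ORDER BY load_datetime, hashdiff
           ) AS rn
    FROM raw_stage
),
unique_source_records AS (
    SELECT pk, hashdiff, load_datetime,
           LAG(hashdiff) OVER (
               PARTITION BY pk
               ORDER BY load_datetime, rn
           ) AS prev_hashdiff
    FROM source_data
)
SELECT pk, hashdiff
FROM unique_source_records
WHERE prev_hashdiff IS NULL OR hashdiff != prev_hashdiff
""").fetchall()

# All three rows with distinct hashdiffs survive the dedupe filter.
print(sorted(rows))
```

With a deterministic ordering, none of the three distinct-hashdiff rows are dropped by the `hashdiff != prev_hashdiff` filter; without the shared tie-breaker, the two CTEs are free to order tied rows differently.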

AB#5344

DVAlexHiggs commented 1 month ago

Thank you for this report! On the surface this does look like a legitimate bug. We will look to test this on our end as soon as we can and get back to you.