Datavault-UK / automate-dv

A free-to-use dbt package for creating and loading Data Vault 2.0 compliant data warehouses (powered by dbt, an open-source data engineering tool; dbt is a registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0

[BUG] Bug with JOIN condition in records_to_insert clause for subsequent run in SAT table macro #197

Closed: ssahoo-phdata closed this issue 1 year ago

ssahoo-phdata commented 1 year ago

Describe the bug

In dbt Cloud, the JOIN condition in the records_to_insert clause generated for subsequent runs of the SAT table macro does not match the code shown in the documentation. As a result, duplicate records coming from the raw tables in incremental runs are not filtered out of the satellite.

Expected code for subsequent loads in the SAT table macro (per the documentation):

https://automate-dv.readthedocs.io/en/v0.9.4/macros/#example-output_3

records_to_insert AS (
    SELECT DISTINCT e.CUSTOMER_HK, e.HASHDIFF, e.CUSTOMER_NAME, e.CUSTOMER_PHONE, e.CUSTOMER_DOB, e.EFFECTIVE_FROM, e.LOAD_DATE, e.SOURCE
    FROM source_data AS e
    LEFT JOIN latest_records
        ON latest_records.CUSTOMER_HK = e.CUSTOMER_HK
    WHERE latest_records.HASHDIFF != e.HASHDIFF
        OR latest_records.HASHDIFF IS NULL
)

Produced Code

records_to_insert AS (
    SELECT DISTINCT stage.CUSTOMER_HK, stage.HASHDIFF, stage.CUSTOMER_NAME, stage.CUSTOMER_PHONE, stage.CUSTOMER_DOB, stage.EFFECTIVE_FROM, stage.LOAD_DATE, stage.SOURCE
    FROM source_data AS stage
    LEFT JOIN latest_records
        ON latest_records.CUSTOMER_HK = stage.CUSTOMER_HK AND latest_records.HASHDIFF = stage.HASHDIFF
    WHERE 
        latest_records.HASHDIFF IS NULL
)
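
For context, here is a minimal sketch with hypothetical hash keys and hashdiffs (assuming latest_records holds at most one row per CUSTOMER_HK) suggesting that the produced anti-join still flags changed records and new keys, and filters only rows whose key and hashdiff both match the latest record:

-- Hypothetical sample data; not taken from the project
WITH latest_records AS (
    SELECT 'HK1' AS CUSTOMER_HK, 'HD_OLD' AS HASHDIFF
),
source_data AS (
    SELECT 'HK1' AS CUSTOMER_HK, 'HD_NEW' AS HASHDIFF   -- changed payload for an existing key
    UNION ALL
    SELECT 'HK1' AS CUSTOMER_HK, 'HD_OLD' AS HASHDIFF   -- unchanged payload
    UNION ALL
    SELECT 'HK2' AS CUSTOMER_HK, 'HD_X' AS HASHDIFF     -- brand new key
),
records_to_insert AS (
    SELECT DISTINCT stage.CUSTOMER_HK, stage.HASHDIFF
    FROM source_data AS stage
    LEFT JOIN latest_records
        ON latest_records.CUSTOMER_HK = stage.CUSTOMER_HK
        AND latest_records.HASHDIFF = stage.HASHDIFF
    WHERE latest_records.HASHDIFF IS NULL
)
SELECT * FROM records_to_insert;
-- Returns ('HK1', 'HD_NEW') and ('HK2', 'HD_X'); the unchanged row is filtered out,
-- which is the same outcome the documented PK-only join with the hashdiff WHERE clause
-- produces for this sample data.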

Environment

dbt version: 1.4.6
automate_dv version: 0.9.5
Database/Platform: Snowflake

To Reproduce

Steps to reproduce the behavior:

  1. Create an initial satellite table
  2. Run dbt compile on the satellite model
  3. Check the JOIN condition in the records_to_insert clause generated for the subsequent run
  4. Compare it with the code shown in the documentation under Macros > sat

Expected behavior

The expected code for subsequent loads in the SAT table macro is the documented output shown above, per https://automate-dv.readthedocs.io/en/v0.9.4/macros/#example-output_3.

ssahoo-phdata commented 1 year ago

Update: I am closing this bug, as I have found the cause of the satellite not filtering out the duplicate records arriving in subsequent runs. The duplicate records coming into the source tables have distinct LOAD_DATETIME values, so the satellite treats the duplicates as distinct records and they get inserted into the satellite.
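
For illustration, a minimal sketch with hypothetical values (column list trimmed to the hash key, hashdiff, and load date) showing how rows that differ only in their load timestamp survive the SELECT DISTINCT and both reach the satellite:

-- Hypothetical sample data; LOAD_DATE simplified to string literals
WITH latest_records AS (
    SELECT 'HK1' AS CUSTOMER_HK, 'HD0' AS HASHDIFF   -- current satellite record has a different hashdiff
),
source_data AS (
    -- two staged rows with an identical payload (same HASHDIFF) but distinct load timestamps
    SELECT 'HK1' AS CUSTOMER_HK, 'HD1' AS HASHDIFF, '2023-01-01 10:00:00' AS LOAD_DATE
    UNION ALL
    SELECT 'HK1' AS CUSTOMER_HK, 'HD1' AS HASHDIFF, '2023-01-01 11:00:00' AS LOAD_DATE
),
records_to_insert AS (
    SELECT DISTINCT stage.CUSTOMER_HK, stage.HASHDIFF, stage.LOAD_DATE
    FROM source_data AS stage
    LEFT JOIN latest_records
        ON latest_records.CUSTOMER_HK = stage.CUSTOMER_HK
        AND latest_records.HASHDIFF = stage.HASHDIFF
    WHERE latest_records.HASHDIFF IS NULL
)
SELECT * FROM records_to_insert;
-- Returns both rows: they share CUSTOMER_HK and HASHDIFF but not LOAD_DATE, so
-- DISTINCT cannot collapse them and both are inserted into the satellite.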