Datavault-UK / automate-dv

A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0
511 stars 131 forks source link

[BUG] duplicates in effectivity satelites #231

Open pdaszczynski opened 7 months ago

pdaszczynski commented 7 months ago

Describe the bug When for two last records in stage for one src_dfk has not been changed src_sfk and between previous and current data processing for this particular src_dfk nothing changed eff_sat always insert one additional record to table.

Environment

dbt version: 1.7.8 automate_dv version: 0.10.2 Database/Platform: postgres

To Reproduce Steps to reproduce the behavior:

  1. Insert to stage like below, I have deleted *hk columns to not do the mess here
updated_date organisation_id customer_id start_date end_date effective_from load_date
2024-04-12 7 9491 2024-04-12 2024-04-19 2024-04-12 2024-04-24 08:18:21.882583 +00:00
2024-04-19 5 9491 2024-04-19 2024-04-21 2024-04-19 2024-04-24 08:18:21.882583 +00:00
2024-04-21 5 9491 2024-04-21 9999-12-31 2024-04-21 2024-04-24 08:18:21.882583 +00:00
  1. create effectivity satelite

    {{ config(materialized='incremental')  }}
    {%- set source_model = "stage_organisation_customer" -%}
    {%- set src_pk = "CUSTOMER_ORGANISATION_HK" -%}
    {%- set src_dfk = "CUSTOMER_HK"       -%}
    {%- set src_sfk = "ORGANISATION_HK"         -%}
    {%- set src_start_date = "START_DATE" -%}
    {%- set src_end_date =   "END_DATE"     -%}
    {%- set src_eff = "EFFECTIVE_FROM"    -%}
    {%- set src_ldts = "LOAD_DATE"    -%}
    {%- set src_source = "RECORD_SOURCE"  -%}
    {{ automate_dv.eff_sat(src_pk=src_pk, src_dfk=src_dfk, src_sfk=src_sfk,
                       src_start_date=src_start_date,
                       src_end_date=src_end_date,
                       src_eff=src_eff, src_ldts=src_ldts,
                       src_source=src_source,
                       source_model=source_model) }}
  2. run dbt couple of times

  3. See result in effectivity satelite

customer_hk organisation_hk start_date end_date load_date
003dd617-c12d-444f-f9c8-0f717c3fa982 8f14e45f-ceea-167a-5a36-dedd4bea2543 2024-04-12 2024-04-19 2024-04-24 07:50:07.914253 +00:00
003dd617-c12d-444f-f9c8-0f717c3fa982 e4da3b7f-bbce-2345-d777-2b0674a318d5 2024-04-19 2024-04-21 2024-04-24 07:50:07.914253 +00:00
003dd617-c12d-444f-f9c8-0f717c3fa982 e4da3b7f-bbce-2345-d777-2b0674a318d5 2024-04-21 9999-12-31 2024-04-24 07:50:07.914253 +00:00
003dd617-c12d-444f-f9c8-0f717c3fa982 e4da3b7f-bbce-2345-d777-2b0674a318d5 2024-04-21 9999-12-31 2024-04-24 08:02:21.036765 +00:00
003dd617-c12d-444f-f9c8-0f717c3fa982 e4da3b7f-bbce-2345-d777-2b0674a318d5 2024-04-19 2024-04-21 2024-04-24 08:02:28.049750 +00:00
003dd617-c12d-444f-f9c8-0f717c3fa982 e4da3b7f-bbce-2345-d777-2b0674a318d5 2024-04-21 9999-12-31 2024-04-24 08:02:37.983116 +00:00

Expected behavior Records like below should not been added all the time when I run dbt

customer_hk organisation_hk start_date end_date load_date
003dd617-c12d-444f-f9c8-0f717c3fa982 e4da3b7f-bbce-2345-d777-2b0674a318d5 2024-04-21 9999-12-31 2024-04-24 08:02:21.036765 +00:00
003dd617-c12d-444f-f9c8-0f717c3fa982 e4da3b7f-bbce-2345-d777-2b0674a318d5 2024-04-19 2024-04-21 2024-04-24 08:02:28.049750 +00:00
003dd617-c12d-444f-f9c8-0f717c3fa982 e4da3b7f-bbce-2345-d777-2b0674a318d5 2024-04-21 9999-12-31 2024-04-24 08:02:37.983116 +00:00

Additional context Not sure but I think this is because there is also same load_date for records and in final query partition by and order in this shape is not enough

latest_records AS (
    SELECT * FROM (
        SELECT b.CUSTOMER_ORGANISATION_HK, b.CUSTOMER_HK, b.ORGANISATION_HK, b.START_DATE, b.END_DATE, b.EFFECTIVE_FROM, b.LOAD_DATE, b.RECORD_SOURCE,
               ROW_NUMBER() OVER (
                    PARTITION BY b.CUSTOMER_ORGANISATION_HK
                    ORDER BY b.LOAD_DATE DESC
               ) AS row_num
        FROM "dbt_dv"."dv"."eff_sat_customer_organisation" AS b
        where customer_hk in (select customer_hk from stage.stage_organisation_customer where customer_id in (9491))
    ) AS inner_rank
    WHERE row_num = 1
)

AB#5346

DVAlexHiggs commented 7 months ago

Hello!

Thanks for this report. If possible can you please edit your post and format your data as tables and code as code snippets using markdown? This would make it easier to read and understand what the issue is.

If not we can edit your post for you however this will delay the process.

Thanks!

pdaszczynski commented 7 months ago

Hi Alex, I have updated as you requested.

DVAlexHiggs commented 7 months ago

Thank you. We believe this could be related to #221 but will need to investigate it further as this should have only affected standard satellites.

ImBharathwaj commented 1 week ago

Hi @DVAlexHiggs We are also facing the same issue while running dbt models in incremental manner. This issue persists even after upgrading automate_dv to 0.10.2. When can we expect the fix for this issue.