Datavault-UK / automate-dv

A free-to-use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open-source data engineering tool; dbt is a registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0

[BUG] Effectivity Satellite macro not creating correct start and end dates #115

Closed klaus1978 closed 2 years ago

klaus1978 commented 2 years ago

**Describe the bug**

When creating an effectivity satellite, I would expect proper start and end dates in the data. The current solution seems to have a bug when the data is flip-flopping. The problem is already visible in the test documentation: https://github.com/Datavault-UK/dbtvault/blob/develop/test/features/eff_sats/eff_sat_auto_end_dating_detail_inc.feature, line 114 ff.

After the customer has changed three times, the final output for LOAD_DATETIME=2018-06-01 18:00:00.000 is wrong:

| ORDER_CUSTOMER_PK | CUSTOMER_PK | ORDER_PK | START_DATE | END_DATE | EFFECTIVE_FROM | LOAD_DATETIME | SOURCE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| md5('1002\|\|100') | md5('1002') | md5('100') | 2018-06-01 09:00:00.000 | 2018-06-01 18:00:00.000 | 2018-06-01 18:00:00.000 | 2018-06-01 18:00:00.000 | * |
| md5('1001\|\|100') | md5('1001') | md5('100') | 2018-06-01 00:00:00.000 | 9999-12-31 23:59:59.999 | 2018-06-01 18:00:00.000 | 2018-06-01 18:00:00.000 | * |

(compare to lines 160/161 in the documentation)

In other words, both rows would be considered valid for the time period between 09:00 and 18:00.
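To make the overlap concrete, here is a minimal sanity-check query (a sketch only; `eff_sat_order_customer` is a hypothetical table name standing in for the effectivity satellite above) that flags driving keys with overlapping validity periods:

```sql
-- Sketch: find pairs of effectivity records for the same driving key (ORDER_PK)
-- whose START_DATE/END_DATE periods overlap. The buggy output above would be
-- flagged, since both rows are "valid" between 09:00 and 18:00.
SELECT
    a.ORDER_PK,
    a.CUSTOMER_PK AS customer_a,
    b.CUSTOMER_PK AS customer_b,
    a.START_DATE,
    a.END_DATE,
    b.START_DATE AS overlap_start,
    b.END_DATE   AS overlap_end
FROM eff_sat_order_customer AS a
INNER JOIN eff_sat_order_customer AS b
    ON  a.ORDER_PK = b.ORDER_PK
    AND a.ORDER_CUSTOMER_PK <> b.ORDER_CUSTOMER_PK
    AND a.START_DATE < b.END_DATE   -- classic interval-overlap test
    AND b.START_DATE < a.END_DATE;
```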

**Versions**

dbt: 1.0.4
dbtvault: 0.8.2

**To Reproduce**

Since the problem is already visible in your tests, I will skip that part!

**Expected behavior**

The correct output for the last LOAD_DATETIME should be:

| ORDER_CUSTOMER_PK | CUSTOMER_PK | ORDER_PK | START_DATE | END_DATE | EFFECTIVE_FROM | LOAD_DATETIME | SOURCE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| md5('1001\|\|100') | md5('1001') | md5('100') | 2018-06-01 00:00:00.000 | 2018-06-01 09:00:00.000 | 2018-06-01 00:00:00.000 | 2018-06-01 18:00:00.000 | * |
| md5('1002\|\|100') | md5('1002') | md5('100') | 2018-06-01 09:00:00.000 | 2018-06-01 18:00:00.000 | 2018-06-01 18:00:00.000 | 2018-06-01 18:00:00.000 | * |
| md5('1001\|\|100') | md5('1001') | md5('100') | 2018-06-01 18:00:00.000 | 9999-12-31 23:59:59.999 | 2018-06-01 18:00:00.000 | 2018-06-01 18:00:00.000 | * |

In words, for order 100: 1001 is valid from 00:00 to 09:00, then 1002 is valid from 09:00 to 18:00, then 1001 is valid from 18:00 until the end of time (9999-12-31 23:59:59.999).

**Additional context**

As far as I have seen, one problem would be fixed by changing the SQL generated by the eff_sat macro. FROM:

```sql
new_reopened_records AS (
    SELECT DISTINCT
        lc."L_PLANT_CUSTOMER_PK",
        lc."PLANT_HK", lc."CUSTOMER_HK",
        lc."STARTDATE" AS "STARTDATE",  -- note the lc. prefix: start date comes from the old closed record
        g."ENDDATE" AS "ENDDATE",
        g."DV_APPTS" AS "DV_APPTS",
        g."DV_LDTS",
        g."DV_RECSOURCE"
    FROM source_data AS g
    INNER JOIN latest_closed AS lc
        ON g."L_PLANT_CUSTOMER_PK" = lc."L_PLANT_CUSTOMER_PK"
    WHERE TO_DATE(g."ENDDATE") = TO_DATE('9999-12-31 23:59:59.999999')
```

TO (taking the start date from the incoming record `g` instead of from the previously closed record `lc`):

```sql
new_reopened_records AS (
    SELECT DISTINCT
        lc."L_PLANT_CUSTOMER_PK",
        lc."PLANT_HK", lc."CUSTOMER_HK",
        g."STARTDATE" AS "STARTDATE",  -- changed: g. instead of lc.
        g."ENDDATE" AS "ENDDATE",
        g."DV_APPTS" AS "DV_APPTS",
        g."DV_LDTS",
        g."DV_RECSOURCE"
    FROM source_data AS g
    INNER JOIN latest_closed AS lc
        ON g."L_PLANT_CUSTOMER_PK" = lc."L_PLANT_CUSTOMER_PK"
    WHERE TO_DATE(g."ENDDATE") = TO_DATE('9999-12-31 23:59:59.999999')
```
Additionally, the macro has to be enriched with a CTE that returns the old, already-closed records, so that rows like the first line of the expected output above survive the load:

| ORDER_CUSTOMER_PK | CUSTOMER_PK | ORDER_PK | START_DATE | END_DATE | EFFECTIVE_FROM | LOAD_DATETIME | SOURCE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| md5('1001\|\|100') | md5('1001') | md5('100') | 2018-06-01 00:00:00.000 | 2018-06-01 09:00:00.000 | 2018-06-01 00:00:00.000 | 2018-06-01 18:00:00.000 | * |
DVAlexHiggs commented 2 years ago

Hi! Thanks for this report and your time discussing on Slack to help us understand your issue. This is something we will verify and keep this thread updated on!

I edited your post to apply markdown formatting to the tables and code snippets so it's a little more readable, hope that's ok!

klaus1978 commented 2 years ago

Hi, ah great, sorry I am new to this markdown stuff...

Maybe this helps as well: I added another CTE (`already_closed_CTE`) to the SQL, which would create the part described in Additional context. Of course, this is not tested!

```sql
WITH source_data AS (
    SELECT a."L_PLANT_CUSTOMER_PK", a."PLANT_HK", a."CUSTOMER_HK", a."STARTDATE", a."ENDDATE", a."DV_APPTS", a."DV_LDTS", a."DV_RECSOURCE"
    FROM ANALYTICSLAYER.AN_DVSCM_STAGE_D.T_STR_SV_P41_P_T001W_SEF_L_CUSTOMER_PLANT_K41 AS a
    WHERE a."PLANT_HK" IS NOT NULL
    AND a."CUSTOMER_HK" IS NOT NULL
),

latest_records AS (
    SELECT b."L_PLANT_CUSTOMER_PK", b."PLANT_HK", b."CUSTOMER_HK", b."STARTDATE", b."ENDDATE", b."DV_APPTS", b."DV_LDTS", b."DV_RECSOURCE",
           ROW_NUMBER() OVER (
                PARTITION BY b."L_PLANT_CUSTOMER_PK"
                ORDER BY b."DV_LDTS" DESC
           ) AS row_num
    FROM INTEGRATIONLAYER.IN_DVSCM_D.SEF_CUSTOMER_PLANT_P41 AS b
    QUALIFY row_num = 1
),

latest_open AS (
    SELECT c."L_PLANT_CUSTOMER_PK", c."PLANT_HK", c."CUSTOMER_HK", c."STARTDATE", c."ENDDATE", c."DV_APPTS", c."DV_LDTS", c."DV_RECSOURCE"
    FROM latest_records AS c
    WHERE TO_DATE(c."ENDDATE") = TO_DATE('9999-12-31 23:59:59.999999')
),

latest_closed AS (
    SELECT d."L_PLANT_CUSTOMER_PK", d."PLANT_HK", d."CUSTOMER_HK", d."STARTDATE", d."ENDDATE", d."DV_APPTS", d."DV_LDTS", d."DV_RECSOURCE"
    FROM latest_records AS d
    WHERE TO_DATE(d."ENDDATE") != TO_DATE('9999-12-31 23:59:59.999999')
),

new_open_records AS (
    SELECT DISTINCT
        f."L_PLANT_CUSTOMER_PK", f."PLANT_HK", f."CUSTOMER_HK", f."STARTDATE", f."ENDDATE", f."DV_APPTS", f."DV_LDTS", f."DV_RECSOURCE"
    FROM source_data AS f
    LEFT JOIN latest_records AS lr
        ON f."L_PLANT_CUSTOMER_PK" = lr."L_PLANT_CUSTOMER_PK"
    WHERE lr."L_PLANT_CUSTOMER_PK" IS NULL
),

new_reopened_records AS (
    SELECT DISTINCT
        lc."L_PLANT_CUSTOMER_PK",
        lc."PLANT_HK", lc."CUSTOMER_HK",
        g."STARTDATE" AS "STARTDATE",
        g."ENDDATE" AS "ENDDATE",
        g."DV_APPTS" AS "DV_APPTS",
        g."DV_LDTS",
        g."DV_RECSOURCE"
    FROM source_data AS g
    INNER JOIN latest_closed AS lc
        ON g."L_PLANT_CUSTOMER_PK" = lc."L_PLANT_CUSTOMER_PK"
    WHERE TO_DATE(g."ENDDATE") = TO_DATE('9999-12-31 23:59:59.999999')
),

new_closed_records AS (
    SELECT DISTINCT
        lo."L_PLANT_CUSTOMER_PK",
        lo."PLANT_HK", lo."CUSTOMER_HK",
        lo."STARTDATE" AS "STARTDATE",
        h."DV_APPTS" AS "ENDDATE",
        h."DV_APPTS" AS "DV_APPTS",
        h."DV_LDTS",
        lo."DV_RECSOURCE"
    FROM source_data AS h
    INNER JOIN latest_open AS lo
        ON lo."PLANT_HK" = h."PLANT_HK"
    WHERE (lo."CUSTOMER_HK" <> h."CUSTOMER_HK")
),

-- new CTE: carry over the old, already end-dated records
already_closed_CTE AS (
    SELECT DISTINCT
        lc."L_PLANT_CUSTOMER_PK",
        lc."PLANT_HK", lc."CUSTOMER_HK",
        lc."STARTDATE" AS "STARTDATE",
        lc."ENDDATE" AS "ENDDATE",
        lc."DV_APPTS" AS "DV_APPTS",
        lc."DV_LDTS",
        lc."DV_RECSOURCE"
    FROM source_data AS g
    INNER JOIN latest_records AS lc
        ON g."L_PLANT_CUSTOMER_PK" = lc."L_PLANT_CUSTOMER_PK"
    WHERE TO_DATE(lc."ENDDATE") <> TO_DATE('9999-12-31 23:59:59.999999')
),

records_to_insert AS (
    SELECT * FROM new_open_records
    UNION
    SELECT * FROM new_reopened_records
    UNION
    SELECT * FROM new_closed_records
    UNION
    SELECT * FROM already_closed_CTE
)

SELECT * FROM records_to_insert
```
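For checking the result, a query like this (also untested, using the target table from above) should return no rows, i.e. each driving key should end up with at most one open record:

```sql
-- Sketch: driving keys (PLANT_HK) with more than one open (not end-dated)
-- record would indicate the effectivity satellite still produces overlaps.
SELECT "PLANT_HK", COUNT(*) AS open_records
FROM INTEGRATIONLAYER.IN_DVSCM_D.SEF_CUSTOMER_PLANT_P41
WHERE TO_DATE("ENDDATE") = TO_DATE('9999-12-31 23:59:59.999999')
GROUP BY "PLANT_HK"
HAVING COUNT(*) > 1;
```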
DVAlexHiggs commented 2 years ago

No problem! Thank you

PulkitIL commented 2 years ago

@DVAlexHiggs: Could you please share an update on the progress of the fix? I am also facing a similar issue when the data related to a unique key flip-flops.

DVAlexHiggs commented 2 years ago

> @DVAlexHiggs: Could you please share an update on the progress of the fix? I am also facing a similar issue when the data related to a unique key flip-flops.

Looking to get something out ASAP. We're aware this is a significant problem. We have a large number of test cases to review and update to ensure that our fix works as intended.

DVAlexHiggs commented 2 years ago

Hello everyone (@PulkitIL, @klaus1978). Thanks for your patience whilst we work on fixing this issue; we understand it's high-impact and are working hard on a fix.

We've made a few changes to the macro and are re-running our full suite of (now corrected) effectivity satellite tests, looking good so far!

DVAlexHiggs commented 2 years ago

We've fixed this and all tests are passing for Snowflake. Now rolling it out to BigQuery and SQL Server and running our test suite against those too.

klaus1978 commented 2 years ago

Awesome, thanks for the fast reaction!

DVAlexHiggs commented 2 years ago

@klaus1978 Fix released in dbtvault 0.8.3. Please let us know or reopen this issue if the problem persists.
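For anyone picking up the fix, a minimal `packages.yml` sketch (assuming the package name as published on dbt Hub at the time); after updating it, run `dbt deps` to pull the new version:

```yaml
# Sketch: pin dbtvault to the release containing the effectivity satellite fix
packages:
  - package: Datavault-UK/dbtvault
    version: 0.8.3
```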

PulkitIL commented 2 years ago

Thanks @DVAlexHiggs for the quick turnaround on this.