Datavault-UK / automate-dv

A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0
478 stars 114 forks source link

[BUG] Idempotency issue with new v0.10.0 Satellite Loading approach #207

Closed DVAlexHiggs closed 10 months ago

DVAlexHiggs commented 10 months ago

Describe the bug When loading true duplicates into the v0.10.0 Satellites, duplicate data is not ignored

Environment

dbt version: v1.4.7 automate_dv version: v0.10.0 Database/Platform: All

To Reproduce

Scenario: [SAT-PM-IM-13] Idempotent intra day satellite load
    Given the SATELLITE_TZ table does not exist
    And the RAW_STAGE_TZ table contains data
      | CUSTOMER_ID | CUSTOMER_NAME | CUSTOMER_DOB | CUSTOMER_PHONE  | EFFECTIVE_FROM             | LOAD_DATE                  | SOURCE |
      | 1002        | Beth          | 1995-08-07   | 17-214-233-1215 | 2019-05-03 12:00:00.000000 | 2019-05-03 12:00:00.000000 | *      |
      | 1002        | Beth          | 1995-08-08   | 17-214-233-1215 | 2019-05-03 12:00:01.000000 | 2019-05-03 12:00:01.000000 | *      |
    And I stage the STG_CUSTOMER_TZ data
    When I load the SATELLITE_TZ sat
    Then the SATELLITE_TZ table should contain expected data
      | CUSTOMER_PK | HASHDIFF                                             | CUSTOMER_NAME | CUSTOMER_DOB | CUSTOMER_PHONE  | EFFECTIVE_FROM             | LOAD_DATE                  | SOURCE |
      | md5('1002') | md5('1995-08-07\|\|1002\|\|BETH\|\|17-214-233-1215') | Beth          | 1995-08-07   | 17-214-233-1215 | 2019-05-03 12:00:00.000000 | 2019-05-03 12:00:00.000000 | *      |
      | md5('1002') | md5('1995-08-08\|\|1002\|\|BETH\|\|17-214-233-1215') | Beth          | 1995-08-08   | 17-214-233-1215 | 2019-05-03 12:00:01.000000 | 2019-05-03 12:00:01.000000 | *      |
    And I load the SATELLITE_TZ sat
    Then the SATELLITE_TZ table should contain expected data
      | CUSTOMER_PK | HASHDIFF                                             | CUSTOMER_NAME | CUSTOMER_DOB | CUSTOMER_PHONE  | EFFECTIVE_FROM             | LOAD_DATE                  | SOURCE |
      | md5('1002') | md5('1995-08-07\|\|1002\|\|BETH\|\|17-214-233-1215') | Beth          | 1995-08-07   | 17-214-233-1215 | 2019-05-03 12:00:00.000000 | 2019-05-03 12:00:00.000000 | *      |
      | md5('1002') | md5('1995-08-08\|\|1002\|\|BETH\|\|17-214-233-1215') | Beth          | 1995-08-08   | 17-214-233-1215 | 2019-05-03 12:00:01.000000 | 2019-05-03 12:00:01.000000 | *      |

The above scenario ends up loading 4 records instead of the expected 2.

Expected behavior Only 2 records are loaded instead of 4. See the scenario above.

Additional context This was raised by a user on Slack and we are reporting this here to make the community aware.

We are actively working on a hotfix for release as soon as possible. Thank you to the community for reporting this!

How was this missed in your testing?

We have about 100 tests just for Satellites, around 20 of which were added for the new load approach.

Previously existing tests were passing and proved we had not broken the old behaviour for Satellites.

Whilst we had many test for idempotency, this exact situation was not being tested for.

DVAlexHiggs commented 10 months ago

Hi All! Just an update.

After some investigation this is not a bug as such. It comes from the fact the user was loading the exact same stage data twice.

This fixes we put in place allow for this within a batch but if you load a batch twice, this may still occur.

We will be putting some checks and changes in place to avoid loading the data again should this occur, however we recommend loading and staging Data according to the Data Vault standards; try to load different data in each load, using delta feeds ideally.

We'll keep this issue updated! Thanks

DVAlexHiggs commented 10 months ago

Hi everyone! Just a quick update. We have a fix and will be releasing this early next week after some further testing.

We will be releasing the fix as a config option which users will need to enable, rather than it being default behaviour. This is because it may have an impact on performance, and a future addition (water-levels) to AutomateDV will make this feature unnecessary in most use cases.

DVAlexHiggs commented 10 months ago

Fixed in v0.10.1