Datavault-UK / automate-dv

A free-to-use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open-source data engineering tool, registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0
511 stars 131 forks

[BUG] src_eff not used for ordering in satellites #253

Open mrjovana opened 1 week ago

mrjovana commented 1 week ago

Describe the bug

Hello. I'm not sure whether this is a bug or should be a feature request, but isn't src_eff supposed to serve as an additional ordering parameter, e.g. for getting latest_records and first_record_in_set? Without it, records that are not supposed to be there are loaded. Example data sets:

sat_example:

| id | attr  | src_eff          | load_dt  |
|----|-------|------------------|----------|
| 1  | test1 | 01.01.2024 12:00 | 20240102 |
| 1  | test2 | 01.01.2024 15:00 | 20240102 |

staged incoming data:

| id | attr  | src_eff          | load_dt  |
|----|-------|------------------|----------|
| 1  | test2 | 02.01.2024 11:00 | 20240103 |
| 1  | test3 | 02.01.2024 16:00 | 20240103 |

Result:

sat_example:

| id | attr  | src_eff          | load_dt  |
|----|-------|------------------|----------|
| 1  | test1 | 01.01.2024 12:00 | 20240102 |
| 1  | test2 | 01.01.2024 15:00 | 20240102 |
| 1  | test2 | 02.01.2024 11:00 | 20240103 |
| 1  | test3 | 02.01.2024 16:00 | 20240103 |

Environment

dbt version: 1.8.7
automate_dv version: 0.11.0
Database/Platform: BigQuery


Expected behavior

The test2 attribute shouldn't be loaded, because it is the last state of the 20240102 load_dt and also the first record in 20240103 (if the data is ordered by both load_dt and src_eff).

I would say this should be expected behaviour.

Any feedback is very much appreciated :)

DVAlexHiggs commented 1 week ago

Hi! Thanks for your question and report. src_eff is an optional parameter and is used for post-processing downstream of the Satellite, allowing you to apply business rules and additional logic later on to decide what is effective and what isn't. In Data Vault 2.0, Satellites are no longer 'end-dated', as this would violate the insert-only standard.

Essentially, it's there to provide the option to include bi-temporality in Satellites; but remember Satellites are still raw vault objects, so we do not apply any business logic to them until we get to the business vault or presentation layer.

Hope this helps. This is not a planned feature as it goes against standards and doesn't fit the philosophy of Satellites, so I will close this. Thank you

mrjovana commented 1 week ago

Hi @DVAlexHiggs , thanks for taking the time to reply on this.

So basically, the current sat macro can't support multiple attribute changes sent from the source within the same batch (load_dt) in a properly ordered manner? This implementation will cause a bunch of duplicates in these cases.

DVAlexHiggs commented 1 week ago

> Hi @DVAlexHiggs , thanks for taking the time to reply on this.
>
> So basically, the current sat macro can't support multiple attribute changes sent from the source within the same batch (load_dt) in a properly ordered manner? This implementation will cause a bunch of duplicates in these cases.

It's not the job of the satellite to apply business rules about which records are effective or not - it's a raw vault object, i.e. we load everything we can get. The satellite will load whatever it is given when there are hashdiff changes - these are not true duplicates.

The src_eff can then be used downstream to determine business logic and rules around the effectivity of records. A common approach is to create a 'latest satellite' view which applies a window function to the Satellite to just pull out the latest version of each record. You can use the src_eff to support additional logic here.
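As a minimal sketch, such a view might look like this (the view name and columns follow the example in this thread; this is illustrative, not an automate-dv object):

```sql
-- Minimal "latest satellite" view: one row per business key, the most
-- recently loaded version. src_eff is carried through so downstream
-- business rules can use it for effectivity logic.
CREATE OR REPLACE VIEW sat_example_latest AS
SELECT id, attr, src_eff, load_dt
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY id
               ORDER BY load_dt DESC
           ) AS rn
    FROM sat_example
)
WHERE rn = 1;
```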

This is an integral part of the Data Vault standards and how it works, so I suggest exploring this topic further to deepen your understanding.

I hope this helps! If I haven't fully understood what you need, please feel free to share more details—I'm happy to assist further

mrjovana commented 1 week ago

Thanks again @DVAlexHiggs , but I am not sure we fully understand each other here :) I wouldn't like to add any business logic to a satellite load; rather, I'm talking about something like the "apply_date" concept, or sequence ordering - both of which are already recognized by DV standards - and for which src_eff might be handy.

No additional logic is added or expected; it's purely an ordering matter. When the sat needs to know the latest_record in the current data, it takes the last state of the attributes for a BK, ordered by LOAD_DT. But what if a BK had multiple changes within the same LOAD_DT? I don't think it would harm Data Vault 2.0 standards if "ordered by LOAD_DT desc" were expanded to "ordered by LOAD_DT desc, src_eff desc", so we can get the truly last record from the current set. When no additional ordering parameter is used, a record is picked at random from the multiple records with the same LOAD_DT. That isn't precise, and it can harm the hashdiff comparison...

So the only thing I'm talking about is potentially adding src_eff (which can represent an apply_dt or sequence date from the source) to the ordering, to get the latest record from the current set and the first record in the incoming source data. The sketch below shows what I mean.
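In pseudo-SQL, using the example table from above (illustrative only; this is not the actual automate-dv macro code):

```sql
-- Illustrative only: the proposed change is just the extra ORDER BY
-- column, so src_eff breaks ties between records sharing one LOAD_DT.
SELECT id, attr, src_eff, load_dt
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY id
               ORDER BY load_dt DESC, src_eff DESC  -- src_eff as tie-breaker
           ) AS rn
    FROM sat_example
)
WHERE rn = 1;
```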

Then instead of this (which has the test2 attribute/hashdiff as a duplicate):

sat_example:

| id | attr  | src_eff          | load_dt  |
|----|-------|------------------|----------|
| 1  | test1 | 01.01.2024 12:00 | 20240102 |
| 1  | test2 | 01.01.2024 15:00 | 20240102 |
| 1  | test2 | 02.01.2024 11:00 | 20240103 |
| 1  | test3 | 02.01.2024 16:00 | 20240103 |

...test2 can be recognized as the truly latest record from the 20240102 load_dt, and as the first record in 20240103, so it won't be loaded twice:

sat_example:

| id | attr  | src_eff          | load_dt  |
|----|-------|------------------|----------|
| 1  | test1 | 01.01.2024 12:00 | 20240102 |
| 1  | test2 | 01.01.2024 15:00 | 20240102 |
| 1  | test3 | 02.01.2024 16:00 | 20240103 |

DVAlexHiggs commented 1 week ago

OK, thank you for the further clarification; I need to verify a few things and double-check that I understand your point. I'll get back to you with a more in-depth response as soon as I can! Thank you for your patience.

DVAlexHiggs commented 6 days ago

EDIT

(see below for the original post)

@mrjovana Sorry - I realise now that the issue arises because the satellite has already been loaded with two records for the same PK and the same LDTS - this is what you were saying about the records with the same LDTS.

The problem is that this is not an expected situation for Satellites: if you're loading multiple records in one batch with different HASHDIFF values, then we'd also expect them to have different LDTS values. It's fine for them to be on the same day, but in reality those two records' values would not have been updated at the same time; they would be a few minutes or hours apart.

To summarise, it looks like the issue stems from the LDTS being a date and not a datetime, which means the LAG, as you correctly say, picks one at random. As far as I can see, there are three approaches to solving this:

  1. Use the correct value for the LOAD_DATETIME so that the records order correctly and show a timeline, instead of appearing to have all happened at the same time because the value is only a date. It could be that your EFFECTIVE_FROM is actually what you should be using as your LOAD_DATETIME.

  2. Use the EFFECTIVE_FROM in downstream business logic to de-dupe, as previously mentioned.

  3. Another way to handle this is to create a layer prior to staging which applies hard rules. This logic involves converting your LOAD_DATE to a DATETIME instead of a DATE, then using the EFFECTIVE_FROM to order and artificially increment the LOAD_DATETIME in milliseconds, mimicking the correct timeline according to the EFFECTIVE_FROM (see the sketch after this list).

    e.g.

    | ID | ATTR  | HASHDIFF | EFFECTIVE_FROM          | LOAD_DATETIME           |
    |----|-------|----------|-------------------------|-------------------------|
    | 1  | test2 | BBB      | 2024-01-02 11:00:00.000 | 2024-01-03 00:00:00.000 |
    | 1  | test3 | CCC      | 2024-01-02 16:00:00.000 | 2024-01-03 00:00:00.001 |

    instead of:

    | ID | ATTR  | HASHDIFF | EFFECTIVE_FROM          | LOAD_DATE  |
    |----|-------|----------|-------------------------|------------|
    | 1  | test2 | BBB      | 2024-01-02 11:00:00.000 | 2024-01-03 |
    | 1  | test3 | CCC      | 2024-01-02 16:00:00.000 | 2024-01-03 |
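A sketch of what such a hard-rule layer could look like in BigQuery SQL (source_data and the column names are placeholders for illustration):

```sql
-- Hard-rule staging sketch: cast LOAD_DATE to a DATETIME, then add a
-- per-PK, per-day millisecond offset ordered by EFFECTIVE_FROM so the
-- records form a real timeline within each batch.
SELECT
    id,
    attr,
    effective_from,
    DATETIME_ADD(
        CAST(load_date AS DATETIME),
        INTERVAL (rn - 1) MILLISECOND
    ) AS load_datetime
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY id, load_date
               ORDER BY effective_from
           ) AS rn
    FROM source_data
);
```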

Examples

I ran some sample scenarios through our test suite and found that it works as expected when we have proper (distinct) LDTS values; please see below.

Example 1

Using your first example, this works:

Satellite (Pre-load)

| ID | ATTR  | HASHDIFF | EFFECTIVE_FROM          | LOAD_DATETIME           |
|----|-------|----------|-------------------------|-------------------------|
| 1  | test1 | AAA      | 2024-01-01 12:00:00.000 | 2024-01-02 00:00:00.000 |
| 1  | test2 | BBB      | 2024-01-01 15:00:00.000 | 2024-01-02 00:00:00.001 |

Stage (To be loaded)

| ID | ATTR  | HASHDIFF | EFFECTIVE_FROM          | LOAD_DATETIME           |
|----|-------|----------|-------------------------|-------------------------|
| 1  | test2 | BBB      | 2024-01-02 11:00:00.000 | 2024-01-03 00:00:00.000 |
| 1  | test3 | CCC      | 2024-01-02 16:00:00.000 | 2024-01-03 00:00:00.001 |

Outcome (Satellite)

| ID | ATTR  | HASHDIFF | EFFECTIVE_FROM          | LOAD_DATETIME           |
|----|-------|----------|-------------------------|-------------------------|
| 1  | test1 | AAA      | 2024-01-01 12:00:00.000 | 2024-01-02 00:00:00.000 |
| 1  | test2 | BBB      | 2024-01-01 15:00:00.000 | 2024-01-02 00:00:00.001 |
| 1  | test3 | CCC      | 2024-01-02 16:00:00.000 | 2024-01-03 00:00:00.001 |

Example 2

This is another example where we try to load a duplicate record (same HASHDIFF, PK combination).

If we don't have the .001 millisecond increment on the BBB record, then the outcome is that the "To be loaded" record is loaded; with the increment, we get the outcome below, which is correct:

Satellite (Pre-load)

| ID | ATTR  | HASHDIFF | EFFECTIVE_FROM          | LOAD_DATETIME           |
|----|-------|----------|-------------------------|-------------------------|
| 1  | test1 | AAA      | 2024-01-01 12:00:00.000 | 2024-01-02 00:00:00.000 |
| 1  | test2 | BBB      | 2024-01-01 15:00:00.000 | 2024-01-02 00:00:00.001 |

Stage (To be loaded)

| ID | ATTR  | HASHDIFF | EFFECTIVE_FROM          | LOAD_DATETIME           |
|----|-------|----------|-------------------------|-------------------------|
| 1  | test2 | BBB      | 2024-01-02 11:00:00.000 | 2024-01-03 00:00:00.000 |

Outcome (Satellite)

| ID | ATTR  | HASHDIFF | EFFECTIVE_FROM          | LOAD_DATETIME           |
|----|-------|----------|-------------------------|-------------------------|
| 1  | test1 | AAA      | 2024-01-01 12:00:00.000 | 2024-01-02 00:00:00.000 |
| 1  | test2 | BBB      | 2024-01-01 15:00:00.000 | 2024-01-02 00:00:00.001 |

Closing

Am I now correct in my understanding? Sorry for the back and forth and confusion here! I am determined to ensure we are aligned and arrive at a solution though! Thank you.

ORIGINAL REPLY

Hello again, @mrjovana,

I’ve reviewed the back-end code and reproduced your initial example in our test suite. Here's your original scenario:

Satellite (Pre-load)

| ID | ATTR  | HASHDIFF | EFFECTIVE_FROM          | LOAD_DATETIME           |
|----|-------|----------|-------------------------|-------------------------|
| 1  | test1 | AAA      | 2024-01-01 12:00:00.000 | 2024-01-02 00:00:00.000 |
| 1  | test2 | BBB      | 2024-01-01 15:00:00.000 | 2024-01-02 00:00:00.000 |

Stage (To be loaded)

| ID | ATTR  | HASHDIFF | EFFECTIVE_FROM          | LOAD_DATETIME           |
|----|-------|----------|-------------------------|-------------------------|
| 1  | test2 | BBB      | 2024-01-02 11:00:00.000 | 2024-01-03 00:00:00.000 |
| 1  | test3 | CCC      | 2024-01-02 16:00:00.000 | 2024-01-03 00:00:00.000 |

Loaded Satellite (Current Implementation)

| ID | ATTR  | HASHDIFF | EFFECTIVE_FROM          | LOAD_DATETIME           |
|----|-------|----------|-------------------------|-------------------------|
| 1  | test1 | AAA      | 2024-01-01 12:00:00.000 | 2024-01-02 00:00:00.000 |
| 1  | test2 | BBB      | 2024-01-01 15:00:00.000 | 2024-01-02 00:00:00.000 |
| 1  | test2 | BBB      | 2024-01-02 11:00:00.000 | 2024-01-03 00:00:00.000 |
| 1  | test3 | CCC      | 2024-01-02 16:00:00.000 | 2024-01-03 00:00:00.000 |

Loaded Satellite (Expected Implementation)

| ID | ATTR  | HASHDIFF | EFFECTIVE_FROM          | LOAD_DATETIME           |
|----|-------|----------|-------------------------|-------------------------|
| 1  | test1 | AAA      | 2024-01-01 12:00:00.000 | 2024-01-02 00:00:00.000 |
| 1  | test2 | BBB      | 2024-01-01 15:00:00.000 | 2024-01-02 00:00:00.000 |
| 1  | test3 | CCC      | 2024-01-02 16:00:00.000 | 2024-01-03 00:00:00.000 |

After internal discussions, I can confirm this is indeed a bug. I’d like to provide some clarification regarding the confusion around the src_eff column.

My initial misunderstanding stemmed from interpreting your request as a need to use the src_eff column to address the issue. However, this column is intended solely as metadata and should not be involved in SQL logic during the Satellite load process. It is used downstream in business logic or rules, as previously mentioned.

Additionally, I misunderstood your scenario as having duplicate records with identical LDTS values, suggesting that src_eff was needed for ordering. However, in your case, the records' LDTS values are a day apart. Handling multiple records with the same PK and LDTS but different HASHDIFFs is a separate issue that requires a different approach.

Regarding the bug itself, I believe it is simpler than initially thought: if a record being loaded has a HASHDIFF identical to the latest existing HASHDIFF in the Satellite, it should not be loaded, unless it is part of a known "flip-flop" sequence. This is a no-brainer really, and it is how Satellites are supposed to work. In your case, since the HASHDIFF of the record (test2/BBB) matches the latest record in the Satellite, it should not be loaded again. However, the test3/CCC record should indeed be loaded.
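Expressed as a sketch (illustrative SQL using the example names from this thread; this is not the actual macro code):

```sql
-- A staged record is loaded only when its HASHDIFF differs from the
-- latest HASHDIFF already stored for that PK. Comparing against only
-- the latest record still lets known "flip-flop" sequences load.
WITH latest AS (
    SELECT id, hashdiff
    FROM (
        SELECT id, hashdiff,
               ROW_NUMBER() OVER (
                   PARTITION BY id
                   ORDER BY load_datetime DESC
               ) AS rn
        FROM sat_example
    )
    WHERE rn = 1
)
SELECT stg.*
FROM staged_data AS stg
LEFT JOIN latest
       ON stg.id = latest.id
WHERE latest.id IS NULL                 -- brand-new PK
   OR stg.hashdiff <> latest.hashdiff;  -- attributes actually changed
```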

Does this all make sense? Have I understood correctly?

Thank you for your patience, and your detailed example was really helpful in identifying and understanding this issue.