Datavault-UK / automate-dv

A free-to-use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open-source data engineering tool and a registered trademark of dbt Labs)
https://www.automate-dv.com
Apache License 2.0

[BUG] Duplicate records in effectivity satellites (Incremental) #62

Closed · yfzhangleo closed this issue 2 years ago

yfzhangleo commented 3 years ago

Describe the bug: For a record that is closed (end_date does not equal '9999-12-31'), running dbt run -m ${model_name} a second time inserts the same record into the eff_sat table again.

Versions

dbt: 0.20.0
dbtvault: 0.7.6.1

To Reproduce: Steps to reproduce the behavior:

  1. Set up a staging table containing a record whose end_date is not '9999-12-31'
  2. Set up an effectivity satellite model whose source_model is the staging table from step 1
  3. Run dbt run -m ${model_name} twice
  4. Observe duplicate records in the effectivity satellite table

Expected behavior: The second dbt run should not insert the same closed record again.

Screenshots: [screenshot attached: Screenshot from 2021-09-16 14-22-39]

Additional context: I think this is a bug in dbtvault.eff_sat, https://github.com/Datavault-UK/dbtvault/blob/master/macros/tables/eff_sat.sql#L56

latest_closed AS (
    SELECT {{ dbtvault.alias_all(source_cols, 'd') }}
    FROM latest_records AS d
    WHERE TO_DATE(d.{{ src_end_date }}) != TO_DATE('9999-12-31')
),

{# Identifying the completely new link relationships to be opened in eff sat -#}
new_open_records AS (
    SELECT DISTINCT
        {{ dbtvault.alias_all(source_cols, 'f') }}
    FROM source_data AS f
    LEFT JOIN latest_records AS lr
    ON f.{{ src_pk }} = lr.{{ src_pk }}
    WHERE lr.{{ src_pk }} IS NULL
),

{# Identifying the currently closed link relationships to be reopened in eff sat -#}
new_reopened_records AS (
    SELECT DISTINCT
        lc.{{ src_pk }},
        {{ dbtvault.alias_all(fk_cols, 'lc') }},
        lc.{{ src_start_date }} AS {{ src_start_date }},
        g.{{ src_end_date }} AS {{ src_end_date }},
        g.{{ src_eff }} AS {{ src_eff }},
        g.{{ src_ldts }},
        g.{{ src_source }}
    FROM source_data AS g
    INNER JOIN latest_closed AS lc
    ON g.{{ src_pk }} = lc.{{ src_pk }}
),

A record whose status is closed (end_date does not equal '9999-12-31') in eff_sat appears in latest_records, and therefore in latest_closed. If the status of that record does not change in the staging table, it is not filtered out of new_reopened_records and so is inserted again.

It seems like adding a WHERE clause in new_reopened_records filtering on end_date = TO_DATE('9999-12-31') could solve the problem.
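
For illustration, here is a sketch of how that filter might look in the new_reopened_records CTE quoted above. The direction of the comparison (only treating a relationship as reopened when the incoming staged record is open) is one reading of the suggestion, not an official dbtvault fix:

new_reopened_records AS (
    SELECT DISTINCT
        lc.{{ src_pk }},
        {{ dbtvault.alias_all(fk_cols, 'lc') }},
        lc.{{ src_start_date }} AS {{ src_start_date }},
        g.{{ src_end_date }} AS {{ src_end_date }},
        g.{{ src_eff }} AS {{ src_eff }},
        g.{{ src_ldts }},
        g.{{ src_source }}
    FROM source_data AS g
    INNER JOIN latest_closed AS lc
    ON g.{{ src_pk }} = lc.{{ src_pk }}
    -- sketch: skip records that are still closed in the staging data
    WHERE TO_DATE(g.{{ src_end_date }}) = TO_DATE('9999-12-31')
),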

DVAlexHiggs commented 3 years ago

Thanks for the report. Can you tell me what materialisation you are using please?

yfzhangleo commented 3 years ago

Hi, the materialisation is view.

DVAlexHiggs commented 3 years ago

Thanks. Raw vault structures should never be materialised as views. This will break their behaviour because they require data to be stored over time. Please try it using incremental instead and let us know if you see the same behaviour.

yfzhangleo commented 3 years ago

Sorry, I checked the wrong value by mistake. It turns out we are actually using incremental.

DVAlexHiggs commented 3 years ago

OK, that makes sense. I think this is similar to Datavault-UK/dbtvault#50, in that the advised (though not well documented) materialisation is vault_insert_by_period, and using incremental can cause some unpredictable behaviour. We need to document this better and also put some checks/safety nets in place to prevent things breaking quite so easily.

yfzhangleo commented 3 years ago

I think these are two different scenarios. In Datavault-UK/dbtvault#50, the two records have different load dates, whereas in my case the same record (every column identical) is inserted multiple times when the record is closed (end_date does not equal '9999-12-31') in both the staging model and the eff_sat model. I don't think using vault_insert_by_period will work.

ahuang0808 commented 3 years ago

I think the issue here is that the closed record ends up being inserted into eff_sat again on every dbt run.

When we build

new_reopened_records AS (
    SELECT ...
    FROM source_data AS g
    INNER JOIN latest_closed AS lc
    ON g.{{ src_pk }} = lc.{{ src_pk }}
)

the records coming from source_data are not necessarily all open, which means some of the records in new_reopened_records may actually still be closed; these closed records are then inserted into eff_sat again.

DVAlexHiggs commented 3 years ago

Understood. We'll double-check and do some experimentation internally, as I'm sure we have tests for this that are passing.

DVAlexHiggs commented 3 years ago

Just conferred with a colleague. Effectivity satellites have only ever been tested with vault_insert_by_period, which is the recommended materialisation for all raw vault structures that contain history (everything but Hubs and Links). Can you move to this materialisation and let us know if the problem is fixed? If not, then this is definitely a bug.

The reason we advise vault_insert_by_period instead of incremental is that vault_insert_by_period includes additional logic to ensure only the most recent (new, previously unloaded) records are loaded into the raw vault. This follows Data Vault 2.0 guidance: we don't want to scan or perform logic on the entire history on every load, only on the difference/delta since the last load. Duplicates appear when using incremental because we aren't reducing the size of the stage when loading into the raw vault: dbtvault will still attempt to load the full history each time.
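
For reference, an effectivity satellite model using vault_insert_by_period is configured roughly like the sketch below. The config parameters follow the dbtvault documentation for this release, but the column names, dates and the v_stg_order_customer model are placeholders, so check them against your own project and version of the docs:

{{ config(materialized='vault_insert_by_period',
          timestamp_field='LOAD_DATETIME',
          period='day',
          start_date='2021-01-01') }}

{{ dbtvault.eff_sat(src_pk='ORDER_CUSTOMER_PK',
                    src_dfk='ORDER_PK', src_sfk='CUSTOMER_PK',
                    src_start_date='START_DATE', src_end_date='END_DATE',
                    src_eff='EFFECTIVE_FROM', src_ldts='LOAD_DATETIME',
                    src_source='RECORD_SOURCE',
                    source_model='v_stg_order_customer') }}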

DVAlexHiggs commented 3 years ago

Some further internal discussion:

Generally we want to avoid suggesting vault_insert_by_period; however, we have found that many users are trying to load the entire stage history with each load. The easiest workaround is to use vault_insert_by_period, hence the suggestion.

We have mentioned in many places in the documentation that each load should only ever load new records, not the whole history. In theory, if the stage data only ever contains new records, the incremental materialisation should work as expected, but we have a gap in our testing here because we decided to go down the route of recommending vault_insert_by_period for tables containing history.

Moving forward, we have decided to make incremental the base case, with the strict caveat that the stage tables need to contain delta feeds, as per Data Vault 2.0 recommendations and standards. This will be quite a big overhaul of some of the macros and structures in dbtvault, so it's not something we can do overnight. For now, please use vault_insert_by_period.

ahuang0808 commented 3 years ago

Okay, we'll try vault_insert_by_period to see if that works.

Thank you.

DVAlexHiggs commented 2 years ago

We will ensure that the methods for loading the Raw Vault are documented more clearly. This is not a bug as such, but a mismatch between the intended implementation of Data Vault 2.0 and dbtvault.

In short: a single load/batch (i.e. the dataset present in the stage layer) should contain only a delta set of data: the changes since the last load. This means each batch should only contain unique PKs. We can have 10 different PKs, but we cannot have any of those 10 PKs appear again in the same load.
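
As a rough illustration only, a delta-only stage feed might be produced with something like the following; the source name, column name and the last_load_datetime variable are placeholders rather than dbtvault functionality:

SELECT *
FROM {{ source('raw', 'order_customer') }}
-- keep only records that arrived after the previous load (placeholder variable)
WHERE LOAD_DATETIME > '{{ var("last_load_datetime") }}'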

Closing in favour of Datavault-UK/dbtvault-docs#44. Thank you.