ScalefreeCOM / datavault4dbt

Scalefree's dbt package for a Data Vault 2.0 implementation congruent to the original Data Vault 2.0 definition by Dan Linstedt, including the Staging Area, DV2.0 main entities, PITs and Snapshot Tables.
https://www.scalefree.com/
Apache License 2.0

Issue with MA satellite #221

Closed MKloseINF closed 3 months ago

MKloseINF commented 3 months ago

The MA satellite only uses the hash key and the hashdiff to identify the latest entries in the sat; the multi-active attribute(s) are ignored in the partition clause. Qualifying with ROW_NUMBER() therefore returns exactly one hash key + hashdiff combination. But if you have multiple active rows for the same ldts in your satellite, only one of them is identified, and an unchanged row gets inserted again.

Example: Line is the MA attribute

source data, first load:

| HK   | HD   | Line | Quantity | LDTS      |
|------|------|------|----------|-----------|
| ABCD | 5678 | 1    | 5        | 14.8.2024 |
| ABCD | 3456 | 2    | -10      | 14.8.2024 |

MA sat after first load:

| HK   | HD   | Line | Quantity | LDTS      |
|------|------|------|----------|-----------|
| ABCD | 5678 | 1    | 5        | 14.8.2024 |
| ABCD | 3456 | 2    | -10      | 14.8.2024 |

source data, second load (maybe the same data because there is no increment from the source):

| HK   | HD   | Line | Quantity | LDTS      |
|------|------|------|----------|-----------|
| ABCD | 5678 | 1    | 5        | 15.8.2024 |
| ABCD | 3456 | 2    | -10      | 15.8.2024 |

MA sat after second load:

| HK   | HD   | Line | Quantity | LDTS      |
|------|------|------|----------|-----------|
| ABCD | 5678 | 1    | 5        | 14.8.2024 |
| ABCD | 3456 | 2    | -10      | 14.8.2024 |
| ABCD | 5678 | 1    | 5        | 15.8.2024 |

Which line of the second load gets identified as a diff is random, because of the ROW_NUMBER() function.
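For illustration, a minimal sketch of the deduplication described above, simplified from the macro output (the table and column names hk, hd, ldts are illustrative): partitioning only by the hash key keeps a single row per hash key, even though the multi-active group has two lines.

```sql
-- Simplified sketch (illustrative names): the "latest entries" dedup keeps
-- exactly one row per hashkey and silently drops the other rows of the
-- multi-active group.
SELECT
    hk,
    hd
FROM ma_satellite
QUALIFY ROW_NUMBER() OVER (PARTITION BY hk ORDER BY ldts DESC) = 1;
```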

tkirschke commented 3 months ago

Hi @MKloseINF and thanks for reaching out!

We use the block hashdiff for multi-activity, meaning that in your example all members of one multi-active group have the same hashdiff:

source data, first load:

| HK   | HD   | Line | Quantity | LDTS      |
|------|------|------|----------|-----------|
| ABCD | 5678 | 1    | 5        | 14.8.2024 |
| ABCD | 5678 | 2    | -10      | 14.8.2024 |

MA sat after first load:

| HK   | HD   | Line | Quantity | LDTS      |
|------|------|------|----------|-----------|
| ABCD | 5678 | 1    | 5        | 14.8.2024 |
| ABCD | 5678 | 2    | -10      | 14.8.2024 |

Then the second load would not be seen as a delta, and no insert happens. For more information about the block-hashdiff, generated by LISTAGG(), see our blog article here: https://www.scalefree.com/blog/data-vault/using-multi-active-satellites-the-correct-way-2-2/
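For illustration, a minimal sketch of how such a block hashdiff could be computed on Snowflake; the column names follow the example above, and the exact expression generated by the staging macro will differ:

```sql
-- Minimal sketch, not the package-generated SQL: one block hashdiff per
-- hashkey and load date, hashing the ordered concatenation of the whole
-- multi-active group, so every member of the group shares the same HD.
SELECT
    hk,
    ldts,
    MD5(
        LISTAGG(TO_VARCHAR(line) || '||' || TO_VARCHAR(quantity), '|')
            WITHIN GROUP (ORDER BY line)
    ) AS hd
FROM source_data
GROUP BY hk, ldts;
```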

Let me know if this helped!

MKloseINF commented 3 months ago

Hi @tkirschke,

unfortunately that doesn't help, and macros/tables/snowflake/ma_sat_v0.sql does not use the LISTAGG function.

Looking at the Jinja macro for the incremental case, ROW_NUMBER() = 1 gives back only one (random) row but ignores Lines 1 and 2 (the MA key) from the example above. The "records to insert" part of the SQL statement then identifies one row as new.

```
{# Get the latest record for each parent hashkey in existing sat, if incremental. #}
{%- if is_incremental() %}
latest_entries_in_sat AS (

    SELECT
        {{ parent_hashkey }},
        {{ ns.hdiff_alias }}
    FROM
        {{ this }}
    QUALIFY ROW_NUMBER() OVER(PARTITION BY {{ parent_hashkey|lower }} ORDER BY {{ src_ldts }} DESC) = 1

),
{%- endif %}
```

tkirschke commented 3 months ago

Hi @MKloseINF,

LISTAGG() or STRINGAGG() is used in the staging model as soon as multi-activity is configured. Can you confirm that your stage model has the multi-active config properly defined, as seen in Example 3 in the wiki: https://github.com/ScalefreeCOM/datavault4dbt/wiki/Staging#example-3
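For reference, a hedged sketch of what such a block could look like inside the stage model's yaml_metadata, loosely following Example 3 from the wiki as I read it (hk_test and line_number are illustrative names, not part of this issue's models):

```yaml
multi_active_config:
    multi_active_key: 'line_number'     # the multi-active attribute(s)
    main_hashkey_column: 'hk_test'      # hash key the multi-active group is built over
```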

Please send the stage model and the MA sat v0 model so we can help you better.

MKloseINF commented 3 months ago

Yes, I checked and everything looks good from my point of view.

Stage:

```
{{ config(materialized="table") }}

{%- set yaml_metadata -%}
source_model: 'DS_AFDPNLT0011'
ldts: 'CURRENT_TIMESTAMP'
rsrc: '!DS_AFDPNLT0011'
hashed_columns:
    hk_faabstnk_h:

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{%- set source_model = metadata_dict["source_model"] -%}
{%- set ldts = metadata_dict["ldts"] -%}
{%- set rsrc = metadata_dict["rsrc"] -%}
{%- set hashed_columns = metadata_dict["hashed_columns"] -%}
{%- set derived_columns = metadata_dict["derived_columns"] -%}
{%- set prejoined_columns = metadata_dict["prejoined_columns"] -%}
{%- set missing_columns = metadata_dict["missing_columns"] -%}

{{ datavault4dbt.stage(
    source_model=source_model,
    ldts=ldts,
    rsrc=rsrc,
    hashed_columns=hashed_columns,
    derived_columns=derived_columns,
    prejoined_columns=prejoined_columns,
    missing_columns=missing_columns,
) }}
```

MA Sat:

```
{{ config(materialized="incremental") }}

{%- set yaml_metadata -%}
source_model: 'DS_STG_AFDPNLT0011'
parent_hashkey: 'hk_AFDPNLT001_l'
src_hashdiff: 'hd_AFDPNLT0011_s'
src_ma_key:

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.ma_sat_v0(
    source_model=metadata_dict["source_model"],
    parent_hashkey=metadata_dict["parent_hashkey"],
    src_hashdiff=metadata_dict["src_hashdiff"],
    src_ma_key=metadata_dict["src_ma_key"],
    src_payload=metadata_dict["src_payload"],
) }}
```

tkirschke commented 3 months ago

Hi @MKloseINF

The Multi Active Config seems fine, but it's not passed into the stage macro:

```
{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{%- set source_model = metadata_dict["source_model"] -%}
{%- set ldts = metadata_dict["ldts"] -%}
{%- set rsrc = metadata_dict["rsrc"] -%}
{%- set hashed_columns = metadata_dict["hashed_columns"] -%}
{%- set derived_columns = metadata_dict["derived_columns"] -%}
{%- set prejoined_columns = metadata_dict["prejoined_columns"] -%}
{%- set missing_columns = metadata_dict["missing_columns"] -%}
{%- set multi_active_config = metadata_dict["multi_active_config"] -%}

{{
    datavault4dbt.stage(
        source_model=source_model,
        ldts=ldts,
        rsrc=rsrc,
        hashed_columns=hashed_columns,
        derived_columns=derived_columns,
        prejoined_columns=prejoined_columns,
        missing_columns=missing_columns,
        multi_active_config=multi_active_config
    )
}}
```

Let me know if this solves the issue.

MKloseINF commented 3 months ago

Indeed, you're right. We took the "wrong" example code, which is missing the multi-active config.

I added the missing lines, but now the generated code throws errors; see the attached file console_output.txt.

Does main_hashkey_column work with a link hash key? Looking at Example 3, hk_test is used there but is not specified anywhere in the definition code.

tkiehn commented 3 months ago

Hi @MKloseINF,

The error in your compiled code is caused by an additional comma; this was fixed here. Which version of datavault4dbt are you using?

Kind regards
Theo

MKloseINF commented 3 months ago

Hi @tkiehn

We were on version 1.3.1, so I did the upgrade to 1.3.3. The additional comma error is resolved, but now a new error occurs: invalid identifier 'HK_AFDPNLT001_L'. I can point out where it comes from: we are using an MA satellite for a link. The link hash key is calculated after the hashdiff, but we need the link hash key (HK_AFDPNLT001_L) in the hashdiff calculation, which fails because it is calculated later.

But it works after changing the order in the model spec: first the link hash key, then the hashdiff key.
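For illustration, a hedged sketch of that ordering in the stage model's hashed_columns, reusing the hash key and hashdiff names from the models above; the business key and payload columns are made up:

```yaml
hashed_columns:
    hk_AFDPNLT001_l:              # link hash key defined first ...
        - business_key_1          # illustrative business keys
        - business_key_2
    hd_AFDPNLT0011_s:             # ... then the hashdiff that references it
        is_hashdiff: true
        columns:
            - hk_AFDPNLT001_l
            - payload_column_1    # illustrative payload columns
            - payload_column_2
```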

Thanks a lot for your support

MKloseINF commented 3 months ago

Thanks a lot