Open FrenkLoonen opened 1 year ago
Thanks for this report! Just to let you know it has not gone unnoticed 😄
Hi and thanks for reporting this. We've had some internal discussions and believe from the information you've provided here that it is working as intended.
A change in a PK/HK should mean a change in the record, so just because PK1 is not depended on by anything, the fact it is changing does mean something.
This is our understanding of what you have provided:
| PK1 | PK2 | FK |
| --- | --- | --- |
| 123 | AAA | ZZZ |
| 123 | BBB | YYY |
| 456 | AAA | ZZZ |
```yaml
src_pk: [PK1, PK2]
src_fk: [FK1, FK2]
...
```

```yaml
src_pk: [PK1, PK2]
src_dfk: PK2
src_sfk: FK1, FK2
...
```
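To make the point about key changes concrete, here is a small Python sketch using the sample rows above. The MD5-over-concatenated-columns scheme only loosely mimics how dbtvault derives hash keys; it is an illustration, not the actual implementation. When PK1 is part of the hash key, rows 123+AAA and 456+AAA hash differently even though PK2 and FK are identical, so a change in PK1 does mean something:

```python
import hashlib

def hash_key(*parts):
    # Concatenate key parts with a delimiter and MD5 them,
    # loosely mimicking a Data Vault hash key derivation.
    return hashlib.md5("||".join(parts).encode()).hexdigest()

# Sample rows from the table above: (PK1, PK2, FK)
rows = [
    ("123", "AAA", "ZZZ"),
    ("123", "BBB", "YYY"),
    ("456", "AAA", "ZZZ"),
]

# Hash key over the full composite PK [PK1, PK2]: all three rows differ.
full_hks = {hash_key(pk1, pk2) for pk1, pk2, fk in rows}
print(len(full_hks))  # 3

# Hash key over [PK2, FK] only: AAA+ZZZ appears twice, so two rows collide.
partial_hks = [hash_key(pk2, fk) for pk1, pk2, fk in rows]
print(len(set(partial_hks)))  # 2
```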
I would suggest that potentially, your data model might need re-thinking.
Having said all of this, after working through the above example, I am not sure how you are configuring your models or what the metadata would look like to achieve what you have described. My guess above has a few issues.
Let's continue to discuss this and see if we can get to the bottom of it
I added a payload column to the example, and specified what my dbtvault config would look like. The Attr column is dependent on the PK of the source file ([PK1, PK2]). But the source file is actually already a join between several tables from the source, so I really had to do proper analysis to see what concepts are in there, which are the keys and which are attributes, and to which key each attribute belonged. So that's how I found out that FK is dependent only on PK2, which is why I modeled [PK2, FK] as a separate link with an eff_sat on top of it. If that's completely wrong, that could be my mistake. I would have wanted to receive separate source files with the individual concepts but unfortunately, going back to the source is not an option.
| PK1 | PK2 | FK | Attr |
| -- | -- | -- | -- |
| 123 | AAA | ZZZ | 1A |
| 123 | BBB | YYY | 1B |
| 456 | AAA | ZZZ | 4A |

Link1
```yaml
src_pk: [PK1, PK2]
src_fk: [PK1, PK2]
```

Sat1

```yaml
src_pk: [PK1, PK2]
src_payload: [Attr]
```

Link2

```yaml
src_pk: [PK2, FK]
src_fk: [PK2, FK]
```

Eff Sat

```yaml
src_pk: [PK2, FK]
src_dfk: [PK2]
src_sfk: [FK]
```
Hi 👋 - I've run into a very similar (if not the same) issue with regular satellites (sat macro) sourcing from a table at a different grain.
Describe the bug
Consider one source table: Projects. A project has a Client attribute, and the same Client value may be associated with many Projects. I have a Client satellite (and a related Client hub) sourced from the Projects table, with src_pk defined as the hash of the Client attribute and src_hashdiff defined as the hash of the client-specific attributes. Important note: src_eff is defined as the project-level modified date because no client-level modified date is available. The satellite is materialized as an incremental model with on_schema_change='append_new_columns'.
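For reference, a hypothetical sketch of what the satellite metadata described above might look like; the stage model name and the payload column names are assumptions, not the actual config:

```yaml
# Hypothetical metadata sketch; stage and attribute names are assumed.
source_model: v_stg_projects
src_pk: CLIENT_HK            # hash of the Client attribute
src_hashdiff: CLIENT_CK      # hash of client-specific attributes
src_payload:
  - CLIENT_NAME              # assumed client attribute columns
  - CLIENT_REGION
src_eff: EFFECTIVE_FROM      # project-level modified date (no client-level date exists)
src_ldts: LOAD_DATETIME
src_source: SOURCE
```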
When I run the satellite model for the first time (or full-refresh), the table gets duplicate rows with the same Hash Key + Change Key values. The duplicates correlate with Project rows.
For example, if there are 2 Projects in the source table with the same Client value, the satellite gets populated with two rows with the same Client HK and same Client CK.
This goes against my expectation that only unique rows (by HK+CK) would be loaded.
Environment

- dbt version: 1.5.2
- dbtvault version: 0.9.2
- Database/Platform: Snowflake
To Reproduce
See bug description.
Expected behavior
On first run (or full-refresh) of the satellite model, the resulting table has unique HK+CK combinations as defined for that satellite.
Why I think this is happening
I dug into the compiled SQL and I think this issue comes from the way dbt's incremental materialization works. Here is a slimmed-down version of the compiled SQL:
```sql
WITH source_data AS (
    SELECT a.CLIENT_HK, a.CLIENT_CK, a.<client_attribute_columns>, a.EFFECTIVE_FROM, a.LOAD_DATETIME, a.SOURCE
    FROM <prime_stage_for_projects_view> AS a
    WHERE a.CLIENT_HK IS NOT NULL
),
latest_records AS (
    SELECT a.CLIENT_HK, a.CLIENT_CK, a.LOAD_DATETIME
    FROM (
        SELECT current_records.CLIENT_HK, current_records.CLIENT_CK, current_records.LOAD_DATETIME,
               RANK() OVER (
                   PARTITION BY current_records.CLIENT_HK
                   ORDER BY current_records.LOAD_DATETIME DESC
               ) AS rank
        FROM <sat_for_clients_table> AS current_records
        JOIN (
            SELECT DISTINCT source_data.CLIENT_HK
            FROM source_data
        ) AS source_records
            ON current_records.CLIENT_HK = source_records.CLIENT_HK
    ) AS a
    WHERE a.rank = 1
),
records_to_insert AS (
    SELECT DISTINCT stage.CLIENT_HK, stage.CLIENT_CK, stage.<client_attribute_columns>, stage.EFFECTIVE_FROM, stage.LOAD_DATETIME, stage.SOURCE
    FROM source_data AS stage
    LEFT JOIN latest_records
        ON latest_records.CLIENT_HK = stage.CLIENT_HK
        AND latest_records.CLIENT_CK = stage.CLIENT_CK
    WHERE latest_records.CLIENT_CK IS NULL
)
SELECT * FROM records_to_insert;
```
On first run (or full-refresh), the `records_to_insert` CTE results in all records from `source_data` (the prime stage view built on the Projects table), because every row meets the criterion `latest_records.CLIENT_CK IS NULL`; after all, there are no current records in the table on the first run. For a DV satellite, I would expect the `records_to_insert` CTE to select distinct only on `stage.CLIENT_HK` and `stage.CLIENT_CK`.
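To illustrate, here is a small sketch using SQLite as a stand-in for the warehouse (simplified schema, made-up values; not the actual macro SQL). It replays the simplified `records_to_insert` logic against an empty target, showing that `DISTINCT` over all columns cannot collapse the two rows because `EFFECTIVE_FROM` differs per Project, and then shows the HK+CK deduplication I would expect:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Prime stage built on Projects: two Projects share one Client, so the
# Client hash key (HK) and change key (CK) repeat, but EFFECTIVE_FROM
# (the project-level modified date) differs per project row.
cur.execute("CREATE TABLE source_data "
            "(CLIENT_HK TEXT, CLIENT_CK TEXT, EFFECTIVE_FROM TEXT, LOAD_DATETIME TEXT)")
cur.executemany("INSERT INTO source_data VALUES (?, ?, ?, ?)",
                [("hk1", "ck1", "2023-05-01", "2023-06-01"),
                 ("hk1", "ck1", "2023-05-15", "2023-06-01")])

# Empty satellite target, as on a first run / full refresh.
cur.execute("CREATE TABLE sat "
            "(CLIENT_HK TEXT, CLIENT_CK TEXT, EFFECTIVE_FROM TEXT, LOAD_DATETIME TEXT)")

# Simplified records_to_insert: the anti-join keeps every stage row because
# the target is empty, and DISTINCT cannot collapse them since EFFECTIVE_FROM differs.
dupes = cur.execute("""
    SELECT DISTINCT stage.*
    FROM source_data AS stage
    LEFT JOIN sat AS latest
      ON  latest.CLIENT_HK = stage.CLIENT_HK
      AND latest.CLIENT_CK = stage.CLIENT_CK
    WHERE latest.CLIENT_CK IS NULL
""").fetchall()
print(len(dupes))  # 2 -> two rows with the same HK+CK would be inserted

# Deduplicating on HK+CK instead (keeping the earliest EFFECTIVE_FROM)
# yields the single row I would expect the satellite to load.
unique = cur.execute("""
    SELECT CLIENT_HK, CLIENT_CK, EFFECTIVE_FROM, LOAD_DATETIME
    FROM (
        SELECT s.*, ROW_NUMBER() OVER (
                   PARTITION BY CLIENT_HK, CLIENT_CK
                   ORDER BY EFFECTIVE_FROM) AS rn
        FROM source_data AS s
    )
    WHERE rn = 1
""").fetchall()
print(len(unique))  # 1
```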
Workarounds
Questions
Is there a recommended workaround for this? Is there a custom materialization available (or in the works) for satellites which solves this scenario?
Hopefully this is enough detail to understand the issue. Thank you!
Hi! We believe this is related to the improvements we recently made to standard satellites, which addressed a long-standing and fundamental restriction: satellites expected one instance of a key per batch.
If users don't meet these expectations then unpredictable behavior will occur.
The recent improvements we've made to standard satellites are something we're planning to roll out to the other satellite types, including effectivity satellites.
I cannot 100% confirm that this is the root cause, but from everything contributed here I can make a fair assumption. All of our tests for effectivity satellites are currently passing; however, they assume the standard Data Vault 2.0 loading expectations (true delta feeds, only changes).
We appreciate the ongoing discussion and the information provided. We'll make sure to test the existing effectivity satellites using the provided information to reproduce the issues being encountered, and of course add regression tests for these edge cases once a fix is in place.
Additional thought: @EliasAthey, are you including the collision key in your hash key? It looks like you're treating it as part of a composite PK instead.
I'm not at my desk right now, so I can't look into this further, but I suspect this would solve your issue. Whether or not the composite PK should be treated as unique (I think it should be, and you shouldn't be seeing this issue regardless) will also need to be looked at.
I've noticed the same thing on initial loads when building the effectivity sat on a stage view over the source table. It simply loads everything for the link hash key, even if the relationship hasn't changed. Our workaround has been to develop a dedicated raw stage model just to surface distinct relationship changes over our source data and feed this to the eff sat macro (not ideal).
I'd have thought bringing in some of the features of the standard macro, such as using a hashdiff to bring in only changes to the relationship and adding new records only when the hashdiff changes on the driving key, would be the way to go. That would remove the need to create a dedicated raw stage model just for housing data on a relationship change.
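To sketch that suggestion, here is a small Python illustration (made-up data and a toy MD5 hashdiff, not the actual macro logic) of loading a relationship record for a driving key only when its hashdiff changes between batches:

```python
import hashlib

def hashdiff(*cols):
    # Toy hashdiff over the relationship columns, analogous to a
    # satellite hashdiff; not the real dbtvault implementation.
    return hashlib.md5("||".join(cols).encode()).hexdigest()

# Hypothetical daily feeds of (driving key, secondary key) pairs.
batches = [
    [("AAA", "ZZZ"), ("BBB", "YYY")],  # day 1: initial relationships
    [("AAA", "ZZZ"), ("BBB", "YYY")],  # day 2: nothing changed
    [("AAA", "XXX"), ("BBB", "YYY")],  # day 3: AAA moved to XXX
]

latest = {}   # driving key -> last loaded hashdiff
loaded = []   # records the eff sat would actually insert

for day, batch in enumerate(batches, start=1):
    for dfk, sfk in batch:
        hd = hashdiff(dfk, sfk)
        if latest.get(dfk) != hd:      # only load on a relationship change
            latest[dfk] = hd
            loaded.append((day, dfk, sfk))

# Day 2 loads nothing; day 3 loads only the changed AAA relationship.
print(loaded)  # [(1, 'AAA', 'ZZZ'), (1, 'BBB', 'YYY'), (3, 'AAA', 'XXX')]
```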
Describe the bug
When I load an eff_sat with a driving key that is not the PK of the source table, I end up with duplicates. This only happens during an initial load.
Environment

- dbt version: 1.2.0
- dbtvault version: 0.9.2
- Database/Platform: Databricks, using the Snowflake macro
To Reproduce
Say I have a source model with 3 columns. PK1 and PK2 make up the PK of the model. FK is directly dependent only on PK2.
Model a link and an eff_sat on PK2+FK
Expected behavior
In the link macro that will be used to load from PK2 & FK, a distinct happens, so AAA+ZZZ will only be loaded once. In the eff_sat macro there are distincts as well, but not in the right place, imo. So now AAA+ZZZ will be loaded twice. When I put a distinct in the source_data CTE on line 32, the duplicates are gone. Also, I think the other distincts (lines 74, 97, 123 and 144) can then be removed. (And maybe even change the unions in the records_to_insert CTE to union all?)
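A toy illustration of the duplicate and the fix, using SQLite to stand in for the warehouse (the rows are the sample data from this thread, not the real staging SQL): because the source is at [PK1, PK2] grain but the eff sat only selects PK2 and FK, the same relationship is staged twice unless a distinct is applied in the source_data CTE.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Source at the grain of [PK1, PK2]; the sample rows from this thread.
cur.execute("CREATE TABLE src (PK1 TEXT, PK2 TEXT, FK TEXT)")
cur.executemany("INSERT INTO src VALUES (?, ?, ?)",
                [("123", "AAA", "ZZZ"),
                 ("123", "BBB", "YYY"),
                 ("456", "AAA", "ZZZ")])

# Selecting only PK2 and FK without a distinct: AAA+ZZZ is staged twice,
# which is what seeds the duplicates on an initial eff sat load.
plain = cur.execute("SELECT PK2, FK FROM src").fetchall()
print(len(plain))    # 3

# With a distinct in the source_data CTE: one row per relationship,
# matching what the link macro already does.
deduped = cur.execute("SELECT DISTINCT PK2, FK FROM src").fetchall()
print(len(deduped))  # 2
```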
AB#5368