Closed: dlouseiro closed this issue 3 years ago
I don't see a use case for which we should revert this deduplication; it makes sense to me! Besides, shouldn't your example have the same hashdiff? If the hashdiffs are different, both records would be inserted 🤔, hence not a problem.
My main concern with reverting it is that, by letting the load fail, you have no control over the resolution of the bug (which lives in another system).
These are two records in the same staging table, hence with the same `r_timestamp`, so we cannot insert both: the primary key would be violated.
If the records had the same hashdiff, it would mean the rows contained exactly the same data, so the load would not fail; they would be deduplicated by the `DISTINCT`.
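A minimal sketch of this situation, assuming a simplified satellite layout; the table and column names (`satellite_x`, `staging_x`, `entity_x_hashkey`, `s_hashdiff`) are illustrative and not the DDL/SQL that diepvries actually generates:

```sql
-- Illustrative, simplified satellite: the business key hash plus the load
-- timestamp of the staging batch form the primary key.
CREATE TABLE satellite_x (
    entity_x_hashkey   CHAR(32)  NOT NULL,
    r_timestamp        TIMESTAMP NOT NULL,  -- load timestamp (same for the whole batch)
    r_source           VARCHAR   NOT NULL,
    s_hashdiff         CHAR(32)  NOT NULL,
    status             VARCHAR,
    modified_timestamp TIMESTAMP,
    PRIMARY KEY (entity_x_hashkey, r_timestamp)
);

-- Two staging rows for the same entity share the same r_timestamp (same batch).
-- If they are byte-identical (same hashdiff), DISTINCT collapses them and the
-- insert succeeds; if they differ, both rows survive the DISTINCT and the
-- primary key (entity_x_hashkey, r_timestamp) is violated.
INSERT INTO satellite_x
SELECT DISTINCT
    entity_x_hashkey,
    r_timestamp,
    r_source,
    s_hashdiff,
    status,
    modified_timestamp
FROM staging_x;
```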
I did not understand this part: "My main concern with reverting it is that, by letting the load fail, you have no control over the resolution of the bug (which lives in another system)."
Context
The current implementation of `diepvries` deduplicates the records stored in the staging table based on the record source and hashdiff when loading data into satellites. However, this deduplication is not conceptually correct: it is not based on any operational timestamp, only on the content of the data (the hashdiff), so the record that survives is effectively arbitrary.

Example:
Imagine a DV process loading a satellite X that contains two fields: a `status` representing the status of entity X and a `modified_timestamp` representing the modification timestamp of the record in the source system. This satellite contains the following records for entity X:

1. `status='open'`, `modified_timestamp='2021-07-29T00:00:00Z'`, `hashdiff='ABCDE'`;
2. `status='closed'`, `modified_timestamp='2021-07-29T01:00:00Z'`, `hashdiff='EDCBA'`.

Record 2 is the most recent record based on the operational timestamp; however, with the current implementation, Record 1 would be the one loaded into the satellite.
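To make the effect concrete, here is a hedged sketch of a content-based deduplication. The SQL that diepvries actually generates is not reproduced here, and the names `staging_x`, `entity_x_hashkey`, and `s_hashdiff` are illustrative:

```sql
-- Illustrative only: deduplicating on content (hashdiff) with no operational
-- ordering makes the surviving row effectively arbitrary.
SELECT *
FROM staging_x
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY entity_x_hashkey, r_timestamp
    ORDER BY s_hashdiff          -- content-based, ignores modified_timestamp
) = 1;

-- For the example above, 'ABCDE' sorts before 'EDCBA', so Record 1
-- (status='open') survives even though Record 2 (status='closed') is newer.
-- An operationally correct deduplication would instead order by the source
-- system's timestamp:
--   ORDER BY modified_timestamp DESC
```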
Solution
This deduplication strategy was reverted, letting the Data Vault process fail if duplicates are detected.
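For illustration, failing fast can be as simple as a pre-load check like the sketch below; this is not necessarily how diepvries detects the duplicates, and the names are again illustrative:

```sql
-- Illustrative duplicate check: if this query returns any rows, the staging
-- batch contains conflicting records for the same entity and the load should
-- fail instead of silently keeping an arbitrary one.
SELECT
    entity_x_hashkey,
    r_timestamp,
    COUNT(*) AS duplicate_count
FROM staging_x
GROUP BY entity_x_hashkey, r_timestamp
HAVING COUNT(*) > 1;
```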