PicnicSupermarket / diepvries

The Picnic Data Vault framework.
https://diepvries.picnic.tech
MIT License

Remove deduplication based on hashdiffs #14

Closed dlouseiro closed 3 years ago

dlouseiro commented 3 years ago

Context

The current implementation of diepvries deduplicates the records stored in the staging table based on the record source and hashdiff when loading data into satellites. However, this deduplication is not conceptually correct: it is not based on any operational timestamp, but solely on the content of the data (hashdiff), so which record survives is effectively arbitrary.

Example:

Imagine a DV process loading a satellite X that contains two fields: a status, representing the status of entity X, and a modified_timestamp, representing the modification timestamp of the record in the source system.

This satellite contains the following records for entity X:

Record 2 is the most recent record based on the operational timestamp. However, with our current implementation, Record 1 would be the one loaded into the satellite.
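To make the difference concrete, here is a minimal sketch of the two selection strategies. It is not the diepvries code: the sample rows, the MD5-based hashdiff recipe and the function names are all assumptions made for illustration.

```python
import hashlib
from datetime import datetime

# Hypothetical staging rows for entity X (illustrative values only).
records = [
    {"status": "CREATED", "modified_timestamp": datetime(2021, 1, 1, 10, 0)},
    {"status": "DELIVERED", "modified_timestamp": datetime(2021, 1, 1, 12, 0)},
]


def hashdiff(record):
    """MD5 over the concatenated field values (an assumed hashdiff recipe)."""
    payload = "|".join([record["status"], record["modified_timestamp"].isoformat()])
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def pick_by_hashdiff(rows):
    """Content-based choice: the survivor depends only on the hash values."""
    return min(rows, key=hashdiff)


def pick_by_operational_timestamp(rows):
    """Timestamp-based choice: the most recently modified row survives."""
    return max(rows, key=lambda r: r["modified_timestamp"])


latest = pick_by_operational_timestamp(records)
by_hash = pick_by_hashdiff(records)
print(latest["status"], by_hash["status"])
```

Ordering by hashdiff is deterministic, but the winner is decided by hash values that carry no business meaning, so it only matches the most recent record by coincidence.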

Solution

This deduplication strategy was reverted, letting the Data Vault process fail if duplicates are detected.
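As a rough illustration of "fail if duplicates are detected", the check below raises as soon as two staging rows share the same satellite key. This is only a sketch: the function name and the hashkey field are hypothetical, and in practice the failure may simply come from the database's primary key constraint.

```python
from collections import Counter


def check_no_duplicates(staging_rows):
    """Raise if two staging rows share the same (hashkey, r_timestamp)."""
    counts = Counter((row["hashkey"], row["r_timestamp"]) for row in staging_rows)
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"Duplicate satellite keys in staging: {duplicates}")


# Hypothetical staging rows: same key, different content.
staging_rows = [
    {"hashkey": "x1", "r_timestamp": "2021-01-01T00:00:00", "status": "CREATED"},
    {"hashkey": "x1", "r_timestamp": "2021-01-01T00:00:00", "status": "DELIVERED"},
]

try:
    check_no_duplicates(staging_rows)
except ValueError as error:
    print(error)
```

Failing loudly leaves the decision about which record is correct to whoever owns the source data, instead of resolving it silently during the load.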

dlouseiro commented 3 years ago

I don't see a use-case for which we should revert this deduplication; it makes sense to me! Besides, shouldn't the records in your example have the same hashdiff? If they are different, both records would be inserted 🤔, hence not a problem.

With reverting it my main concern is that by letting it fail you have no control over the resolution of the bug (in another system).

These are two records in the same staging table, hence with the same r_timestamp, so we cannot insert both given that the primary key would be violated.

If the records had the same hashdiff, it would mean that both rows contain exactly the same data, so the load would not fail: they would be deduplicated by the DISTINCT.
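Both cases can be reproduced with a small sketch. It uses an in-memory SQLite database as a stand-in for the real warehouse, and the table layout (hashkey, r_timestamp, hashdiff) is a simplified assumption, not the actual diepvries schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (hashkey TEXT, r_timestamp TEXT, hashdiff TEXT)")
conn.execute(
    "CREATE TABLE satellite (hashkey TEXT, r_timestamp TEXT, hashdiff TEXT, "
    "PRIMARY KEY (hashkey, r_timestamp))"
)


def load(rows):
    """Reload staging, then insert its DISTINCT rows into the satellite."""
    conn.execute("DELETE FROM staging")
    conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO satellite SELECT DISTINCT * FROM staging")
        return "loaded"
    except sqlite3.IntegrityError:
        return "primary key violated"


# Identical rows collapse under DISTINCT, so the load succeeds.
same_hashdiff = load([("x1", "t1", "aaa"), ("x1", "t1", "aaa")])
# Same key but different hashdiff: DISTINCT keeps both rows and the insert fails.
different_hashdiff = load([("x2", "t1", "aaa"), ("x2", "t1", "bbb")])
print(same_hashdiff, different_hashdiff)
```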

I did not understand this part: "With reverting it my main concern is that by letting it fail you have no control over the resolution of the bug (in another system)."