Exclude more recent records from satellites based on hashkey or driving key

dlouseiro commented 1 year ago

The purpose of this PR is to adapt the code of regular and effectivity satellites so that records in the staging table are ignored only when more recent records exist in the target table for the same hashkey (or the same driving key, in the case of effectivity satellites).

The previous version of the code ignores records from the staging table whenever any newer record exists in the target satellite, regardless of whether that record has the same key, which causes issues when running processes in parallel.

The requirement still stands: we do not intend to load "older versions" of records when newer ones already exist in the target satellite. However, this only applies to records with overlapping keys.

Example of a capturing process that can cause issues:

Capturing process foo fetches data from endpoint foo and populates two tables: one link l_foo_bar and one satellite ls_foo_bar.
The source endpoint allows one to pass a given instance of bar as an argument (let's assume two possible instances, bar1 and bar2).
The capturing is triggered in parallel for bar1 and bar2 (let's call them execution 1 and execution 2)
Executions 1 and 2 will populate the same tables, but with mutually exclusive records (execution 1 will only populate records in l_foo_bar with bar_id = bar1 and execution 2 will only populate records in l_foo_bar with bar_id = bar2).
With the current version of the code, if execution 2 starts 1 millisecond after execution 1, but execution 2 is the first one populating ls_foo_bar, all records from execution 1 will be ignored, causing data loss.
With the new version of the code, no record will be excluded, as the two executions touch different l_foo_bar_hashkey values.
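
To make the difference concrete, here is a minimal Python sketch of the two filtering rules applied to the scenario above. It is an illustration only, not the code changed in this PR: the function names, the `hashkey`, `load_ts` and `payload` fields, and the choice to keep records whose timestamp equals the newest one in the target are all assumptions made for the example. For an effectivity satellite, the same idea would apply with the driving key passed as `key`.

```python
from datetime import datetime, timedelta


def filter_global(staging, target):
    """Old rule (as described above): drop every staging record that is
    older than the newest record in the target, whatever its key."""
    if not target:
        return list(staging)
    newest = max(row["load_ts"] for row in target)
    return [row for row in staging if row["load_ts"] >= newest]


def filter_per_key(staging, target, key="hashkey"):
    """New rule: drop a staging record only if the target already holds
    a more recent record for the same key (hashkey or driving key)."""
    newest_by_key = {}
    for row in target:
        current = newest_by_key.get(row[key])
        if current is None or row["load_ts"] > current:
            newest_by_key[row[key]] = row["load_ts"]
    return [
        row
        for row in staging
        if row[key] not in newest_by_key
        or row["load_ts"] >= newest_by_key[row[key]]
    ]


# Hypothetical data: execution 1 (bar1) starts first, execution 2 (bar2)
# starts 1 millisecond later but reaches the satellite first.
t0 = datetime(2024, 1, 1, 12, 0, 0)
staging_execution_1 = [{"hashkey": "hk_bar1", "load_ts": t0, "payload": "a"}]
target = [
    {
        "hashkey": "hk_bar2",
        "load_ts": t0 + timedelta(milliseconds=1),
        "payload": "b",
    }
]

# Old rule: the bar1 record is dropped because a newer bar2 record exists.
lost = filter_global(staging_execution_1, target)

# New rule: the bar1 record is kept because its key has no newer record.
kept = filter_per_key(staging_execution_1, target)
```

Under these assumptions, `lost` is empty while `kept` still contains the record from execution 1. A staging record for `hk_bar2` with a timestamp older than the one in the target would still be excluded by `filter_per_key`, which preserves the original requirement.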

dlouseiro commented 1 year ago

Let's update the changelog too ;)

True! Forgot that

dlouseiro commented 1 year ago

Done @matthieucan

PicnicSupermarket / diepvries

Exclude more recent records from satellites based on hashkey or driving key #42