MeltanoLabs / Singer-Working-Group

Working group for ongoing development and iteration of the Singer Spec, the de-facto protocol for open source data connectors. Please use "Issues" to create discussion items - or use "Discussions" for general questions.

Duplicate-Proof Incremental Replication #13

Open aaronsteers opened 2 years ago

aaronsteers commented 2 years ago

As part of the Singer spec, and as a best practice to avoid records being skipped, taps filter records with a greater-than-or-equal-to comparison against the bookmarked replication key value. The "or equal to" part of the comparison is confusing to new users, but the reason for it is important: it ensures that record "ties" are never omitted from the final target.

The downside of this guarantee that Singer will not lose any records is that ties will create duplicates in the downstream target unless they are deduplicated or merged at the target level based on a primary key mechanism. For example, a record whose replication key exactly equals the saved bookmark will be re-emitted on the next run even though it was already loaded.
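To illustrate the trade-off with some made-up data (the timestamps and records below are purely hypothetical), a strict greater-than filter silently drops a not-yet-loaded record that ties with the bookmark, while greater-than-or-equal-to keeps it but re-emits the already-loaded one:

```python
bookmark = "2023-05-01T00:00:00Z"  # replication-key value saved in STATE after the last run

# Two records share the bookmark timestamp; only the first was loaded last run.
source_rows = [
    {"id": 1, "updated_at": "2023-05-01T00:00:00Z"},  # already loaded
    {"id": 2, "updated_at": "2023-05-01T00:00:00Z"},  # a "tie" missed by the last run
    {"id": 3, "updated_at": "2023-05-02T00:00:00Z"},  # new since the last run
]

# Strict ">" comparison: record 2 is silently lost.
print([r["id"] for r in source_rows if r["updated_at"] > bookmark])   # [3]

# ">=" comparison (the Singer best practice): nothing is lost,
# but record 1 is emitted a second time, creating a duplicate downstream.
print([r["id"] for r in source_rows if r["updated_at"] >= bookmark])  # [1, 2, 3]
```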

Existing mitigations

The current mitigations for this problem are to (1) use primary-key upserts on the target side, which naturally resolves the duplication problem, or (2) use a tool such as dbt to remove duplicates downstream.

Proposed improvement

  1. Continue, as today, to use "greater than or equal to" logic.
  2. Add a new record_hashes_seen property (or similar) to the STATE dictionary: a list of hashes of the records whose replication key equals the max replication key value.
    • As each record is parsed during a stream, if its value is greater than or equal to the current max replication key value (which will be true for every record if the stream is sorted), we store a hash of the record in STATE alongside the max replication key value.
  3. On the next sync, if the hash of a record exactly matches a hash in the set of record_hashes_seen, we can omit that record and not send it downstream.
  4. Any other ties by replication key value will be emitted downstream to the target, provided their hash has not yet been seen/sent.
  5. When writing out STATE messages, only the latest "ties" need to be included in record_hashes_seen. This is not a cumulative list of all records, only the latest ties by replication key. (A sketch of this flow follows the list.)
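A minimal sketch of this flow, assuming a sorted (or mostly sorted) stream and JSON-serializable records. Aside from record_hashes_seen and replication_key_value, the function and variable names below are illustrative only and not part of the spec or SDK:

```python
import hashlib
import json
from typing import Any, Dict, Iterable, Iterator


def record_hash(record: Dict[str, Any]) -> str:
    """Return a stable hash of the record's canonical JSON serialization."""
    payload = json.dumps(record, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def dedupe_incremental_sync(
    records: Iterable[Dict[str, Any]],
    replication_key: str,
    state: Dict[str, Any],
) -> Iterator[Dict[str, Any]]:
    """Yield records at or past the bookmark, skipping ties already emitted last run."""
    bookmark = state.get("replication_key_value")
    prior_hashes = set(state.get("record_hashes_seen", []))

    max_value = bookmark
    max_value_hashes: set = set()

    for record in records:
        value = record[replication_key]
        if bookmark is not None and value < bookmark:
            continue  # strictly older than the bookmark: not re-synced

        h = record_hash(record)

        # Track hashes for every record tied at the *current* max value, whether or
        # not it is emitted, so the next run recognizes all of them as already seen.
        if max_value is None or value > max_value:
            max_value, max_value_hashes = value, {h}
        elif value == max_value:
            max_value_hashes.add(h)

        if value == bookmark and h in prior_hashes:
            continue  # a tie that was already emitted in the previous run
        yield record

    state["replication_key_value"] = max_value
    state["record_hashes_seen"] = sorted(max_value_hashes)
```

Because this is a generator, the updated STATE should only be flushed once the batch has been fully consumed. Since only ties at the latest max value are retained, record_hashes_seen stays small as long as ties are rare, which connects to the scalability point below.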

Best reasons not to build

The reasons not to build this are (1) performance, (2) complexity, and (3) scalability.

Regarding performance, hashing a record should be very fast, and the rest is a matter of tuning the caching and comparison logic. This should be tunable to reach satisfactory performance, but if not, we could also mitigate by making the behavior opt-in via a setting such as dedupe_incremental_streams (bool), or similar.
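A rough sketch of that opt-in gate, reusing the dedupe_incremental_sync helper sketched above; the function name and config shape are illustrative assumptions, not part of the spec or SDK:

```python
def sync_stream(records, replication_key, state, config):
    """Apply tie deduplication only when the (hypothetical) setting is enabled."""
    if config.get("dedupe_incremental_streams", False):
        yield from dedupe_incremental_sync(records, replication_key, state)
        return

    # Default behavior: plain ">= bookmark" filtering; duplicates remain possible on ties.
    bookmark = state.get("replication_key_value")
    for record in records:
        value = record[replication_key]
        if bookmark is None or value >= bookmark:
            yield record
            if bookmark is None or value > bookmark:
                bookmark = value
    state["replication_key_value"] = bookmark
```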

Regarding complexity, the best mitigation is to promote a shared library and/or include the logic in the SDK framework so that the code can be developed once and adopted broadly with minimal rework.

Regarding scalability, this solution should scale fine assuming a small number of "ties" by replication key value. If the number of ties is in the thousands or larger, however, this could have adverse effects on the stability of STATE messages. An option to disable the behavior via settings (as described in the previous paragraph) could prove useful for this as well. That said, presumably the larger the number of ties (within reason), the higher the value of this deduplication capability.

Regarding adherence to the Singer Spec

To my knowledge, this implementation would still adhere to the spec, since (1) the STATE behavior is entirely up to the tap to control, (2) we still apply >= logic for replication keys, and (3) we still guarantee that every record will be sent at least once.

Original Thread in SDK repo

Originally logged in the SDK project here. (No action has yet been taken.)