PeerDB-io / peerdb

Fast, simple, and cost-effective tool to replicate data from Postgres to data warehouses, queues, and storage
https://peerdb.io

Make resync more reliable for warehouse connectors #2017

Closed Amogh-Bharadwaj closed 4 weeks ago

Amogh-Bharadwaj commented 1 month ago

Overview

Resync is becoming an increasingly utilized feature. Users of our connectors, particularly ClickHouse, sometimes make small table-definition changes on the source or target and then want to repopulate the data. This is in addition to resync's usual recovery use cases.

Failure points

RenameTables (and resync in general) can currently fail in the following ways:

  1. If a column was added to the source table between the original mirror kickoff and the resync, and no row was inserted afterwards, that column was never propagated to the target via schema changes. On resync, the _resync table and the original table then have different schemas, causing the soft-delete transfer step to fail. This can then lead to:
  2. RenameTables processes some tables but not all before hitting a failure. On retry, it attempts to resync the first table again, but that table's _resync table was already dropped when it succeeded earlier.
  3. Resyncing again midway through a resync (initial load) can result in duplicate data in the _resync table if the initial load of that _resync table has already completed.
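The retry problem in point 2 comes down to making the per-table rename step idempotent: if a table's _resync table no longer exists, a previous attempt already swapped it in, so the retry should skip it rather than fail. A minimal sketch of that guard, using a hypothetical in-memory catalog and helper names (`resyncTable`, `catalog`) that are illustrative only and not PeerDB's actual API:

```go
package main

import "fmt"

// catalog is a stand-in for the warehouse's table metadata:
// table name -> exists.
type catalog map[string]bool

// resyncTable swaps <dst>_resync into place of <dst>. If the _resync
// table is already gone, a prior attempt succeeded for this table, so
// the call is a safe no-op instead of an error (addresses failure 2).
func resyncTable(cat catalog, dst string) (string, error) {
	resync := dst + "_resync"
	if !cat[resync] {
		// Already renamed on a previous attempt; skip on retry.
		return "skipped", nil
	}
	// Drop the stale original and rename _resync into its place.
	delete(cat, dst)
	delete(cat, resync)
	cat[dst] = true
	return "renamed", nil
}

func main() {
	cat := catalog{"orders": true, "orders_resync": true, "users": true}
	for _, t := range []string{"orders", "users"} {
		action, _ := resyncTable(cat, t)
		fmt.Printf("%s: %s\n", t, action)
	}
	// Retrying "orders" after it succeeded is now harmless.
	action, _ := resyncTable(cat, "orders")
	fmt.Println("orders retry:", action)
}
```

The same skip-if-missing check makes a mid-run failure recoverable: rerunning the whole loop redoes only the tables whose _resync tables still exist.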

Fixes

In light of these scenarios, this PR puts in place the following guards:

TODO: