Currently, when a physical backup is performed, the journal segment is switched from N to N+1 at the backup start, so that the backup file is guaranteed to contain only data up to and including sequence N. However, a long-running writeable transaction may already have some of its changes stored in segments <= N while its commit event lands in some later segment. After re-initialization on the replica side, replication continues with segment N+1, so (a) those older changes are lost and (b) the error "Transaction X is not found" usually follows. This means the replica is inconsistent and must be re-initialized again. If the primary is under high load, this can happen over and over.
The solution is not to delete segments <= N immediately, but instead to scan them to find the transactions still active at the end of segment N, calculate the new replication OAT (oldest active transaction), delete everything below the OAT, replay the journal (active transactions only) starting with the OAT, and then proceed normally with segment N+1 and beyond.
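For illustration, here is a minimal, self-contained C++ sketch of the proposed recovery sequence. The record layout and all names (`Record`, `Op`, `startSegment`, etc.) are invented for this example and do not correspond to the actual replication code:

```cpp
// Toy model of the proposed recovery after replica re-initialization.
// All structures here are hypothetical, not real Firebird internals.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <set>
#include <vector>

using TraNumber = std::uint64_t;
using SegmentNo = std::uint64_t;

enum class Op { Start, Change, Commit };

struct Record {
    SegmentNo segment;  // journal segment the record lives in
    TraNumber txn;      // transaction it belongs to
    Op op;
};

int main()
{
    const SegmentNo n = 3; // backup switched the journal from N=3 to N+1=4

    // Contents of the retained segments <= N: transaction 10 committed
    // inside segment 2; transaction 11 is long-running and still active
    // at the end of segment N.
    std::vector<Record> journal = {
        {1, 10, Op::Start}, {1, 11, Op::Start},
        {2, 10, Op::Change}, {2, 10, Op::Commit},
        {3, 11, Op::Change},
    };

    // 1. Scan segments <= N for transactions still active at the end of N.
    std::map<TraNumber, SegmentNo> startSegment;
    std::set<TraNumber> active;
    for (const Record& r : journal) {
        if (r.segment > n) continue;
        if (r.op == Op::Start) {
            startSegment[r.txn] = r.segment;
            active.insert(r.txn);
        } else if (r.op == Op::Commit) {
            active.erase(r.txn);
        }
    }

    // 2. The new replication OAT is the oldest still-active transaction;
    //    replay must begin at the segment holding its first record.
    SegmentNo oatSegment = n + 1; // nothing to replay if no active txns
    for (TraNumber txn : active)
        oatSegment = std::min(oatSegment, startSegment[txn]);

    // 3. Segments below the OAT's segment can be deleted: everything they
    //    contain is already present in the backup file.
    std::cout << "delete segments < " << oatSegment << "\n";

    // 4. Replay segments [oatSegment .. N], applying only the changes
    //    belonging to the still-active transactions.
    for (const Record& r : journal)
        if (r.segment >= oatSegment && r.segment <= n && active.count(r.txn))
            std::cout << "replay txn " << r.txn
                      << " from segment " << r.segment << "\n";

    // 5. Proceed normally with segment N+1 and beyond.
    std::cout << "continue with segment " << n + 1 << "\n";
}
```

With the sample journal above, transaction 11 is the only active one, so the OAT points at segment 1: nothing is deleted, transaction 11's records in segments 1 and 3 are replayed, and replication then continues with segment 4.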