Currently, when a physical backup is performed, the journal segment is switched from N to N+1 at the backup start, so that the backup file is guaranteed to contain only data up to and including sequence N. However, a long-running writeable transaction may already have some of its changes stored in segments <= N while its commit event lands in some later segment. After re-initialization on the replica side, replication continues with segment N+1, so (a) those older changes are lost and (b) the error "Transaction X is not found" usually follows. This means the replica is inconsistent and must be re-initialized again. If the primary is under high load, this can happen over and over.
The solution is not to delete segments <= N immediately, but instead to scan them to find the transactions still active at the end of segment N, calculate the new replication OAT (oldest active transaction), delete everything below the OAT, replay the journal (active transactions only) starting with the OAT, and then proceed normally with segment N+1 and beyond.
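For illustration, here is a minimal, self-contained C++ sketch of the proposed recovery sequence. The record layout and all names (`Record`, `Op`, `startSegment`, etc.) are invented for this example and do not correspond to the actual replication code:

```cpp
// Toy model of the proposed recovery after replica re-initialization.
// All structures here are hypothetical, not real Firebird internals.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <set>
#include <vector>

using TraNumber = std::uint64_t;
using SegmentNo = std::uint64_t;

enum class Op { Start, Change, Commit };

struct Record {
    SegmentNo segment;  // journal segment the record lives in
    TraNumber txn;      // transaction it belongs to
    Op op;
};

int main()
{
    const SegmentNo n = 3; // backup switched the journal from N=3 to N+1=4

    // Contents of the retained segments <= N: transaction 10 committed
    // inside segment 2; transaction 11 is long-running and still active
    // at the end of segment N.
    std::vector<Record> journal = {
        {1, 10, Op::Start}, {1, 11, Op::Start},
        {2, 10, Op::Change}, {2, 10, Op::Commit},
        {3, 11, Op::Change},
    };

    // 1. Scan segments <= N for transactions still active at the end of N.
    std::map<TraNumber, SegmentNo> startSegment;
    std::set<TraNumber> active;
    for (const Record& r : journal) {
        if (r.segment > n) continue;
        if (r.op == Op::Start) {
            startSegment[r.txn] = r.segment;
            active.insert(r.txn);
        } else if (r.op == Op::Commit) {
            active.erase(r.txn);
        }
    }

    // 2. The new replication OAT is the oldest still-active transaction;
    //    replay must begin at the segment holding its first record.
    SegmentNo oatSegment = n + 1; // nothing to replay if no active txns
    for (TraNumber txn : active)
        oatSegment = std::min(oatSegment, startSegment[txn]);

    // 3. Segments below the OAT's segment can be deleted: everything they
    //    contain is already present in the backup file.
    std::cout << "delete segments < " << oatSegment << "\n";

    // 4. Replay segments [oatSegment .. N], applying only the changes
    //    belonging to the still-active transactions.
    for (const Record& r : journal)
        if (r.segment >= oatSegment && r.segment <= n && active.count(r.txn))
            std::cout << "replay txn " << r.txn
                      << " from segment " << r.segment << "\n";

    // 5. Proceed normally with segment N+1 and beyond.
    std::cout << "continue with segment " << n + 1 << "\n";
}
```

With the sample journal above, transaction 11 is the only active one, so the OAT points at segment 1: nothing is deleted, transaction 11's records in segments 1 and 3 are replayed, and replication then continues with segment 4.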