Currently, if the checksum fails or the process is killed while the checksum is running, everything has to restart from scratch. Since checksums typically take about 10% of copier time, we might be talking about a day of progress lost.
A potentially worse issue is that the checksum runs in a repeatable read transaction, which could be a day old by the time it finishes. A transaction that old holds back the purge process, so old row versions pile up for as long as the checksum runs.
I propose that instead of running a single checksum pass, which consists of:
Acquire lock
Flush changes
Start consistent snapshots (x N)
Release lock
.. we repeat this process every N hours, so that no snapshot is ever more than N hours old. The regrettable downside is that the lock is taken more often, but on some systems a transaction that runs for 24+ hours can bring everything to a grinding halt.
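
To make the proposal concrete, here is a rough sketch of what a windowed checksum loop might look like. It is not the existing implementation: `ChecksumRun`, `open_window`, `next_chunk` and the chunk list are all hypothetical names, the lock/flush/snapshot steps are only recorded in a log rather than executed, and the clock is injected so the behaviour can be simulated. It assumes the checksum can be split into ordered chunks and that progress is checkpointed after each chunk.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChecksumRun:
    """Hypothetical progress tracker: lets a new window resume, not restart."""
    chunks: List[str]               # assumed: ordered chunk identifiers
    window_seconds: float           # the "N hours" from the proposal
    clock: Callable[[], float]      # injected so the sketch can be simulated
    next_chunk: int = 0             # checkpoint: first chunk not yet verified
    log: List[str] = field(default_factory=list)

    def open_window(self) -> float:
        # The four steps from the list above, recorded instead of executed.
        self.log += ["acquire lock", "flush changes",
                     "start snapshots", "release lock"]
        return self.clock()

    def run(self, checksum_chunk: Callable[[str], None]) -> int:
        """Checksum the remaining chunks; return how many windows were opened."""
        windows = 0
        while self.next_chunk < len(self.chunks):
            started = self.open_window()
            windows += 1
            while self.next_chunk < len(self.chunks):
                checksum_chunk(self.chunks[self.next_chunk])
                # Checkpoint only after the chunk has succeeded.
                self.next_chunk += 1
                # Checking the age after the chunk guarantees that every
                # window makes progress, even with a very small N.
                if self.clock() - started >= self.window_seconds:
                    break  # snapshot is too old: open a fresh window
        return windows
```

If `checksum_chunk` raises, `next_chunk` still points at the failed chunk, so calling `run` again opens a new window and continues from there instead of starting over. Whether chunks verified in an earlier window can still be trusted after the snapshot is replaced is a design question this sketch does not try to answer.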