Closed Manan007224 closed 3 years ago
This is not correct. One common mode of operations is to turn the source database to be read-only during "wait-for-cutover". This would allow you to run the checksum table verifier and get a "correct" result.
Although the source database would be read-only during "wait-for-cutover", this doesn't mean that the target database no longer receives writes. The reasons are:
The binlog-writer stops only once the binlog-streamer has stopped running - https://github.com/Shopify/ghostferry/blob/master/ferry.go#L734-L742. The binlog-streamer only stops streaming once copydb instructs it to stop - https://github.com/Shopify/ghostferry/blob/master/copydb/copydb.go#L106.
At the point where copydb instructs the binlog-streamer to stop, ghostferry is already in the cutover phase and not in wait-for-cutover - https://github.com/Shopify/ghostferry/blob/master/ferry.go#L734-L742
From the points above, we can be sure that the binlog-streamer and binlog-writer have stopped only in the done phase. Since the binlog-writer and binlog-streamer can still be running during the wait-for-cutover phase, we can't guarantee that no writes are happening to the target database, and hence the ChecksumTable verifier can fail.
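The ordering described above can be illustrated with a toy model. This is a minimal sketch and not Ghostferry's actual code: the function name runPipeline, the buffered channel, and the string events are all hypothetical. It only shows that a writer which drains a buffer keeps applying events until the streamer side has stopped and closed the channel, so events already read from the source can still reach the target afterwards.

```go
package main

import (
	"fmt"
	"sync"
)

// runPipeline is a toy model (hypothetical, not Ghostferry's API).
// A "streamer" forwards events into a buffered channel and then stops;
// a "writer" goroutine applies events until that channel is closed.
// It returns how many events the writer applied to the "target".
func runPipeline(events []string) int {
	buffered := make(chan string, len(events))
	var wg sync.WaitGroup
	applied := 0

	// Writer: only returns once the streamer has closed the channel
	// and every buffered event has been drained.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for range buffered {
			applied++
		}
	}()

	// Streamer: forwards everything it has read, then stops.
	for _, e := range events {
		buffered <- e
	}
	close(buffered)

	// Only after the writer has finished can we say the target is quiet.
	wg.Wait()
	return applied
}

func main() {
	fmt.Println(runPipeline([]string{"insert", "update", "delete"}))
}
```

In this model, the only point at which the caller knows that no further target writes will happen is after wg.Wait() returns, which corresponds to the argument that only the done phase gives that guarantee.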
You're right in the absolute/mathematical sense: we cannot be sure that the binlog streamer and writer are done until the done phase. After all, the OS scheduler could, in theory, never schedule those goroutines, leading to some sort of stuttering scenario in which the writes stay buffered within Ghostferry forever. This, however, doesn't happen in practice. The binlog streamer and writer are typically done only a few seconds/minutes after you set the source db to be read-only. This is easy to verify by comparing the binlog streamer lag reported by Ghostferry with the amount of time that has passed since you made the DB read-only. After this verification, you can run the verifier at this stage, which should give you a "correct" result.
Additionally, there's an advantage to running this way: when Ghostferry is in wait-for-cutover, binlog streaming is technically not terminated (as you mentioned), even though in practice there should be no events. This allows you to "resume" streaming the binlog should you decide to abort the cutover, without having to go through interrupt and resume.
We should assume that the users of Ghostferry understand how Ghostferry works, and this "advanced" usage of Ghostferry should not be artificially restricted simply due to a theoretical race condition that is unlikely to be observed in practice.
Now, with all that said, we can question whether we named the states correctly, or whether we should introduce extra states within Ghostferry to make all of this clearer. Right now, with copydb, it's common for the cutover to happen during the wait-for-cutover stage (which includes things like setting the source to read-only, flipping the application to the new database, etc). Clicking "Allow automatic cutover" in the UI is simply a synonym for quitting Ghostferry. This causes some confusion, even for me, and I'm interested in a way to refactor that part of the code.
Currently we allow running the verification via webui if and only if ghostferry is in the wait-for-cutover or done phase. We should not allow ChecksumTable verification if we're in wait-for-cutover, because the wait-for-cutover phase means that the binlog-streamer is still running and neither the source nor the target DB is read-only. Given that writes might be happening to the source and target, ChecksumTable verification will definitely fail.
This PR makes the ChecksumTable verification available only if ghostferry is in the done phase.
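The two gating rules under discussion can be sketched side by side. This is only an illustration under stated assumptions: the Phase type, its constants, and both function names are hypothetical and are not taken from Ghostferry's code; only the phase names themselves come from the thread.

```go
package main

import "fmt"

// Phase is a hypothetical type for this sketch; the string values are
// the phase names used in the discussion.
type Phase string

const (
	PhaseWaitForCutover Phase = "wait-for-cutover"
	PhaseCutover        Phase = "cutover"
	PhaseDone           Phase = "done"
)

// canVerifyCurrent models the behaviour described as current:
// verification is allowed in wait-for-cutover or done.
func canVerifyCurrent(p Phase) bool {
	return p == PhaseWaitForCutover || p == PhaseDone
}

// canVerifyProposed models the rule this PR proposes:
// verification is allowed only in done.
func canVerifyProposed(p Phase) bool {
	return p == PhaseDone
}

func main() {
	for _, p := range []Phase{PhaseWaitForCutover, PhaseCutover, PhaseDone} {
		fmt.Printf("%-16s current=%v proposed=%v\n",
			p, canVerifyCurrent(p), canVerifyProposed(p))
	}
}
```

The only phase where the two rules differ is wait-for-cutover, which is exactly the case debated above: the stricter rule removes a theoretical race, while the looser one keeps the option of verifying after the source has been made read-only.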