Shopify / ghostferry

The swiss army knife of live data migrations
https://shopify.github.io/ghostferry
MIT License

allow checksumtable verification during cutover only when overallstat… #255

Closed Manan007224 closed 3 years ago

Manan007224 commented 3 years ago

Currently we allow running the verification via the web UI if and only if Ghostferry is in the wait-for-cutover or done phase. We should not allow ChecksumTable verification in the wait-for-cutover phase, because that phase means the binlog streamer is still running and neither the source nor the target DB is read-only. Given that writes might still be happening to the source and target, ChecksumTable verification will definitely fail.

This PR makes the ChecksumTable verification available only when Ghostferry is in the done phase.
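The gating logic can be sketched roughly as follows. This is a minimal illustration, not Ghostferry's actual API: the state names and the `canRunChecksumTable` helper are hypothetical.

```go
package main

import "fmt"

// State is a hypothetical enum mirroring Ghostferry's run phases;
// the real project's identifiers may differ.
type State int

const (
	StateCopying State = iota
	StateWaitingForCutover
	StateDone
)

// canRunChecksumTable returns true only in the done phase, which is
// the behaviour this PR proposes for the web UI's verification action.
func canRunChecksumTable(s State) bool {
	return s == StateDone
}

func main() {
	fmt.Println(canRunChecksumTable(StateWaitingForCutover)) // false
	fmt.Println(canRunChecksumTable(StateDone))              // true
}
```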

shuhaowu commented 3 years ago

This is not correct. One common mode of operation is to turn the source database read-only during "wait-for-cutover". This would allow you to run the checksum table verifier and get a "correct" result.

Manan007224 commented 3 years ago

> This is not correct. One common mode of operation is to turn the source database read-only during "wait-for-cutover". This would allow you to run the checksum table verifier and get a "correct" result.

Although the source database would be read-only during "wait-for-cutover", this doesn't mean that the target database would no longer receive writes. The reason is:

To conclude the above points: we can only be sure that the binlog-streamer and binlog-writer have stopped in the done phase. Since both can still be running during the wait-for-cutover phase, we can't guarantee that no writes are happening to the target database, and hence the ChecksumTable verifier can fail.
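As a toy illustration of why an in-flight write breaks the verification: a table checksum folds every row into one value, so a single binlog event applied to the target after the source was checksummed changes the result. This sketch is not MySQL's actual `CHECKSUM TABLE` algorithm, just a CRC stand-in to show the failure mode.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// checksumRows is a toy stand-in for CHECKSUM TABLE: it folds every row
// into a single CRC, so any row added or changed flips the result.
func checksumRows(rows []string) uint32 {
	h := crc32.NewIEEE()
	for _, r := range rows {
		h.Write([]byte(r))
		h.Write([]byte{0}) // row separator
	}
	return h.Sum32()
}

func main() {
	source := []string{"id=1,name=a", "id=2,name=b"}
	target := append([]string(nil), source...)

	// A binlog event lands on the target after the source was checksummed.
	target = append(target, "id=3,name=c")

	fmt.Println(checksumRows(source) == checksumRows(target)) // false
}
```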

shuhaowu commented 3 years ago

You're right in the absolute/mathematical sense: we cannot be sure that the binlog streamer and writer are done until the done phase. After all, the OS scheduler could forever fail to schedule those goroutines and enter some sort of stuttering scenario such that the writes are forever buffered within Ghostferry. This, however, doesn't happen in practice. Generally, the binlog streamer and writer are done only a few seconds/minutes after you set the source db to be read-only. This is easy to verify by comparing the binlog streamer lag reported by Ghostferry with the amount of time that has passed since you made the DB read-only. After this verification, you can run the verifier at this stage, which should give you a "correct" result.

Additionally, there's an advantage to running this way: when Ghostferry is in wait-for-cutover, binlog streaming is technically not terminated (as you mentioned), even though in practice there should be no events. This allows you to "resume" streaming the binlog should you decide to abort the cutover, without having to go through interrupt and resume.

We should assume that the users of Ghostferry understand how Ghostferry works, and this "advanced" usage of Ghostferry should not be artificially restricted simply due to a theoretical race condition that is unlikely to be observed in practice.

Now, with all that said, we can question whether we named the states correctly, or whether we should introduce extra states within Ghostferry to make all of this clearer. Right now, it's common with copydb for cutover to happen during the wait-for-cutover stage (which includes things like setting the source to read-only, flipping the application to the new database, etc.). Clicking "Allow automatic cutover" in the UI is simply a synonym for quitting Ghostferry. This causes some confusion, even for me, and I'm interested in a way to refactor that part of the code.