Shopify / ghostferry

The swiss army knife of live data migrations
https://shopify.github.io/ghostferry
MIT License

Support replicating from READ-ONLY DB #169

Open kolbitsch-lastline opened 4 years ago

kolbitsch-lastline commented 4 years ago

The current ghostferry-copydb default behavior is to set the RowLock property on the data-iterator when copying over rows from the source to the target DB. Unfortunately, the reason for this is not documented very well (as far as I could find).

This default behavior means that it is not possible to replicate from a MySQL slave DB, because such a server typically runs in READ-ONLY mode, which prevents the SELECT ... FOR UPDATE statements used by ghostferry.

My assumption is that this is to keep the data consistent between the read and the row verification. Beyond verification, it should be safe to operate without the FOR UPDATE (which also improves performance, as we avoid the round-trips for the transaction, which is always rolled back anyway), because any modifications on the source DB overlapping with a row read should be "fixed" by the binlog-writer anyway. Can you confirm my assumption is correct?

If so, would it make sense to disable RowLock when no verifier is enabled, or at least allow the row-lock to be disabled via the DataIterator constructor?
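For concreteness, a minimal Go sketch of the kind of toggle being asked about is below. CursorConfig, BuildSelect, and the id-based pagination are illustrative stand-ins, not ghostferry's exact internals:

```go
// Illustrative only: CursorConfig and BuildSelect are hypothetical stand-ins
// for the data-iterator internals, shown to make the proposed toggle concrete.
package main

import (
	"fmt"
	"strings"
)

type CursorConfig struct {
	Table     string
	Columns   []string
	BatchSize uint64
	RowLock   bool // proposed: allow callers to turn off SELECT ... FOR UPDATE
}

// BuildSelect renders the batch query used to copy rows from the source.
func (c CursorConfig) BuildSelect(lastSuccessfulPK uint64) string {
	q := fmt.Sprintf(
		"SELECT %s FROM %s WHERE id > %d ORDER BY id LIMIT %d",
		strings.Join(c.Columns, ", "), c.Table, lastSuccessfulPK, c.BatchSize,
	)
	if c.RowLock {
		// Locks the selected rows until the copy transaction ends; this is
		// the part that fails against a READ-ONLY replica.
		q += " FOR UPDATE"
	}
	return q
}

func main() {
	c := CursorConfig{Table: "products", Columns: []string{"id", "title"}, BatchSize: 200, RowLock: true}
	fmt.Println(c.BuildSelect(0))
}
```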

shuhaowu commented 4 years ago

We definitely use Ghostferry from a read-only replica internally and have never encountered any issues. In fact, FOR UPDATE is necessary even on a read replica, as there is still write traffic due to the replication.

RowLock is definitely needed for data correctness purposes, not just for verification. As I recall, the proof of why omitting it can cause data corruption is not intuitive. Since this is a common question asked by many people, I can try to find a "prose proof" that I wrote a long time ago in our internal Ghostferry documents and paste it here in the near future.

shuhaowu commented 4 years ago

I just reran TLC against the TLA+ specification after making the following change and regenerating the TLA+ from the PlusCal:

--- a/tlaplus/ghostferry.tla
+++ b/tlaplus/ghostferry.tla
@@ -285,8 +285,8 @@ PossibleRecords == Records \cup {NoRecordHere}
       \* but this seems cumbersome and prone to implementation level error.
       \* TODO: model this with TLA+ and validate its correctness.
       tblit_loop:  while (lastSuccessfulPK < MaxPrimaryKey) {
-      tblit_rw:      currentRow := SourceTable[lastSuccessfulPK + 1];
-                     if (currentRow # NoRecordHere) {
+      tblit_r:       currentRow := SourceTable[lastSuccessfulPK + 1];
+      tblit_w:       if (currentRow # NoRecordHere) {
                        TargetTable[lastSuccessfulPK + 1] := currentRow;
                      };
       tblit_upkey:   lastSuccessfulPK := lastSuccessfulPK + 1;

Each label in PlusCal is an atomic step. If the read and the write during the data copy are not in one atomic step, they end up in 2 labels as above. This simulates a scenario where FOR UPDATE is turned off. As a result, TLC reports an invariant violation. It takes 26 steps in the model to reach this error condition, with a runtime of about 2 minutes on my computer.
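For readers less familiar with PlusCal, the same split can be restated as the Go-flavored sketch below (all names are illustrative, not ghostferry's actual code); the important part is the window between the read and the write:

```go
package sketch

// Row is a stand-in for a database row in this illustration.
type Row struct{ Values string }

// NoRecordHere mirrors the marker for a missing row in the TLA+ model.
var NoRecordHere = Row{}

// copyTable mirrors tblit_loop with the read and the write as separate,
// non-atomic steps (the split tblit_r / tblit_w labels).
func copyTable(sourceTable, targetTable []Row, maxPrimaryKey int) {
	lastSuccessfulPK := 0
	for lastSuccessfulPK < maxPrimaryKey { // tblit_loop
		// tblit_r: read the row with primary key lastSuccessfulPK+1
		// (index lastSuccessfulPK in these 0-based slices).
		currentRow := sourceTable[lastSuccessfulPK]

		// <-- race window: without FOR UPDATE, nothing stops a concurrent
		//     write to this source row here, nor its binlog event from being
		//     applied to the target before the write below.

		// tblit_w: write the (possibly stale) row to the target.
		if currentRow != NoRecordHere {
			targetTable[lastSuccessfulPK] = currentRow
		}

		// tblit_upkey: advance the cursor.
		lastSuccessfulPK++
	}
}
```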

At one point I had a prose explanation of why this is the case, but it's been many years and I have since forgotten it. Unfortunately, the TLC trace showing how the error can be reproduced is a bit daunting and difficult to parse. I don't have time for it right now, so I'm going to paste it here verbatim. In the future I might post a better explanation by parsing the TLC output.

tlc-no-for-update-trace.txt

The key lines are at the 26th step, where the Target and Source tables have different entries:

/\ pc = ( Ferry :> "Done" @@
  Application :> "Done" @@
  BinlogStreamer :> "Done" @@
  TableIterator :> "Done" @@
  BinlogWriter :> "Done" )
....
/\ TargetTable = <<r0, r1, NoRecordHere>>
....
/\ SourceTable = <<r0, r0, NoRecordHere>>

kolbitsch-lastline commented 4 years ago

> We definitely use Ghostferry from a read-only replica internally and have never encountered any issues. In fact, FOR UPDATE is necessary even on a read replica, as there is still write traffic due to the replication.

that is very interesting. The SELECT ... FOR UPDATE definitely raises exceptions in our environment.

Could it be that you are connecting to the source DB using a user with the SUPER privilege? Such users should always have the ability to lock rows for writing (and actually even to write) on the DB.

shuhaowu commented 4 years ago

I just checked and yes we have SUPER.

kolbitsch-lastline commented 4 years ago

ok, thanks for confirming.

I'll have to think about this for a bit. I dislike giving our replication user SUPER on the source.

shuhaowu commented 4 years ago

The permissions you need are:

"SELECT", "SUPER", "REPLICATION SLAVE", "REPLICATION CLIENT"

kolbitsch-lastline commented 4 years ago

good to know - would be good to include in the docs somewhere.

NOTE: For me this poses a problem, as we're using GCP and Cloud SQL does not support SUPER:

https://cloud.google.com/sql/faq

> Cloud SQL does not support SUPER privileges, which means that GRANT ALL PRIVILEGES statements will not work. As an alternative, you can use GRANT ALL ON %.*.

I'll see what I can do in our environment

xliang6 commented 4 years ago

> My assumption is that this is to keep the data consistent between the read and the row verification. Beyond verification, it should be safe to operate without the FOR UPDATE (which also improves performance, as we avoid the round-trips for the transaction, which is always rolled back anyway), because any modifications on the source DB overlapping with a row read should be "fixed" by the binlog-writer anyway. Can you confirm my assumption is correct?

As @shuhaowu pointed out above, SELECT ... FOR UPDATE is needed to maintain data correctness, not just for row verification. Let's take a look at a possible sequence of actions that introduces data corruption with a plain SELECT. We use a notation similar to the one introduced in another ghostferry issue. We denote the database row values with r0 and r1, where r0 represents one set of values and r1 represents a different set of values. We define a "payload" as either an argument that an action is given (e.g. INSERT r1 INTO ...) or a result that an action receives (e.g. SELECT FROM ... -> r1).

| Step | Actor | Action | Payload | Source | Target |
|------|-------|--------|---------|--------|--------|
| 1 | Application | INSERT (SOURCE) | r1 | r1 | nil |
| 2 | DataIterator | BEGIN | N/A | r1 | nil |
| 3 | DataIterator | SELECT | r1 | r1 | nil |
| 4 | Application | UPDATE (SOURCE) | r0 | r0 | nil |
| 5 | BinlogWriter | UPDATE | r1->r0 | r0 | nil |
| 6 | BatchWriter | BEGIN | N/A | r0 | nil |
| 7 | BatchWriter | INSERT IGNORE | r1 | r0 | r1 |
| 8 | BatchWriter | ROLLBACK | N/A | r0 | r1 |
| 9 | DataIterator | ROLLBACK | N/A | r0 | r1 |

As we can see in step 4 above, the application is allowed to update the row selected in step 3 because FOR UPDATE is not enforced. A binlog UPDATE event is then created based on step 4. However, step 5 silently fails to update the target DB, because the event must perform r1->r0 and the old row value r1 does not exist in the target DB yet. Step 7 then writes the old, stale row value r1 (obtained in step 3) to the target DB. At this point, the target DB contains r1, which differs from what is in the source DB, namely r0. If a verifier is enabled, it will detect the data inconsistency and fail the ghostferry run. Hope this helps, and please feel free to ask if you have any questions.
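The same interleaving can also be replayed by hand. Below is a rough, self-contained Go sketch (not ghostferry code): the DSNs are placeholders, a table t(id INT PRIMARY KEY, val VARCHAR(16)) is assumed to exist empty on both databases, and the BinlogWriter/BatchWriter are simulated with plain statements:

```go
// Rough sketch only: replays the sequence above with FOR UPDATE omitted.
// DSNs are placeholders, the table t(id, val) must already exist (empty) on
// both databases, and the BinlogWriter/BatchWriter are simulated with plain
// statements rather than ghostferry's actual components.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func mustExec(db *sql.DB, query string, args ...interface{}) {
	if _, err := db.Exec(query, args...); err != nil {
		log.Fatalf("%s: %v", query, err)
	}
}

func main() {
	source, err := sql.Open("mysql", "ghostferry:changeme@tcp(source-db:3306)/test")
	if err != nil {
		log.Fatal(err)
	}
	target, err := sql.Open("mysql", "ghostferry:changeme@tcp(target-db:3306)/test")
	if err != nil {
		log.Fatal(err)
	}

	// Step 1: the application inserts r1 into the source.
	mustExec(source, "INSERT INTO t (id, val) VALUES (1, 'r1')")

	// Steps 2-3: the DataIterator reads the row in a transaction, but without
	// FOR UPDATE, so nothing blocks concurrent writers.
	tx, err := source.Begin()
	if err != nil {
		log.Fatal(err)
	}
	var id int
	var val string
	if err := tx.QueryRow("SELECT id, val FROM t WHERE id = 1").Scan(&id, &val); err != nil {
		log.Fatal(err)
	}

	// Step 4: the application updates the same row to r0 on the source.
	mustExec(source, "UPDATE t SET val = 'r0' WHERE id = 1")

	// Step 5: the BinlogWriter applies the r1 -> r0 event to the target. The
	// row does not exist there yet, so the UPDATE matches zero rows and the
	// change is silently lost.
	mustExec(target, "UPDATE t SET val = 'r0' WHERE id = 1 AND val = 'r1'")

	// Steps 6-8: the BatchWriter copies the stale value read in steps 2-3.
	mustExec(target, "INSERT IGNORE INTO t (id, val) VALUES (?, ?)", id, val)

	// Step 9: the DataIterator ends its read transaction.
	tx.Rollback()

	// The source now holds r0 while the target holds r1: corruption.
	var sourceVal, targetVal string
	source.QueryRow("SELECT val FROM t WHERE id = 1").Scan(&sourceVal)
	target.QueryRow("SELECT val FROM t WHERE id = 1").Scan(&targetVal)
	fmt.Printf("source=%s target=%s\n", sourceVal, targetVal)
}
```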

cc @Shopify/pods

kolbitsch-lastline commented 4 years ago

thanks for the detailed explanation! Indeed not the most straight-forward sequence of events or something one would think of immediately :-)

I agree that locking is the correct strategy, no doubt there. The problem in my scenario is that I cannot use locking, because of the environment in which I deploy (namely reading from a read-only GCP cloud-SQL slave DB). Not even a root/SUPER user has the privileges to lock on such a system.

From your explanation above, it seems to be related to transactions and rollbacks. I'm not entirely sure (but have a strong feeling) that a rolled-back transaction would never make it onto the slave, so the scenario would not occur (because the master reverts it and thus never writes the transaction's statements to the binlog). Does that sound reasonable?

Also note that our application/use-case guarantees that replicating from a slave is safe. Fail-overs will not occur unless it's guaranteed that the slave has received all data.

hkdsun commented 4 years ago

I think there's some misunderstanding here about the race condition @kolbitsch-lastline.

The fundamental issue is that the BatchWriter (aka DataIterator / TableIterator) requires the source data to remain unchanged while it commits its writes to the target database. Breaking the atomicity of "read data from source then write data to target" is what causes data corruption.

If there were no atomicity in the read-then-write operation, this is the race condition you'd encounter:

If all processes (binlog streamer, application, and table iterator) stop at this point, it's clear that the source table has r0 while the target table has r1.

hkdsun commented 4 years ago

I hope the last reply clears up the race condition for you.

Now, onto your issue, which is the lack of permissions to hold a transaction open on a read slave.

First of all, in the example explained above, the "application" can be equated to the MySQL replication thread on the read slave. Hence, the race condition still holds whether you're running ghostferry off a read slave (one that is guarded from your "real" application) or off a writer.

Secondly, with the current ghostferry implementation I strongly suspect you're out of luck, which makes us sad too!

Finally, I don't think all hope is lost, because internally we have discussed that mutually excluding the BinlogWriter from the DataIterator might be an alternative to the SELECT ... FOR UPDATE transaction. That is, moving the implementation of the atomicity up to the Ghostferry level. But again, sadly, we are not motivated to undertake this work because (1) we don't have an immediate use case for it and (2) it would lift a lot of complexity up into the Ghostferry code.
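To make that alternative concrete, a hypothetical coordinator at the Ghostferry level might look like the sketch below: batch copies share a read lock, and applying a binlog event takes the exclusive lock, so no event can land between a batch's source read and its target write. This is purely illustrative and has not been modeled or validated:

```go
// Purely illustrative: a hypothetical Ghostferry-level lock that replaces the
// SELECT ... FOR UPDATE transaction. Not ghostferry's implementation and not
// validated against the TLA+ model.
package sketch

import "sync"

type copyCoordinator struct {
	mu sync.RWMutex
}

// CopyBatch wraps one "read rows from source, write them to target" batch.
// Batches may run concurrently with each other (shared lock), but never
// concurrently with binlog application.
func (c *copyCoordinator) CopyBatch(copyBatch func() error) error {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return copyBatch()
}

// ApplyBinlogEvent takes the exclusive lock: it waits for in-flight batch
// copies to finish and blocks new ones while the event is applied, so no
// event can interleave between a batch's read and its write.
func (c *copyCoordinator) ApplyBinlogEvent(apply func() error) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	return apply()
}
```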

In any case, if you want to explore this, we can support you, but merging it upstream would require re-validation of the idea in TLA+ (the link directs you to the current model for our "read-then-write atomicity").

Let us know your thoughts, and thanks again for all the discussion!

kolbitsch-lastline commented 4 years ago

FYI, I'm convinced I had responded to your explanation above, but I can't find it anywhere... not sure if it was deleted, if I posted on a wrong ticket, or if I was being stupid and just never hit Comment. Either way: thanks a ton for your explanation.

I finally get the problem and agree that there has to be a lock. I find that putting the lock into the ghostferry application would be slightly cleaner, as it's less likely to interfere with the application/source DB.

I'm currently testing a patch, and I would love to see what you guys (and TLA+ ;-) ) have to say about it.

pushrax commented 3 years ago

I agree that the application-level lock is cleaner. I'm also interested to see it modeled, since I'm quite sure there is a correct variation.

The current implementation has been validated extensively in production, so changing it does add some risk, but it is clear the change would benefit some environments.