Implement two phase commit

canonical / dqlite

Embeddable, replicated and fault-tolerant SQL engine.

https://dqlite.io

Other

3.84k stars 216 forks source link

Implement two phase commit #46

Open freeekanayaka opened 6 years ago

freeekanayaka commented 6 years ago

When leadership is lost while applying a WalFrames command with commit=1, the Methods object has no way to tell if the log entry actually got committed or not. We should investigate implementing an additional command for two-phase commit, so we can be sure that the commit went through or not. See the TestIntegration_MethodHookLeadershipLost#wal frames sub-test.

paulstuart commented 4 years ago

@freeekanayaka, this issue is still a problem (I would like to help resolve it if I can). It's rather easy to recreate:

Create the test table create table simple (id integer primary key, other integer)
Start a long running loop that inserts the last known id value, e.g. insert into simple (other) values(?), where other is derived from results. LastInsertId(). The key point is that id and other should always be equal.
Start a second loop that iterates over the node numbers and transfers leadership to the next node.
There will be errors that occur in the primary loop, and on occasion there will be a mismatch between id and other. That is the bug in question.

paulstuart commented 4 years ago

In framesAbortBecauseLeadershipLost (replication.c), there's an if/else statement based on is_commit, but the handling is exactly the same for both cases.

freeekanayaka commented 4 years ago

Interesting breakdown. As first step, I'd suggest to put in place a unit test or at least a program that implements the procedure you outline and fails as you mention. With that at hand it should be easier to further investigate the issue, come up with a design for the solution, implement it and prove that it works (the unit test doesn't fail anymore).

The term "two phase commit" is probably inappropriate, as raft is by itself two-phase (a quorum is needed).

I suspect the issue here has more to do with client and server behavior when leadership is lost. The raft paper describes roughly what should happen: an operation ID should be maintained for each client request, if a request (such as committing the transaction performing the INSERT) fails then the client should retry the request, presenting the same operation ID to the new leader. In turn, the new leader should either perform the request, or no-op it if it turns out that the request was actually performed, but the client failed to receive the confirmation because the leader it initially submitted it to had died and could not notify the client back.

freeekanayaka commented 4 years ago

So far I've deferred addressing this issue since I suspect it requires a fair amount of thinking and work, however I still intend to nail when I'll have some time.

paulstuart commented 4 years ago

I'd like to do anything I can to lessen the load for you, as this is important to my project. Your original notes appear to be out of date, so any further brain dumps would be welcomed.

Testing this issue is a pain because it requires running a cluster under load and simultaneously hammering it with repeated transfers (or server restarts) until the magic moment occurs.

One thought was to add a "sleep" function to sqlite statements to create a long running transaction that is easier to test such actions mid-transaction. If you think that would be valuable I'd be happy to get that going.

freeekanayaka commented 4 years ago

Yes, those notes are out of date. The brain dump is basically what I wrote (assuming the issue is what I think it is), although that's admittedly hand-waving.

As said, coming up with a program that if ran long enough eventually reproduces the error would be probably a very good start.