cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

Consistency anomaly detected by a Jepsen bank transfer test #4687

Closed. knz closed this issue 8 years ago.

knz commented 8 years ago

The bank test is a transfer test that should preserve the total sum. The anomaly is that the sum is not preserved.

The test scenario, as encoded in https://github.com/cockroachdb/jepsen/blob/master/cockroachdb/src/jepsen/cockroach.clj#L839:

create table accounts (id int not null primary key, balance bigint not null)
insert into accounts values (0, 10)
insert into accounts values (1, 10)
insert into accounts values (2, 10)
insert into accounts values (3, 10)
-- in a loop: transfer(rand, rand, rand) where:
   transfer(From, To, Amount) =
      begin
         b1 := (select balance from accounts where id = From) - Amount
         b2 := (select balance from accounts where id = To) + Amount
         if b1 >= 0 and b2 >= 0 then
            update accounts set balance = b1 where id = From
            update accounts set balance = b2 where id = To
         endif
      commit

From time to time the test also runs select balance from accounts to check the intermediate states of the table; that is how the inconsistency is detected.
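
For concreteness, here is a rough Clojure/JDBC sketch of a single transfer as described above. It is not the actual test code (see cockroach.clj linked above for that); the helper names and the isolation option are assumptions.

(require '[clojure.java.jdbc :as j])

(defn transfer!
  "Moves amount from one account to another in a single transaction,
  skipping both updates if either balance would go negative.
  Sketch only; the :isolation option is an assumption."
  [db from to amount]
  (j/with-db-transaction [c db {:isolation :serializable}]
    (let [balance (fn [id]
                    (-> (j/query c ["select balance from accounts where id = ?" id])
                        first
                        :balance))
          b1 (- (balance from) amount)
          b2 (+ (balance to) amount)]
      (when (and (>= b1 0) (>= b2 0))
        (j/update! c :accounts {:balance b1} ["id = ?" from])
        (j/update! c :accounts {:balance b2} ["id = ?" to])))))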

The errors/traces: http://52.91.194.28/cgi-bin/display.py#20160226T122420 http://52.91.194.28/cgi-bin/display.py?path=cockroachdb-bank%2F20160226T122420.000Z%2Fresults.edn http://52.91.194.28/cockroachdb-bank/20160226T122420.000Z/history.txt

http://52.91.194.28/cgi-bin/display.py#20160226T121719 http://52.91.194.28/cgi-bin/display.py?path=cockroachdb-bank%2F20160226T121719.000Z%2Fresults.edn http://52.91.194.28/cockroachdb-bank/20160226T121719.000Z/history.txt

http://52.91.194.28/cgi-bin/display.py#20160226T121608 http://52.91.194.28/cgi-bin/display.py?path=cockroachdb-bank%2F20160226T121608.000Z%2Fresults.edn http://52.91.194.28/cockroachdb-bank/20160226T121608.000Z/history.txt

In all 3 cases the state becomes inconsistent at about the time a network partition is resolved (:nemesis :info :stop "fully connected")

Nothing special in the server logs (all available from the UI in the rightmost column at http://52.91.194.28/)

Perhaps not relevant, but I also noticed the following oddity on one node:

W160226 12:24:45.486123 storage/replica.go:1774 could not GC completed transaction: context deadline exceeded
W160226 12:24:47.699457 storage/replica.go:1873 unable to resolve local intents; range 0: replica <nil> not leader; leader is <nil>

Source: http://52.91.194.28/cgi-bin/display.py?path=cockroachdb-bank%2F20160226T122420.000Z%2Fn1l%2Fcockroach.stderr

tbg commented 8 years ago

Against which revision?

tbg commented 8 years ago

alpha.v1-455-ga05cb9c

tbg commented 8 years ago

http://52.91.194.28/cockroachdb-bank/20160226T122420.000Z/history.txt

4   :ok :transfer   {:from 3, :to 1, :amount 2}
0   :invoke :read   nil
0   :ok :read   [33 6 1 0]
2   :invoke :transfer   {:from 2, :to 1, :amount 1}
2   :ok :transfer   {:from 2, :to 1, :amount 1}
21  :ok :read   [33 6 1 0]
0   :invoke :transfer   {:from 2, :to 1, :amount 1}
0   :fail   :transfer   [:negative 2 -1]
23  :invoke :read   nil
2   :invoke :read   nil
2   :ok :read   [32 7 0 0]
0   :invoke :read   nil
0   :ok :read   [32 7 0 0]
:nemesis    :info   :stop   "fully connected"
4   :invoke :read   nil
4   :ok :read   [32 7 0 0]

This is the part of the log where things go bad - 0 and 2 both subtract one from the third account (which initially has only one) and erroneously don't push it into the negative. What does the error 0 :fail :transfer [:negative 2 -1] tell me?

bdarnell commented 8 years ago

That error means that the transfer did not occur because it would make one of the accounts negative. (format is [:negative $account $new_balance]). So that part of the log looks fine to me. The problem is that account 0 is decreasing from 33 to 32 even though it's not involved in any of the transfers visible in this snippet.

bdarnell commented 8 years ago

These two errors are very suspicious (the first is the conclusion of the "23 :invoke :read nil" line quoted above). Where do we change isolation levels in this test?

23 :fail :read nil PSQLException: Cannot change transaction isolation level in the middle of a transaction.

43 :fail :transfer {:from 2, :to 0, :amount 1} PSQLException: Cannot change transaction isolation level in the middle of a transaction.

knz commented 8 years ago

@bdarnell the client code doesn't (I checked with network traces). The JDBC wrapper does emit prepare/bind/execute for SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL SERIALIZABLE before most of its uses of BEGIN, but that ought to make no difference.

knz commented 8 years ago

Some more:

http://52.91.194.28/cgi-bin/display.py?entry=20160226T152821#20160226T152821 http://52.91.194.28/cgi-bin/display.py?entry=20160226T161022#20160226T161022 http://52.91.194.28/cgi-bin/display.py?entry=20160226T161503#20160226T161503 http://52.91.194.28/cgi-bin/display.py?entry=20160226T161613#20160226T161613 http://52.91.194.28/cgi-bin/display.py?entry=20160226T161834#20160226T161834

tbg commented 8 years ago

This one here is interesting (from the original log, but I will look into similar occurrences in the others):

4   :ok :read   [24 8 5 3]
0   :invoke :transfer   {:from 1, :to 0, :amount 0}
0   :ok :transfer   {:from 1, :to 0, :amount 0}
11  :invoke :transfer   {:from 0, :to 1, :amount 1}
4   :invoke :transfer   {:from 0, :to 1, :amount 1}
4   :ok :transfer   {:from 0, :to 1, :amount 1}
0   :ok :read   [23 9 5 3]
0   :ok :read   [23 5 9 3]
2   :ok :read   [23 5 9 3]

This is well before the actual failure, but client 11's transfer of one is never acknowledged and never appears to execute, yet it matches the deduction from account 0 which causes the anomaly.

knz commented 8 years ago

Are you saying you see neither :ok, :fail, nor :timeout for it in the log? That is very strange. Perhaps a fault in the test logic too, then. (But the inconsistency remains.)


tbg commented 8 years ago

I only see a single entry with actor 11 in the log. Maybe that guy just got stuck on his first transfer indefinitely somehow? Or maybe I'm misreading the log.

tbg commented 8 years ago

Just going to work my way through these. Each snippet runs from the last "coherent" read to the next broken one.

http://52.91.194.28/cgi-bin/display.py?entry=20160226T152821#20160226T152821:

2   :ok :read   [24 0 13 3]
1   :invoke :transfer   {:from 2, :to 3, :amount 3}
1   :ok :transfer   {:from 2, :to 3, :amount 3}
1   :invoke :transfer   {:from 0, :to 2, :amount 0}
1   :ok :transfer   {:from 0, :to 2, :amount 0}
2   :invoke :transfer   {:from 2, :to 0, :amount 4}
2   :ok :transfer   {:from 2, :to 0, :amount 4}
24  :ok :transfer   {:from 3, :to 1, :amount 1}
24  :invoke :read   nil
24  :ok :read   [28 1 6 2] // off by [0, 0, 2, -4]

pending:

13  1  // client 13 has a write which never returned: 4 from 3 to 1 -> the deduction could match -4 above
4   1  // client 4 has a write which never returned: 1 from 1 to 2 -> the deposit could match +2 above

tbg commented 8 years ago

Next one: http://52.91.194.28/cgi-bin/display.py?entry=20160226T161022#20160226T161022

0       :ok     :read   [10 2 9 19]
24      :ok     :transfer       {:from 3, :to 1, :amount 1}
22      :invoke :transfer       {:from 0, :to 2, :amount 1}
22      :ok     :transfer       {:from 0, :to 2, :amount 1}
1       :invoke :transfer       {:from 1, :to 2, :amount 4}
1       :fail   :transfer       [:negative 1 -1]
3       :invoke :transfer       {:from 1, :to 2, :amount 4}
3       :fail   :transfer       [:negative 1 -1]
0       :invoke :transfer       {:from 1, :to 3, :amount 0}
0       :ok     :transfer       {:from 1, :to 3, :amount 0}
1       :invoke :transfer       {:from 1, :to 3, :amount 3}
1       :ok     :transfer       {:from 1, :to 3, :amount 3}
24      :invoke :transfer       {:from 0, :to 2, :amount 4}
24      :ok     :transfer       {:from 0, :to 2, :amount 4}
3       :invoke :transfer       {:from 0, :to 3, :amount 1}
3       :ok     :transfer       {:from 0, :to 3, :amount 1}
22      :invoke :read   nil
22      :ok     :read   [6 0 12 24]

pending:

Actor cur_actor:        1 ops
* 19    :info   :transfer       :timeout
* 19    :invoke :transfer       {:from 3, :to 0, :amount 2}
Actor cur_actor:        1 ops
* 4     :info   :transfer       :timeout
* 4     :invoke :transfer       {:from 2, :to 3, :amount 4}
Actor cur_actor:        1 ops
* 9     :info   :transfer       :timeout
* 9     :invoke :transfer       {:from 2, :to 3, :amount 2}

Here we would have expected to read [4, 0, 14, 22] but we got [6, 0, 12, 24]. Taking into account the never-acked (i.e. timed-out) ops, the first one likely executed (giving us the 6 on the first account). So we have [6,0,12,22] but see [6,0,12,24]. This could have been caused by the second lost write never happening, and the third one erroneously only depositing, but not deducting.

Do you happen to know whether both the updates and the final commit are sent individually? That would make a big difference for figuring out whether this is a replay-related issue or something with stale reads.

bdarnell commented 8 years ago

As far as I can tell there's no batching (j/update! returns the number of rows affected, even though it's not used here; the only way these statements could be transparently batched would be if something were clever enough to see that the return value is discarded). The BEGIN might be batched with the first read, but the second read and both updates (if they occur) are each sent one at a time (waiting for each response), and then the COMMIT happens on its own.

I suspect that the problem has to do with mismanagement of the transaction status. We see both "Cannot change transaction isolation level in the middle of a transaction" and "ERROR: there is no transaction in progress". Each failure has at least one of these messages in (very) rough proximity to the first faulty read (although this is far from conclusive, because these messages also occur in many passing runs). It looks like the "cannot change isolation" message comes from the JDBC driver, while the "no transaction in progress" message comes from our server (the serverMsgReady message we send includes the transaction status, which is how the JDBC driver could believe a transaction was in progress).

tbg commented 8 years ago

Hmm. Are you suggesting that the transaction switches to SNAPSHOT or something like that? Seems weird. Maybe we should just write down everything that goes through Executor to a file? Or maybe we're accidentally nesting the transactions in some funny way?

bdarnell commented 8 years ago

No, I don't think it's switching to SNAPSHOT; there's nothing I can see that could do that. I think that in some cases we're failing to clear the session.Txn field so a future operation may inherit the ID of a transaction that never committed.

tbg commented 8 years ago

Interesting. Some of the code in executor about terminating transaction indeed did seem sketchy when I looked at it with @andreimatei the other day. Should be able to whip up a patch that removes some of the sketchiness.

andreimatei commented 8 years ago

Maybe we already removed the sketchiness? R, could you please try with https://github.com/cockroachdb/cockroach/pull/4576 patched in?


knz commented 8 years ago

@andreimatei I don't get a clean merge on master, can you rebase?

knz commented 8 years ago

Now with complete network traces, starting with http://52.91.194.28/cgi-bin/display.py?entry=20160227T110036

(Rightmost column)

knz commented 8 years ago

@tschottdorf as you requested, a result with no zero transactions: http://52.91.194.28/cgi-bin/display.py?entry=20160227T112601

bdarnell commented 8 years ago

Thanks for the packet captures! Looking at this one, I've found some strange behavior. The tag field in the CommandComplete message turns out to be a good way to get an overview of what's going on. Here's the tshark invocation to show all the tags:

tshark -r Downloads/trace.pcap -d tcp.port==26257,pgsql -T fields -e frame.number -e tcp.stream -e pgsql.tag  -Y pgsql.tag

This prints three columns: the frame number (for identification), the stream number (to make sure we're not looking at a mix of two TCP connections), and the command tag. Normal traffic has a few different transaction patterns:

Successful update:

212     1       BEGIN
219     1       SELECT 1
227     1       SELECT 1
235     1       UPDATE 1
243     1       UPDATE 1
248     1       COMMIT

Cancelled update (insufficient balance):

272     1       BEGIN
277     1       SELECT 1
282     1       SELECT 1
287     1       COMMIT

Aborted update (other errors; rollback shows up with an empty tag):

2683    1       BEGIN
2686    1       SELECT 1
2691    1       SELECT 1
2698    1       UPDATE 1
2710    1       

Read everything:

311     1       BEGIN
316     1       SELECT 4
321     1       COMMIT

And a surprising number of SET and SHOW commands dealing with the isolation level.

After an ordinary transaction commits at frame 4263, something goes off the rails. We see a bunch of consecutive SHOWs and SETs, then one transaction that does five selects and four updates:

4263    1       COMMIT
4269    1       SET
4275    1       SHOW 1
4282    1       SET
4286    1       BEGIN
4294    1       
4300    1       SHOW 1
4305    1       SHOW 1
4310    1       SHOW 1
4315    1       SHOW 1
4320    1       SET
4325    1       SET
4329    1       SET
4334    1       SET
4338    1       BEGIN
4348    1       
4353    1       BEGIN
4359    1       SELECT 1
4364    1       SELECT 1
4370    1       SET
4374    1       SELECT 1
4381    1       SELECT 1
4387    1       SET
4394    1       UPDATE 1
4401    1       UPDATE 1
4408    1       SELECT 1
4415    1       UPDATE 1
4422    1       UPDATE 1
4427    1       COMMIT

There is a seven-second gap between the BEGIN at frame 4286 and the rollback at frame 4294 (in between, the transaction attempted to do the consistent read and failed after 7s with an "encountered future timestamp" error). The rollback at 4348 is another future-timestamp error on a consistent read, although the time gap here is less than 1 second.

By looking at the timestamps in the error message, we can see that this corresponds to this segment of history.txt, which is also where the accounts go from 19+2+19+0=40 to 21+0+19+2=42.

0   :ok :read   [19 2 19 0]
1   :invoke :transfer   {:from 2, :to 0, :amount 0}
1   :ok :transfer   {:from 2, :to 0, :amount 0}
0   :invoke :transfer   {:from 0, :to 1, :amount 0}
24  :invoke :transfer   {:from 1, :to 3, :amount 2}
0   :fail   :transfer   {:from 0, :to 1, :amount 0} BatchUpdateException: Batch entry 0 UPDATE accounts SET balance = 2 WHERE id = 1 was aborted.  Call getNextException to see the cause.
ERROR: retry txn "sql/executor.go:453 sql" id=b17c022e key=/Table/51/1/0/2/1 rw=true pri=0.00539324 iso=SERIALIZABLE stat=PENDING epo=1 ts=1456570874.194177909,1 orig=1456570874.194177909,1 max=1456570874.443595548,0
24  :ok :transfer   {:from 1, :to 3, :amount 2}
42  :fail   :read   nil PSQLException: ERROR: read at time 1456570873.575008201,0 encountered previous write with future timestamp 1456570874.250578728,0 within uncertainty interval
23  :invoke :read   nil
24  :invoke :transfer   {:from 2, :to 1, :amount 3}
23  :ok :read   [21 0 19 2]

Clojure uses a mix of prepared and non-prepared statements, so I don't have a slick way to dump that messed-up transaction, but here's a by-hand reconstruction:

4351 BEGIN; select balance from accounts where id = 2
4359 result: 19
4362 select balance from accounts where id = 3
4364 result: 2
4372 select balance from accounts where id = 2
4374 result: 19
4377 select balance from accounts where id = 0
4381 result: 19
4389 update accounts set balance = 17 where id = 2
4397 update accounts set balance = 2 where id = 3
4404 select balance from accounts where id = 1
4408 result: 0
4411 update accounts set balance = 19 where id = 2
4418 update accounts set balance = 21 where id = 0
4425 commit

So we have two transfers (actually three, but the third didn't complete) executing simultaneously in the same transaction. One is moving 2 from account 2 to 0, and the other is "moving" 0 from 3 to 2, which has the effect of overwriting the previous change with its original value. Those transfers don't exist in the quoted snippet of history.txt, but attempted transfers do appear elsewhere in the file (by actors 32 and 37), which later show up as timeouts.

This is looking to me like a client bug (I've seen just such a bug in sqlalchemy's connection pool in the past): a connection is being returned to the pool after a timeout but the original thread is not properly interrupted, and eventually will continue while a new thread is using the same connection.
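
To make the suspected failure mode concrete, here is a hypothetical Clojure sketch (the names are made up; this is not the actual Jepsen timeout macro): the wrapper gives up on the worker future after the timeout but leaves the connection untouched, so the worker can keep issuing statements on it while the next operation reuses the very same connection.

(defn unsafe-with-timeout
  "BUGGY pattern (illustration only): abandon the worker on timeout but keep
  using its connection."
  [conn timeout-ms op next-op]
  (let [worker (future (op conn))                     ; e.g. a transfer transaction
        result (deref worker timeout-ms ::timed-out)]
    (if (= result ::timed-out)
      ;; The worker thread is not actually stopped here; its remaining
      ;; statements interleave with next-op's statements on one server session.
      (next-op conn)
      result)))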

bdarnell commented 8 years ago

In this test, timeouts are implemented using this macro, which is sketchy. In the JVM, threads only check their interrupt status during certain operations, so it is possible for a thread to continue for some time after being interrupted. Whenever there is a timeout, we need to close the connection and open a new one. (It would be possible, if the underlying JDBC driver checks the interruption status in all the right places, for the interrupted thread to roll back its transaction and then acknowledge the interrupt so the connection can be used again, but that's tricky to get right.)
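
A minimal sketch of that approach, assuming a bare java.sql.Connection and a connect-fn that opens a fresh one (again, not the actual test code):

(defn with-timeout-and-reconnect
  "Runs (op conn) with a timeout. On timeout, closes conn and hands back a
  fresh connection so a still-running worker can never share it with a later
  operation. Returns [result connection-to-use-next]. Sketch only."
  [conn connect-fn timeout-ms op]
  (let [worker (future (op conn))
        result (deref worker timeout-ms ::timed-out)]
    (if (= result ::timed-out)
      (do (future-cancel worker)                  ; best effort only
          (.close ^java.sql.Connection conn)      ; poison the old connection
          [::timed-out (connect-fn)])
      [result conn])))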

tbg commented 8 years ago

:+1: great analysis @bdarnell. I sat down anyway and am porting the nemesis stuff to Docker, so we'll have this test in acceptance soon enough (and we'll also be able to run all the others with network partitions).

knz commented 8 years ago

OK thanks for the tip. I'll force opening a new connection on timeouts and see what happens.

knz commented 8 years ago

OK, it looks like this issue disappears after forcing reconnections. @bdarnell many thanks for the weekend analysis!

tbg commented 8 years ago

Great!


knz commented 8 years ago

Closed via https://github.com/cockroachdb/jepsen/pull/11.

knz commented 8 years ago

( #4036 )