This doesn't look particularly scary, but it does appear to be new. I'll be curious to see whether it happens again.
Bisected this to f67a508828a3faf667df1c2ebb72a77c23f2f525! Reproduces quite readily on my azworker with build/builder.sh make stress PKG=./pkg/storage TESTS=TestRaftRemoveRace STRESSFLAGS='-p 200'.
SHA: https://github.com/cockroachdb/cockroach/commits/4e68800f6b47c22e142b48fdb390a0e5d249a8cf
Parameters:
TAGS=
GOFLAGS=-race
Stress build found a failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=253814&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/11987dc7e2f20f6417925c638f45c2067435b17f
Parameters:
TAGS=
GOFLAGS=-race
Stress build found a failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=256725&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/7fb6d983651d77cbd200f3553ce842a82fcf30ff
Parameters:
TAGS=
GOFLAGS=-race
Stress build found a failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=257860&tab=buildLog
[03:26:27][Step 2/2] client_test.go:1027: [n1,s1,r1/1:/M{in-ax}]: can't set pending snapshot index to 1; pending snapshot already present: 141
Reassigning to @petermattis
Reproduces readily enough:
~ make stressrace PKG=./storage/ TESTS=TestRaftRemoveRace
...
--- FAIL: TestRaftRemoveRace (8.56s)
client_test.go:1027: [n1,s1,r1/1:/M{in-ax}]: can't set pending snapshot index to 1; pending snapshot already present: 119
FAIL
ERROR: exit status 1
9 runs completed, 1 failures, over 1m0s
FAIL
I'm seeing what appears to be an incorrect retry. The test is executing a series of AdminChangeReplicas operations, adding and removing the same replica over and over. The range is replicated to 10 nodes. At some point, the AdminChangeReplicas operation takes too long and we see:
I170526 14:05:54.162334 9 kv/dist_sender.go:1155 r1: sending batch 1 AdmChangeReplicas to (n1,s1):1
I170526 14:05:54.679806 9 kv/dist_sender.go:1180 timeout, trying next peer: (n2,s2):2
I170526 14:05:54.796071 9 kv/dist_sender.go:1312 error, trying next peer: (n10,s10):10
I170526 14:05:54.797355 9 kv/dist_sender.go:1312 error, trying next peer: (n4,s4):4
I170526 14:05:54.798431 9 kv/dist_sender.go:1312 error, trying next peer: (n5,s5):5
I170526 14:05:54.799595 9 kv/dist_sender.go:1312 error, trying next peer: (n6,s6):6
I170526 14:05:54.829640 9 kv/dist_sender.go:1312 error, trying next peer: (n7,s7):7
I170526 14:05:54.888469 9 kv/dist_sender.go:1312 error, trying next peer: (n8,s8):8
I170526 14:05:54.921464 9 kv/dist_sender.go:1312 error, trying next peer: (n9,s9):9
I170526 14:05:54.922616 9 kv/dist_sender.go:1312 error, trying next peer: (n3,s3):13
Not shown is that the last error is RangeNotFound. This causes DistSender.sendToReplicas to return (even though there is still a pending RPC to n1). DistSender.sendPartialBatch then merrily retries the AdminChangeReplicas and we send the same operation to n1 again and hit the above error.
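To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern just described. This is illustrative only, not the actual DistSender code: Replica, sendNextTimeout, and the channel plumbing are all simplified stand-ins.

```go
package main

import (
	"context"
	"time"
)

// Replica is an illustrative stand-in for the real type in pkg/roachpb.
type Replica struct{ NodeID int }

const sendNextTimeout = 500 * time.Millisecond // stand-in for SendNextTimeout

// sendToReplicas sketches the early-return problem. After sendNextTimeout
// fires, the RPC to the first replica (n1) is left running while we fan
// out to the remaining peers. If the last untried peer fails with
// RangeNotFound, we return that error even though n1's RPC is still
// pending -- so the caller retries an operation that may already be
// executing.
func sendToReplicas(ctx context.Context, replicas []Replica, send func(Replica) error) error {
	done := make(chan error, len(replicas))
	pending, next := 0, 0

	trySend := func() {
		pending++
		r := replicas[next]
		next++
		go func() { done <- send(r) }()
	}

	trySend()
	for {
		select {
		case err := <-done:
			pending--
			if err == nil {
				return nil
			}
			if next < len(replicas) {
				trySend() // "error, trying next peer"
				continue
			}
			// BUG: every replica has been *tried*, but pending may still
			// be > 0 (the slow RPC to n1). Returning the RangeNotFound
			// error here makes sendPartialBatch retry the whole batch.
			return err
		case <-time.After(sendNextTimeout):
			if next < len(replicas) {
				trySend() // "timeout, trying next peer"
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```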
Two things seem suspicious here:
1. SendNextTimeout is way too short for something like AdminChangeReplicas.
2. DistSender.sendToReplicas shouldn't return on the first RangeNotFoundError if there is still an outstanding RPC (a sketch of what that fix might look like follows below).
@spencerkimball This is in your area of expertise (or, at least, you touched it recently).
@tamird Weren't you looking at the DistSender timeouts recently?
Also @bdarnell, since you've had thoughts about timeouts recently.
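For the second point above, a hedged sketch of what the fix could look like, reusing the stand-in types and sendNextTimeout from the earlier sketch (again illustrative, not the real DistSender code):

```go
// sendToReplicasPatched treats a peer's error as non-terminal while any
// RPC is still in flight: we only give up once every replica has been
// tried *and* has replied.
func sendToReplicasPatched(ctx context.Context, replicas []Replica, send func(Replica) error) error {
	done := make(chan error, len(replicas))
	pending, next := 0, 0

	trySend := func() {
		pending++
		r := replicas[next]
		next++
		go func() { done <- send(r) }()
	}

	trySend()
	for {
		select {
		case err := <-done:
			pending--
			if err == nil {
				return nil
			}
			if next < len(replicas) {
				trySend() // error, trying next peer
			} else if pending == 0 {
				return err // truly exhausted: every replica tried and replied
			}
			// Otherwise keep waiting: an earlier RPC (e.g. n1's) is still
			// outstanding and may yet succeed, so a RangeNotFound from a
			// later peer must not trigger a spurious upstream retry.
		case <-time.After(sendNextTimeout):
			if next < len(replicas) {
				trySend() // timeout, trying next peer
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```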
I did look at the timeouts, but ended up only increasing the abandonment timeout (#16088), which shouldn't affect this scenario given your description above.
There seems to be growing doubt about the usefulness of SendNextTimeout, and in general the idea of sending RPCs concurrently to multiple replicas. It's pretty subtle to get right (and we haven't), especially in the absence of compiler assistance (e.g. pattern matching exhaustiveness checking would force us to deal with every error type). Perhaps we should prototype removing these concurrent RPCs and measure the effect on performance (I'd expect tail latency to suffer somewhat).
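A minimal sketch of that prototype, under the same stand-in types as the sketches above: drop the speculative fan-out entirely and fall back to one RPC at a time. Error handling becomes trivially exhaustive because there is never more than one RPC in flight; the open question is what this costs in tail latency when the first replica is slow.

```go
// sendSequential tries one replica at a time, with no SendNextTimeout and
// no concurrent RPCs. Only the last error is reported if every peer fails.
func sendSequential(ctx context.Context, replicas []Replica, send func(Replica) error) error {
	var lastErr error
	for _, r := range replicas {
		if err := ctx.Err(); err != nil {
			return err
		}
		if lastErr = send(r); lastErr == nil {
			return nil
		}
		// error, trying next peer
	}
	return lastErr
}
```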
Ah, actually this might have been exacerbated by #16088, which also introduced not waiting on outstanding RPCs if there's no EndTransaction in the batch (https://github.com/cockroachdb/cockroach/pull/16088#discussion_r118290624).
For context, removing SendNextTimeout has been discussed in https://github.com/cockroachdb/cockroach/issues/16119#issuecomment-304400427 and https://github.com/cockroachdb/cockroach/issues/6719#issuecomment-283468686. SendNextTimeout was set to its current value in #6728, after we (incorrectly) concluded that #6688 had resolved all the issues with this mechanism.
> It's pretty subtle to get right (and we haven't)
It's certainly fragile and easy to break (and we have), but with the waiting restored in #16181 I think things are back in working order. Are there any remaining known issues?
Either way, this feature is causing a lot of trouble for something that we have a hard time even articulating the benefit of. We should increase base.DefaultSendNextTimeout to a high value and see what that does to tail latencies. I expect that will show it's safe to remove, and/or give us ideas about less tricky alternatives.
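A minimal sketch of that experiment, assuming base.DefaultSendNextTimeout is an exported, settable variable (if it is a constant in the actual tree, this would instead be a one-line edit at its definition):

```go
import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/base"
)

func init() {
	// Effectively disable the speculative fan-out for the experiment; if
	// tail latencies stay acceptable, the mechanism is a candidate for
	// removal. The 10m value matches the proposal above.
	base.DefaultSendNextTimeout = 10 * time.Minute
}
```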
10m is a very high timeout. This might break the Jepsen tests, where we routinely see an RPC lost in a network partition, a new Raft leader elected, and the code in DistSender is expected to retry and find that new leader to complete the original request. 10s would be far more appropriate.
10m is intended to test the hypothesis that we can remove this timeout completely - if setting it to a high value causes problems, we want to know about it so we can see if there are better ways of handling them than this timeout. In the network partition scenario you describe, the gRPC connection health checking should abort the connection well before we'd hit a 10s timeout.
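For reference, a minimal sketch of the client-side health checking being relied on here (parameter values are illustrative, not CockroachDB's actual settings): with keepalive pings every 3s and a 3s ack timeout, a partitioned connection is torn down in roughly 6s, before even a 10s SendNextTimeout would fire.

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// dialWithHealthCheck opens a connection that proactively detects dead
// peers instead of waiting for an RPC-level timeout.
func dialWithHealthCheck(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithInsecure(), // sketch only; real nodes use TLS credentials
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                3 * time.Second, // ping after 3s of inactivity
			Timeout:             3 * time.Second, // drop the conn if no ack within 3s
			PermitWithoutStream: true,            // ping even with no active RPCs
		}))
}
```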
This hasn't been seen since #16181.
SHA: https://github.com/cockroachdb/cockroach/commits/e41b5f3c13f492369d48d09c921b3732209f11e2
Parameters:
Stress build found a failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=242304&tab=buildLog