We've seen this in the past. I need to refresh my memory here. Also see https://github.com/cockroachdb/cockroach/issues/9037 and https://github.com/cockroachdb/cockroach/issues/11591.
This error (`out of range [lastIndex(0)]`) is the same message as #11591 (fixed by #11699) and #12863 (still open). #9037 looks a little different. This could easily be the same underlying race as #14193. I don't have any particular insights right now; I'd need to page this stuff back in (I don't think any of this has changed recently, so the bug has probably been there a while).
This is the exact same bug as in #14193 (and #12574). The only difference from the explanation in https://github.com/cockroachdb/cockroach/issues/14193#issuecomment-287493273 is that rather than the replica without a valid key range going through replica GC and the panic happening in `removeReplicaImpl`, the replica is receiving some other raft message and attempting to process it, causing the panic to happen in raft because the `raft.RawNode` was created with a default (i.e. 0) last applied index.

And as @bdarnell pointed out, #11591 and #12863 appear to be the same bug, judging by the non-specified key range and the 0 `lastIndex`.
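For context, here's a minimal, self-contained sketch of the invariant that etcd/raft enforces here (toy types, a paraphrase of the check in `raft/log.go`, not CockroachDB's or etcd's actual code): a `RawNode` built from empty state has a last index of 0, so the first message carrying a nonzero commit index panics.

```go
package main

import "fmt"

// raftLog is a toy stand-in for etcd/raft's log type; only what's needed to
// illustrate the invariant is modeled here.
type raftLog struct {
	committed uint64
	entries   []uint64 // stand-in entries; len(entries) plays the role of lastIndex
}

func (l *raftLog) lastIndex() uint64 { return uint64(len(l.entries)) }

// commitTo mirrors the shape of the check in etcd's raft/log.go (paraphrased):
// the commit index may never move past lastIndex. A RawNode created from
// empty storage has lastIndex() == 0, so the first heartbeat carrying a
// nonzero commit index trips the panic seen in this issue.
func (l *raftLog) commitTo(tocommit uint64) {
	if l.committed < tocommit {
		if l.lastIndex() < tocommit {
			panic(fmt.Sprintf("tocommit(%d) is out of range [lastIndex(%d)]",
				tocommit, l.lastIndex()))
		}
		l.committed = tocommit
	}
}

func main() {
	l := &raftLog{} // a replica created with no on-disk state
	l.commitTo(7)   // panics: tocommit(7) is out of range [lastIndex(0)]
}
```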
Closing as a dupe.
The above analysis was from reading the code, but I got a fresh failure of this with `--verbosity=3`, and while it's still essentially the same problem as #14193, the series of steps involved is slightly different:
- An `ADD_REPLICA` operation for replica 11
- ... `Replica` object being destroyed, i.e. the pointer address stayed the same)
- A `MsgHeartbeat` caused a new replica to be created for the range (with ID 11), and the processing of that heartbeat is when the panic happened

It's worth noting that this is still reproducible even with #14306, so there's still more understanding to be done, as mentioned in https://github.com/cockroachdb/cockroach/issues/14193#issuecomment-288199359
Alright, I have this tracked down as well.
- After `store.removeReplicaImpl` has been run on replica 11, the replica gets added back into the replica GC queue due to a goroutine in `replica.handleReplicatedEvalResult` that was handling the raft command for removing replica 11.
- `store.removeReplicaImpl` destroys the old replica. Note that it uses the "consistent" descriptor passed down to it by the replica GC queue for writing the replica tombstone. This means that it used the old view of the range that thought 12 was the next replica ID when in reality it had already been operating as replica 12.
- A `MsgHeartbeat` comes in for replica 12 because it hasn't yet been removed from the raft group.

Note that step 7 isn't actually relevant, I just forgot to scrub it before posting this.
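To make the tombstone hazard concrete, here's a toy sketch (hypothetical names, not the real `store` API) of why a tombstone written from the stale "consistent" descriptor fails to block traffic addressed to replica 12:

```go
package main

import "fmt"

// tombstone is a toy model of the replica tombstone written by replica GC;
// the field name mirrors the idea, not CockroachDB's actual type.
type tombstone struct {
	nextReplicaID int
}

// shouldDropMessage models the check a tombstone enables: incoming raft
// messages for replica IDs below nextReplicaID are known to be stale.
func shouldDropMessage(t tombstone, msgReplicaID int) bool {
	return msgReplicaID < t.nextReplicaID
}

func main() {
	// The GC queue's stale "consistent" descriptor still said next=12,
	// even though the store had already been operating as replica 12.
	t := tombstone{nextReplicaID: 12}
	fmt.Println(shouldDropMessage(t, 11)) // true: replica 11 traffic is blocked
	fmt.Println(shouldDropMessage(t, 12)) // false: replica 12 traffic gets through
}
```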
> The raft leader sends a MsgApp to replica ID 12. At this point replica 11 is still around in the store, so it gets its ID incremented before processing the MsgApp and responding.
Can you point to the code location where this happens? It seems Wrong™ to set the replica ID of any replica that already has a non-zero ID.
(mostly copied from https://github.com/cockroachdb/cockroach/issues/14193#issuecomment-288199359)
`raftTransport.RaftMessageBatch` -> `raftTransport.handleRaftRequest` -> `store.HandleRaftRequest` (-> `store.uncoalesceBeats`) -> `store.HandleRaftUncoalescedRequest` (`scheduler.EnqueueRaftRequest` -> `scheduler.worker` -> `store.processRaftRequest` if `respStream` is non-nil) -> `store.processRaftRequest` -> `store.getOrCreateReplica` -> `store.tryGetOrCreateReplica` -> `replica.SetReplicaID`, with the logic for changing the ID living in `replica.setReplicaIDLocked`.

As covered in some of the discussion in #14193, it's considered valid for a replica ID to be increased. It seems unintuitive to me too, but it's currently meant to be allowed.
Yeah, it's odd (but working as originally intended) that a replica can transition directly between two non-zero IDs. The original reason for this is that it took a long time for replicas to get GC'd (before the introduction of raft responses). Now that we have them (along with other kinds of eager GC), maybe we should do something else with these messages (just drop them?) instead of allowing them to increase the replica ID of a live replica.
It'd probably be ideal to synchronize the ID change with replica GC somehow -- either we bump the replica ID up OR we let replica GC take place, but not both. Doing the GC after the replica ID increase forces the raft group to have to remove the member with the newly GC'ed ID and create a new one.
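As a sketch of that "either bump the ID or let GC run, but not both" idea (hypothetical, not the shipped logic; the real field lives under `replica.mu` and the function names here are illustrative):

```go
package main

import "errors"

var errStaleReplicaID = errors.New("replica awaiting GC; dropping message")

// replica is a toy stand-in; CockroachDB keeps the real ID under replica.mu.
type replica struct {
	replicaID int
}

// maybeSetReplicaID sketches the alternative discussed above: only an
// uninitialized replica (ID 0) may adopt an incoming ID. A live replica
// refuses a transition to a different nonzero ID, so the message is
// dropped and replica GC gets to remove the old incarnation first.
func (r *replica) maybeSetReplicaID(newID int) error {
	if r.replicaID == 0 || r.replicaID == newID {
		r.replicaID = newID
		return nil
	}
	return errStaleReplicaID
}

func main() {
	r := &replica{replicaID: 11}
	if err := r.maybeSetReplicaID(12); err != nil {
		println(err.Error()) // the 11 -> 12 bump is refused; GC runs instead
	}
}
```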
I think this might still be happening in 1.0, according to Sentry.
github.com/cockroachdb/cockroach/pkg/storage/raft.go in Panicf at line 109
Called from: github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/log.go in commitTo
github.com/cockroachdb/cockroach/pkg/storage/store.go in func4 at line 3104
github.com/cockroachdb/cockroach/pkg/storage/replica.go in withRaftGroupLocked at line 489
github.com/cockroachdb/cockroach/pkg/storage/replica.go in withRaftGroup at line 506
github.com/cockroachdb/cockroach/pkg/storage/store.go in processRaftRequest at line 3105
github.com/cockroachdb/cockroach/pkg/storage/store.go in processRequestQueue at line 3388
github.com/cockroachdb/cockroach/pkg/storage/scheduler.go in worker at line 228
github.com/cockroachdb/cockroach/pkg/storage/scheduler.go in func2 at line 168
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go in func1 at line 215
Ouch, that's a nasty one to see again. I assume we don't have any info about the parameters to the methods or the string printed by `Panicf`?
Correct.
@dt Is there a facility to whitelist some calls to panic so we can see the error instead of the type? I seem to recall you working on that.
@petermattis There is not in 1.0 -- I had initially added a wrapped error container for reporting, but realized my initial impl was overly broad in what it considered safe to report, so I switched it back to a more conservative type-and-location-only rule to get something we could safely ship in 1.0.
I'm chatting with @knz today about a broader strategy for error handling -- how to whitelist for reporting, for returning via pgwire vs. logging internally -- and will hopefully have a more concrete answer soon.
I just saw this on `denim` (`cockroach-denim-0001`) when trying to restart a node that had crashed about 24 hours ago due to an out-of-memory error. @a-robinson is there any information you'd like me to pull off the machine to help with this investigation? I can also see if this reproduces when restarting again, but I'll wait for your cue in case you don't want to risk disturbing something.
Thanks for the heads up, @nvanbenschoten! I'm in the process of saving the logs off all the machines and the relevant range state off node 1. Once that's done, I'd like to restart node 1 with more verbose logging enabled (`replica=5,store=5,raft=5,raft_transport=5,replica_gc_queue=5`). Should I ping you once I'm ready or just do it myself?
Nope, go ahead and restart on your own when you're ready.
Actually, I'm not sure that'll be needed. It looks like node 1's disk got wiped, and when it restarted with an empty data directory it received an internal RPC for a range that it didn't have, because it only had a brand new r1:
$ ./cockroach debug range-descriptors /mnt/data1/
1504294957.647360330,0 /Local/Range/Min/RangeDescriptor: [/Min, /Max)
Raw:r1:/M{in-ax} [(n1,s1):1, next=2]
In other words, the fact that we haven't yet fixed https://github.com/cockroachdb/cockroach/issues/15801 means that restarting a node with no join flags and a wiped data directory leads to these crashes.
I'm not sure what caused the disk to get wiped, though. Azure doesn't indicate that the VM has had any issues, and the machine thinks it's been running for months:
$ uptime
20:21:49 up 168 days, 6:00, 2 users, load average: 0.00, 0.00, 0.00
Someone must have manually wiped it, as far as I can tell.
@petermattis are you aware of `denim-1` being wiped in the last 24 hours? The OOM error certainly shouldn't have led to this.
Also, perhaps the Sentry issue that reopened this is also due to #15801. @a-robinson can we confirm/deny this?
I think you're seeing the difference between `/mnt/data1` and `/mnt/data1/cockroach`. The latter is used by `roachperf` and the former by `supervisor`. During my testing, the nodes on `denim` were started using `roachperf`, but it looks like someone (@nvanbenschoten) tried to start `denim-01` using `supervisorctl`.
PS I should fix `roachperf` to use `/mnt/data1`.
Yes, that's correct. I tried to start `denim-01` with `supervisorctl` without realizing the difference. I guess this was just a false alarm then (outside of the very real #15801).
> I think you're seeing the difference between /mnt/data1 and /mnt/data1/cockroach.

Yup, that looks correct. It's quite the blessing in disguise, given that it may explain the Sentry errors. Almost all of them happened on node 1, where this sort of mixup is most likely.
Well, I'll fix `roachperf` in any case.
The question now is whether we can actually fix this in 1.1. I'm thinking probably not. We clearly can't implement #18058 as a true fix for #15801 at this point, it'd be way too much new logic.
We could try hacky patches like disallowing certain types of raft messages to replicas that don't exist, but I wouldn't have much confidence that it wouldn't cause problems under normal operation.
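For illustration, the kind of message filter being floated might look something like this (a hypothetical sketch using etcd's `raftpb` message types, not something that shipped):

```go
package main

import (
	"fmt"

	"github.com/coreos/etcd/raft/raftpb"
)

// shouldCreateReplicaFor sketches the hacky patch discussed above: when no
// replica exists locally for the target range, only allow message types
// that legitimately bootstrap a new replica (e.g. snapshots), and drop the
// rest instead of materializing an empty replica that will panic.
func shouldCreateReplicaFor(msgType raftpb.MessageType) bool {
	switch msgType {
	case raftpb.MsgSnap:
		return true // a snapshot carries the state a new replica needs
	default:
		return false // drop heartbeats, appends, votes, etc.
	}
}

func main() {
	fmt.Println(shouldCreateReplicaFor(raftpb.MsgHeartbeat)) // false
	fmt.Println(shouldCreateReplicaFor(raftpb.MsgSnap))      // true
}
```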
And honestly, anything preventing a quick panic might only make things worse, since the longer the empty node is running, the more likely it is to mess up one of the real nodes.
So unless anyone has differing opinions, I'm going to bump this to 1.2. We definitely shouldn't push off #15801/#18058 again, though. I'll add an airtable entry.
We should get started on #18058 ASAP and consider cherry-picking it into 1.1.
I think #18058 is too risky for 1.1.
I'd say the root cause of this issue is that the "bootstrap if no join flag" behavior of 1.0 is error-prone and dangerous. In 1.1, we have the `init` command, so we should make sure that all our docs and sample configs use it. And we should probably get rid of the implicit initialization in 1.2 - the join flag (or maybe a new explicit `--init` flag) would become mandatory.
Update: Most of the work for this is done, but being able to pull out the log message from #18203 relies on handling gRPC reconnections, which is still being tracked by https://github.com/cockroachdb/cockroach/issues/20537
I was never able to reproduce this, but optimistically closing as fixed by #22518 (and the #18203 log message is gone)
Running at head (150b52daca1127372c6e925a164b9370586a570b), I was stressing `TestRemovePlaceholderRace` on an azworker to try to repro #12574 while looking into #14193. After 5606 runs over more than an hour, I hit a different (but possibly related) panic.

All I ran to kick off the testing was:

Full logs from the run: raft_panic.txt

@bdarnell @petermattis