We've seen this in the past. I need to refresh my memory here. Also see https://github.com/cockroachdb/cockroach/issues/9037 and https://github.com/cockroachdb/cockroach/issues/11591.
This error (`out of range [lastIndex(0)]`) is the same message as #11591 (fixed by #11699) and #12863 (still open). #9037 looks a little different. This could easily be the same underlying race as #14193. I don't have any particular insights right now; I'd need to page this stuff back in (I don't think any of this has changed recently, so the bug has probably been there a while).
This is the exact same bug as in #14193 (and #12574). The only difference from the explanation in https://github.com/cockroachdb/cockroach/issues/14193#issuecomment-287493273 is that rather than the replica without a valid key range going through replica GC and the panic happening in `removeReplicaImpl`, the replica is receiving some other raft message and attempting to process it, causing the panic to happen in raft because the `raft.RawNode` was created with a default (i.e. 0) last applied index.

And as @bdarnell pointed out, #11591 and #12863 appear to be the same bug, judging by the non-specified key range and the 0 `lastIndex`.
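For context, here's a minimal, self-contained sketch of the invariant that etcd/raft enforces here (toy types, a paraphrase of the check in `raft/log.go`, not CockroachDB's or etcd's actual code): a `RawNode` built from empty state has a last index of 0, so the first message carrying a nonzero commit index panics.

```go
package main

import "fmt"

// raftLog is a toy stand-in for etcd/raft's log type; only what's needed to
// illustrate the invariant is modeled here.
type raftLog struct {
	committed uint64
	entries   []uint64 // stand-in entries; len(entries) plays the role of lastIndex
}

func (l *raftLog) lastIndex() uint64 { return uint64(len(l.entries)) }

// commitTo mirrors the shape of the check in etcd's raft/log.go (paraphrased):
// the commit index may never move past lastIndex. A RawNode created from
// empty storage has lastIndex() == 0, so the first heartbeat carrying a
// nonzero commit index trips the panic seen in this issue.
func (l *raftLog) commitTo(tocommit uint64) {
	if l.committed < tocommit {
		if l.lastIndex() < tocommit {
			panic(fmt.Sprintf("tocommit(%d) is out of range [lastIndex(%d)]",
				tocommit, l.lastIndex()))
		}
		l.committed = tocommit
	}
}

func main() {
	l := &raftLog{} // a replica created with no on-disk state
	l.commitTo(7)   // panics: tocommit(7) is out of range [lastIndex(0)]
}
```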
Closing as a dupe.
The above analysis was from reading the code, but I got a fresh failure of this with `--verbosity=3`, and while it's still essentially the same problem as #14193, the series of steps involved is slightly different:
- An `ADD_REPLICA` operation for replica 11
- ... `Replica` object being destroyed, i.e. the pointer address stayed the same)
- A `MsgHeartbeat` caused a new replica to be created for the range (with ID 11), and the processing of that heartbeat is when the panic happened

It's worth noting that this is still reproducible even with #14306, so there's still more understanding to be done, as mentioned in https://github.com/cockroachdb/cockroach/issues/14193#issuecomment-288199359
Alright, I have this tracked down as well.
- After `store.removeReplicaImpl` has been run on replica 11, the replica gets added back into the replica GC queue due to a goroutine in `replica.handleReplicatedEvalResult` that was handling the raft command for removing replica 11.
- `store.removeReplicaImpl` destroys the old replica. Note that it uses the "consistent" descriptor passed down to it by the replica GC queue for writing the replica tombstone. This means that it used the old view of the range that thought 12 was the next replica ID when in reality it had already been operating as replica 12.
- A `MsgHeartbeat` comes in for replica 12 because it hasn't yet been removed from the raft group.

Note that step 7 isn't actually relevant, I just forgot to scrub it before posting this.
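To make the tombstone hazard concrete, here's a toy sketch (hypothetical names, not the real `store` API) of why a tombstone written from the stale "consistent" descriptor fails to block traffic addressed to replica 12:

```go
package main

import "fmt"

// tombstone is a toy model of the replica tombstone written by replica GC;
// the field name mirrors the idea, not CockroachDB's actual type.
type tombstone struct {
	nextReplicaID int
}

// shouldDropMessage models the check a tombstone enables: incoming raft
// messages for replica IDs below nextReplicaID are known to be stale.
func shouldDropMessage(t tombstone, msgReplicaID int) bool {
	return msgReplicaID < t.nextReplicaID
}

func main() {
	// The GC queue's stale "consistent" descriptor still said next=12,
	// even though the store had already been operating as replica 12.
	t := tombstone{nextReplicaID: 12}
	fmt.Println(shouldDropMessage(t, 11)) // true: replica 11 traffic is blocked
	fmt.Println(shouldDropMessage(t, 12)) // false: replica 12 traffic gets through
}
```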
> The raft leader sends a MsgApp to replica ID 12. At this point replica 11 is still around in the store, so it gets its ID incremented before processing the MsgApp and responding.
Can you point to the code location where this happens? It seems Wrong™ to set the replica ID of any replica that already has a non-zero ID.
(mostly copied from https://github.com/cockroachdb/cockroach/issues/14193#issuecomment-288199359)
`raftTransport.RaftMessageBatch` -> `raftTransport.handleRaftRequest` -> `store.HandleRaftRequest` (-> `store.uncoalesceBeats`) -> `store.HandleRaftUncoalescedRequest` (`scheduler.EnqueueRaftRequest` -> `scheduler.worker` -> `store.processRaftRequest` if `respStream` is non-nil) -> `store.processRaftRequest` -> `store.getOrCreateReplica` -> `store.tryGetOrCreateReplica` -> `replica.SetReplicaID`, with the logic for changing the ID living in `replica.setReplicaIDLocked`.

As covered in some of the discussion in #14193, it's considered valid for a replica ID to be increased. It seems unintuitive to me too, but it's currently meant to be allowed.
Yeah, it's odd (but working as originally intended) that a replica can transition directly between two non-zero IDs. The original reason for this is that it took a long time for replicas to get GC'd (before the introduction of raft responses). Now that we have them (along with other kinds of eager GC), maybe we should do something else with these messages (just drop them?) instead of allowing them to increase the replica ID of a live replica.
It'd probably be ideal to synchronize the ID change with replica GC somehow -- either we bump the replica ID up OR we let replica GC take place, but not both. Doing the GC after the replica ID increase forces the raft group to have to remove the member with the newly GC'ed ID and create a new one.
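As a sketch of that "either bump the ID or let GC run, but not both" idea (hypothetical, not the shipped logic; the real field lives under `replica.mu` and the function names here are illustrative):

```go
package main

import "errors"

var errStaleReplicaID = errors.New("replica awaiting GC; dropping message")

// replica is a toy stand-in; CockroachDB keeps the real ID under replica.mu.
type replica struct {
	replicaID int
}

// maybeSetReplicaID sketches the alternative discussed above: only an
// uninitialized replica (ID 0) may adopt an incoming ID. A live replica
// refuses a transition to a different nonzero ID, so the message is
// dropped and replica GC gets to remove the old incarnation first.
func (r *replica) maybeSetReplicaID(newID int) error {
	if r.replicaID == 0 || r.replicaID == newID {
		r.replicaID = newID
		return nil
	}
	return errStaleReplicaID
}

func main() {
	r := &replica{replicaID: 11}
	if err := r.maybeSetReplicaID(12); err != nil {
		println(err.Error()) // the 11 -> 12 bump is refused; GC runs instead
	}
}
```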
I think this might still be happening in 1.0, according to Sentry.
github.com/cockroachdb/cockroach/pkg/storage/raft.go in Panicf at line 109
Called from: github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/log.go in commitTo
github.com/cockroachdb/cockroach/pkg/storage/store.go in func4 at line 3104
github.com/cockroachdb/cockroach/pkg/storage/replica.go in withRaftGroupLocked at line 489
github.com/cockroachdb/cockroach/pkg/storage/replica.go in withRaftGroup at line 506
github.com/cockroachdb/cockroach/pkg/storage/store.go in processRaftRequest at line 3105
github.com/cockroachdb/cockroach/pkg/storage/store.go in processRequestQueue at line 3388
github.com/cockroachdb/cockroach/pkg/storage/scheduler.go in worker at line 228
github.com/cockroachdb/cockroach/pkg/storage/scheduler.go in func2 at line 168
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go in func1 at line 215
Ouch, that's a nasty one to see again. I assume we don't have any info about the parameters to the methods or the string printed by `Panicf`?
Correct.
@dt Is there a facility to whitelist some calls to panic so we can see the error instead of the type? I seem to recall you working on that.
@petermattis There is not in 1.0 -- I had initially added a wrapped error container for reporting, but realized my initial impl was overly broad in what it considered safe to report, so I switched it back to a more conservative type-and-location-only rule to get something we could safely ship in 1.0.
I'm chatting with @knz today about a broader strategy for error handling -- how to whitelist for reporting, for returning via pgwire vs. logging internally -- and will hopefully have a more concrete answer soon.
I just saw this on `denim` (`cockroach-denim-0001`) when trying to restart a node that had crashed about 24 hours ago due to an out-of-memory error. @a-robinson is there any information you'd like me to pull off the machine to help with this investigation? I can also see if this reproduces when restarting again, but I'll wait for your cue in case you don't want to risk disturbing something.
Thanks for the heads up, @nvanbenschoten! I'm in the process of saving the logs off all the machines and the relevant range state off node 1. Once that's done, I'd like to restart node 1 with more verbose logging enabled (`replica=5,store=5,raft=5,raft_transport=5,replica_gc_queue=5`). Should I ping you once I'm ready or just do it myself?
Nope, go ahead and restart on your own when you're ready.
Actually, I'm not sure that'll be needed. It looks like node 1's disk got wiped, and when it restarted with an empty data directory it received an internal RPC for a range that it didn't have, because it only had a brand new r1:
$ ./cockroach debug range-descriptors /mnt/data1/
1504294957.647360330,0 /Local/Range/Min/RangeDescriptor: [/Min, /Max)
Raw:r1:/M{in-ax} [(n1,s1):1, next=2]
In other words, the fact that we haven't yet fixed https://github.com/cockroachdb/cockroach/issues/15801 means that restarting a node with no join flags and a wiped data directory leads to these crashes.
I'm not sure what caused the disk to get wiped, though. Azure doesn't indicate that the VM has had any issues, and the machine thinks it's been running for months:
$ uptime
20:21:49 up 168 days, 6:00, 2 users, load average: 0.00, 0.00, 0.00
Someone must have manually wiped it, as far as I can tell.
@petermattis are you aware of `denim-1` being wiped in the last 24 hours? The OOM error certainly shouldn't have led to this.
Also, perhaps the Sentry issue that reopened this is also due to #15801. @a-robinson can we confirm/deny this?
I think you're seeing the difference between `/mnt/data1` and `/mnt/data1/cockroach`. The latter is used by `roachperf` and the former by `supervisor`. During my testing, the nodes on `denim` were started using `roachperf`, but it looks like someone (@nvanbenschoten) tried to start `denim-01` using `supervisorctl`.
PS I should fix `roachperf` to use `/mnt/data1`.
Yes, that's correct. I tried to start `denim-01` with `supervisorctl` without realizing the difference. I guess this was just a false alarm then (outside of the very real #15801).
> I think you're seeing the difference between /mnt/data1 and /mnt/data1/cockroach.

Yup, that looks correct. It's quite the blessing in disguise, given that it may explain the Sentry errors. Almost all of them happened on node 1, where this sort of mixup is most likely.
Well, I'll fix `roachperf` in any case.
The question now is whether we can actually fix this in 1.1. I'm thinking probably not. We clearly can't implement #18058 as a true fix for #15801 at this point, it'd be way too much new logic.
We could try hacky patches like disallowing certain types of raft messages to replicas that don't exist, but I wouldn't have much confidence that it wouldn't cause problems under normal operation.
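For illustration, the kind of message filter being floated might look something like this (a hypothetical sketch using etcd's `raftpb` message types, not something that shipped):

```go
package main

import (
	"fmt"

	"github.com/coreos/etcd/raft/raftpb"
)

// shouldCreateReplicaFor sketches the hacky patch discussed above: when no
// replica exists locally for the target range, only allow message types
// that legitimately bootstrap a new replica (e.g. snapshots), and drop the
// rest instead of materializing an empty replica that will panic.
func shouldCreateReplicaFor(msgType raftpb.MessageType) bool {
	switch msgType {
	case raftpb.MsgSnap:
		return true // a snapshot carries the state a new replica needs
	default:
		return false // drop heartbeats, appends, votes, etc.
	}
}

func main() {
	fmt.Println(shouldCreateReplicaFor(raftpb.MsgHeartbeat)) // false
	fmt.Println(shouldCreateReplicaFor(raftpb.MsgSnap))      // true
}
```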
And honestly, anything preventing a quick panic might only make things worse, since the longer the empty node is running, the more likely it is to mess up one of the real nodes.
So unless anyone has differing opinions, I'm going to bump this to 1.2. We definitely shouldn't push off #15801/#18058 again, though. I'll add an airtable entry.
We should get started on #18058 ASAP and consider cherry-picking it into 1.1.
I think #18058 is too risky for 1.1.
I'd say the root cause of this issue is that the "bootstrap if no join flag" behavior of 1.0 is error-prone and dangerous. In 1.1, we have the `init` command, so we should make sure that all our docs and sample configs use it. And we should probably get rid of the implicit initialization in 1.2 - the join flag (or maybe a new explicit `--init` flag) would become mandatory.
Update: Most of the work for this is done, but being able to pull out the log message from #18203 relies on handling gRPC reconnections, which is still being tracked by https://github.com/cockroachdb/cockroach/issues/20537
I was never able to reproduce this, but optimistically closing as fixed by #22518 (and the #18203 log message is gone)
Running at head (150b52daca1127372c6e925a164b9370586a570b), I was stressing `TestRemovePlaceholderRace` on an azworker to try to repro #12574 while looking into #14193. After 5606 runs over more than an hour, I hit a different (but possibly related) panic.

All I ran to kick off the testing was:

Full logs from the run: raft_panic.txt

@bdarnell @petermattis