
storage: non-zero HardState.Commit on uninitialized replica #40213

Closed: tbg closed this issue 5 years ago

tbg commented 5 years ago
cockroach debug merge-logs *.logs/*.log | grep -F r3157
ip-10-12-37-237> I190822 21:27:54.723498 883796922 storage/store.go:2593  [n5,replicaGC,s5,r3157/2:/Table/57/1/9{2/99964-3/25360}] removing replica r3157/2
ip-10-12-37-237> I190822 21:27:54.724112 883796922 storage/replica_destroy.go:152  [n5,replicaGC,s5,r3157/2:/Table/57/1/9{2/99964-3/25360}] removed 25401 (25396+5) keys in 1ms [clear=0ms commit=0ms]
ip-10-12-37-237> F190823 18:30:39.636004 297 storage/replica_raft.go:668  [n5,s5,r3157/5:{-}] setting non-zero HardState.Commit on uninitialized replica [n5,s5,r3157/5:{-}]. HS={Term:7 Vote:0 Commit:66957 XXX_unrecognized:[]}

Seen on the release qualification cluster for https://github.com/cockroachdb/cockroach/issues/39781

Also seen in unit tests: https://github.com/cockroachdb/cockroach/issues/39660#issue-480716022

tbg commented 5 years ago

Just got a repro from my gceworker running

make stressrace PKG=./pkg/storage/ TESTS=TestNodeLivenessSetDecommissioning

(this test has other failure modes too, see https://github.com/cockroachdb/cockroach/issues/40191 https://github.com/cockroachdb/cockroach/issues/39660#issue-480716022)

tbg commented 5 years ago

Here's said repro restricted to r6

W190826 16:00:19.700267 265 storage/store.go:1546  [s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W190826 16:00:19.758138 265 storage/store.go:1546  [s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
E190826 16:00:22.235320 1383 storage/replica_range_lease.go:339  [s1,r6/1:/Table/{SystemCon…-11}] heartbeat failed on epoch increment
W190826 16:00:22.831772 2082 storage/store.go:1546  [s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
W190826 16:00:22.884958 2082 storage/store.go:1546  [s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I190826 16:00:24.143661 2132 storage/replica_command.go:1264  [replicate,s1,r6/1:/Table/{SystemCon…-11}] change replicas (add [(n2,s2):2LEARNER] remove []): existing descriptor r6:/Table/{SystemConfigSpan/Start-11} [(n1,s1):1, next=2, gen=0]
I190826 16:00:24.219212 2132 storage/replica_raft.go:291  [replicate,s1,r6/1:/Table/{SystemCon…-11},txn=5a84301f] proposing ADD_REPLICA[(n2,s2):2LEARNER]: after=[(n1,s1):1 (n2,s2):2LEARNER] next=3
I190826 16:00:24.346032 2132 storage/store_snapshot.go:995  [replicate,s1,r6/1:/Table/{SystemCon…-11}] sending LEARNER snapshot 03b9cc5e at applied index 16
I190826 16:00:24.357993 2132 storage/store_snapshot.go:1038  [replicate,s1,r6/1:/Table/{SystemCon…-11}] streamed snapshot to (n2,s2):2: kv pairs: 44, log entries: 0, rate-limit: 8.0 MiB/sec, 0.09s
I190826 16:00:24.452298 2865 storage/replica_raftstorage.go:808  [s2,r6/2:{-}] applying LEARNER snapshot [id=03b9cc5e index=16]
I190826 16:00:24.500233 2865 storage/replica_raftstorage.go:829  [s2,r6/2:/Table/{SystemCon…-11}] applied LEARNER snapshot [total=46ms ingestion=4@1ms id=03b9cc5e index=16]
I190826 16:00:24.645917 2132 storage/replica_command.go:1264  [replicate,s1,r6/1:/Table/{SystemCon…-11}] change replicas (add [(n2,s2):2] remove []): existing descriptor r6:/Table/{SystemConfigSpan/Start-11} [(n1,s1):1, (n2,s2):2LEARNER, next=3, gen=1]
F190826 16:00:24.818729 2886 storage/replica_raft.go:668  [s2,r6/2:{-}] setting non-zero HardState.Commit on uninitialized replica [n2,s2,r6/2:{-}]. HS={Term:6 Vote:0 Commit:18 XXX_unrecognized:[]}

It looks like we're applying a snapshot and then hitting the fatal, perhaps even in the same ready cycle. What's weird is that after a snapshot the replica is supposed to be initialized, but it isn't: the log tags show this (no key range), and we also don't hit the fatal if the replica is initialized.
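
For reference, the check being tripped amounts to roughly the following (a minimal sketch with assumed names, not the exact code at replica_raft.go:668):

// Sketch: an uninitialized replica has no log and no applied state, so a
// Ready must never ask us to persist a HardState claiming committed entries.
if !raft.IsEmptyHardState(rd.HardState) && !r.IsInitialized() && rd.HardState.Commit != 0 {
    log.Fatalf(ctx, "setting non-zero HardState.Commit on uninitialized replica %v. HS=%+v",
        r, rd.HardState)
}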

tbg commented 5 years ago

It's worth mentioning that in the release cluster crash (first message), no such snapshot is seen just before the crash. The logs have been rotated heavily on the other nodes but n5 (the crashing one) has logs from four days ago and it would've logged any incoming snapshots for that range. Possibly it was being created through a split, but then it's even weirder that it would be uninitialized.

tbg commented 5 years ago

I have a crash with vmodule=raft=5 (few minutes of running stressrace): https://gist.github.com/tbg/892014e57dde6d17debd0833d3053f3d

The real problem seems to 100% be related to the snapshot. We see r6/2 apply the snapshot:

I190826 16:11:12.325953 1910 storage/replica_raftstorage.go:829  [s2,r6/2:/Table/{SystemCon…-11}] applied LEARNER snapshot [total=19ms ingestion=4@18ms id=efbc68d5 index=17]

Note how it has key bounds in the log tags. We see this emit an MsgAppResp to the leader, as it should.

Then when the next raft message comes in, we see that the raft group is getting created again (I'm not sure why) and the replica is descriptorless (=uninited) again. The leader sends an MsgApp and that triggers the crash.

My working day is coming to a close and I'd like to focus on atomic replication changes anyway, so I'd like to hand this off at this point.

nvanbenschoten commented 5 years ago

I'm also seeing the raft group being created twice. Interestingly, what I see is that two different *Store objects handle the requests, and each starts with an empty s.mu.replicas syncutil.IntMap.

nvanbenschoten commented 5 years ago

It looks like the Store is being stopped and then started again, so I guess we would expect this. Interestingly, the logs make it look like we're not done applying the LEARNER snapshot until after the new Store has started back up.

nvanbenschoten commented 5 years ago

And we see the range descriptor iteration for the next epoch of the Store run before the previous epoch applies the learner snapshot. Are we not waiting for the snapshot to finish before allowing the Store to drain?

nvanbenschoten commented 5 years ago

I'm guessing we're hitting this because of: https://github.com/cockroachdb/cockroach/blob/d0f20032a1adcef42e6fed843b40286b48a1d1b5/pkg/storage/node_liveness_test.go#L924-L927

nvanbenschoten commented 5 years ago

Yep, we apply the LEARNER snapshot on the original Store after calling restartStore. So I guess what's happening is that the original Store is shut down, but doesn't exit immediately. The new Store starts up and finds no RangeDescriptor for r6, the original Store applies the LEARNER snapshot, and then the new Store creates its Raft group in such a way that it is able to set a hard-state.

Unfortunately, this all looks like a testing bug and doesn't really explain what we saw in #39781.

nvanbenschoten commented 5 years ago

Ah ok, I think I see what's going wrong here. multiTestContext gives the RaftTransport its own Stopper, and RaftTransport.RaftSnapshot runs its snapshot in an AsyncTask from this stopper. It never consults the Stopper given to the Store itself. So this allows us to leak the snapshot task even after we have shut down the first version of the Store.

This diff fixes the test flake:

diff --git a/pkg/storage/store.go b/pkg/storage/store.go
index 3066b23..cbc1cd2 100644
--- a/pkg/storage/store.go
+++ b/pkg/storage/store.go
@@ -3228,17 +3228,19 @@ func (s *Store) maybeWaitForPushee(
 func (s *Store) HandleSnapshot(
        header *SnapshotRequest_Header, stream SnapshotResponseStream,
 ) error {
-       s.metrics.raftRcvdMessages[raftpb.MsgSnap].Inc(1)
+       ctx := s.AnnotateCtx(stream.Context())
+       return s.stopper.RunTaskWithErr(ctx, "Store.HandleSnapshot", func(ctx context.Context) error {
+               s.metrics.raftRcvdMessages[raftpb.MsgSnap].Inc(1)

-       if s.IsDraining() {
-               return stream.Send(&SnapshotResponse{
-                       Status:  SnapshotResponse_DECLINED,
-                       Message: storeDrainingMsg,
-               })
-       }
+               if s.IsDraining() {
+                       return stream.Send(&SnapshotResponse{
+                               Status:  SnapshotResponse_DECLINED,
+                               Message: storeDrainingMsg,
+                       })
+               }

-       ctx := s.AnnotateCtx(stream.Context())
-       return s.receiveSnapshot(ctx, header, stream)
+               return s.receiveSnapshot(ctx, header, stream)
+       })
 }

 func (s *Store) uncoalesceBeats(

If the stopper in the RaftTransport doesn't tell us anything about the Stopper in the Stores that it delegates to, then we might need to add similar code in its other handler functions as well.

I don't know enough about what changed between pre-emptive snapshot handling and learner snapshot handling to be able to pinpoint why this race only now materialized. Perhaps we never wrote any Raft state during pre-emptive snapshots? @tbg or @danhhz do you have any idea? This also doesn't really help us track down the crash in #39781, although perhaps the symptoms here might be helpful.

nvanbenschoten commented 5 years ago

If the stopper in the RaftTransport doesn't tell us anything about the Stopper in the Stores that it delegates to, then we might need to add similar code in its other handler functions as well.

I think this is ok in other places because they all communicate through the raftScheduler, which internally uses a stopper to avoid task leaks.
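
For what it's worth, the scheduler pattern looks roughly like this (a sketch with assumed names, not the actual raftScheduler code): its worker goroutines are started through the Store's Stopper, so raft processing stops as soon as that Stopper quiesces and enqueued work can't outlive the Store.

// Sketch: workers tied to the Store's stopper exit on quiescence, so raft
// work funneled through the scheduler cannot run against a stopped Store.
func (s *raftScheduler) Start(stopper *stop.Stopper) {
    for i := 0; i < s.numWorkers; i++ {
        stopper.RunWorker(context.Background(), func(ctx context.Context) {
            s.worker(ctx) // returns once the stopper signals quiescence
        })
    }
}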

tbg commented 5 years ago

Thanks for investigating; shame that this isn't the same thing. Maybe you can go back to the original logs and look at the raft state for the given node. The interesting thing there is that n5 had a replica of that range a long time ago. The problem is that a commit index is sent when we don't have a log yet, and thus this is borked (so is there a log? Why is there no snapshot message? Where does this incoming request come from?). What if the commit index comes from a quiesce heartbeat? I think we take that index from the leader's Progress. Since the peer has a new ReplicaID after being re-added, I don't see how a previous commit index could leak past the tombstone, but maybe looking at the code can reveal a way in which this can happen? Though I just looked and didn't see a way. Either way, the fact that n5 itself has nothing about that range for a while and then suddenly explodes means we're probably not looking at snapshots here.

tbg commented 5 years ago

The other option is that maybe the replicaGC was an error? If rx/5 was in fact a member all along (even though n5 only had rx/2) and replicaGC decided by accident to remove that data, then once the range unquiesced this error could get triggered, since the leaseholder would still have an old progress.

I was also looking into whether progress.Match (which is used by the quiesce heartbeat to lower the commit index to something "safe" for the follower) could perhaps exceed what the follower has actually acked. So far it seems that it's not doing that.

tbg commented 5 years ago

I don't know enough about what changed between pre-emptive snapshot handling and learner snapshot handling to be able to pinpoint why this race only now materialized. Perhaps we never wrote any Raft state during pre-emptive snapshots? @tbg or @danhhz do you have any idea? This also doesn't really help us track down the crash in #39781, although perhaps the symptoms here might be helpful.

The learner snapshot generates an immediate MsgAppResp, so the leader is immediately able to append. I think in the olden days the preemptive snapshot wouldn't trigger such a thing and so the first communication would be figuring out what log index to append at, so this probably prevented the race "well enough". The other ingredient here that doesn't seem necessary but is interesting is that with learners the leader gets to send "insta-commit appends" because the quorum is still 1 and so a command appended at the leader is immediately committed and can be sent "as committed" in one MsgApp.

nvanbenschoten commented 5 years ago

I took a peek at the Raft state on the crashed node (n5) as well as that on the other nodes. First: here are the only logs about this range on n5:

I190822 21:27:54.723498 883796922 storage/store.go:2593  [n5,replicaGC,s5,r3157/2:/Table/57/1/9{2/99964-3/25360}] removing replica r3157/2
I190822 21:27:54.724112 883796922 storage/replica_destroy.go:152  [n5,replicaGC,s5,r3157/2:/Table/57/1/9{2/99964-3/25360}] removed 25401 (25396+5) keys in 1ms [clear=0ms commit=0ms]
F190823 18:30:39.636004 297 storage/replica_raft.go:668  [n5,s5,r3157/5:{-}] setting non-zero HardState.Commit on uninitialized replica [n5,s5,r3157/5:{-}]. HS={Term:7 Vote:0 Commit:66957 XXX_unrecognized:[]}

As Tobi pointed out earlier, a replica with ID 2 was removed from the node on 190822 21:27:54. The crash then came on a replica of the same range with ID 5 on 190823 18:30:39.

On n5, we see the following state for range 3157:

The next_replica_id in the Raft tombstone key explains how the request got through the tombstone gate. The next question is why a request was sent to the Range in the first place.
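
The "tombstone gate" is, roughly, the following check (a simplified sketch with assumed names, not the exact code): the tombstone written when r3157/2 was GC'ed records the next replica ID allowed on this store, so a message addressed to r3157/5 passes once that next_replica_id is <= 5.

// Sketch: raft messages addressed to replica IDs below the tombstone's
// NextReplicaID are dropped; anything at or above it may (re)create a replica.
if msg.ToReplica.ReplicaID < tombstone.NextReplicaID {
    return &roachpb.RaftGroupDeletedError{} // stale replica ID, message dropped
}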

Interestingly, we see that the other members of the Range have log indexes around 285758, so they're well past n5. We also see that they all share the same Raft HardState: term:7 vote:3 commit:285758.

Notice that they all have a vote for 3, meaning that replica 3 is the leader of term 7. Meanwhile, the log from the crashing node shows that it was trying to write the HardState term:7 vote:0 commit:66957. I think that means that we hit this line of code, which indicates that the message being received was neither a MsgApp, a MsgHeartbeat, nor a MsgSnap.

My guess is that it was a MsgVote. Specifically, here's my guess as to what happened:

  1. n5/r2 was removed from the Range
  2. n5/r5 was added to the Range as a learner
  3. A snapshot was sent to the Range but it never made it there (or the following steps just raced with it)
  4. The other replicas in the Range held an election and sent n5/r5 a MsgVote before it had accepted its snapshot which would install the local Range descriptor
  5. n5 blew up because it didn't have a local Range descriptor

Allowing learners to vote and sending them MsgVote requests is new as of https://github.com/etcd-io/etcd/commit/c30c2e345b55c7166fefe781c2a7a68a35020082 (vendored in https://github.com/cockroachdb/vendored/commit/f5b60bb09622aef34e2924891a68e04d19436974), so the timing of this recent instability lines up.

danhhz commented 5 years ago

A learner without a local range descriptor definitely needs to be able to vote (and persist that vote). Tobi is the expert here, but I was under the impression that worked already. We've always had to allow replicas without a local range descriptor to vote (right? what happens if you lose a preemptive snap and then have to vote before you apply the conf change that added you). Do you know what would be different if that replica was a learner?

tbg commented 5 years ago

@nvanbenschoten Remind me how this all works - a Raft tombstone would be written atomically with destroying all of the state of the replica, including the raft log, raft applied index, etc, no? And we'll only ever accept log entries after having received a snapshot, which also includes a range descriptor. So how does n5 end up with a range for which it has some stuff including a HardState but no descriptor, but also a raft tombstone? Is this maybe an old bug that is only exposed now because learners get to vote and hit the assertion when they do? If that replica is a fresh learner, I'd expect it to not have any state except a raft tombstone (optionally), and perhaps a HardState, but with a zero Committed index. Hope I'm missing something big here.

tbg commented 5 years ago

By looking at the range log for that range, you should be able to figure out if that range was added as a learner. What makes me hesitant to declare this obvious is the lack of any snapshot related messages in n5's log, though maybe you're right and they raced and were thus never observed.

nvanbenschoten commented 5 years ago

A learner without a local range descriptor definitely needs to be able to vote (and persist that vote). Tobi is the expert here, but I was under the impression that worked already. We've always had to allow replicas without a local range descriptor to vote (right? what happens if you lose a preemptive snap and then have to vote before you apply the conf change that added you). Do you know what would be different if that replica was a learner?

I'm not confident on any of this, but given this assertion, I don't see how it was ever our intention to let replicas vote before they had local range descriptors.

a Raft tombstone would be written atomically with destroying all of the state of the replica, including the raft log, raft applied index, etc, no? And we'll only ever accept log entries after having received a snapshot, which also includes a range descriptor.

That all lines up with my expectation, and yet the following entry in the Raft log of n5 indicates that its logs were appended by n5/2 and not by n5/5:

0.000000000,0 /Local/RangeID/3157/u/RaftLog/logIndex:66944 (0x0169f70c55757266746c000000000001058000): Term:6 Index:66944  by lease #3
proposer_replica:<node_id:6 store_id:6 replica_id:1 type:VOTER > proposer_lease_sequence:3 max_lease_index:43858 replicated_eval_result:<timestamp:<wall_time:1566509243096461618 > delta:<sys_bytes:171 > > write_batch:<data:"\000\000\000\000\000\000\000\000\002\000\000\000\001\021\001k\022\301\211\344\370\001\206|\000\001rdsc\000X\nB\n\020\366\221\222 \002\265KY\227\233\337M\r\274J\366\032\020\001k\022\301\211\344\370\001\206|\000\001rdsc*\n\010\262\342\205\341\265\373\326\336\0250\247\312\0138\001J\n\010\262\342\205\341\265\373\326\336\025\022\014\010\262\342\205\341\265\373\326\336\025\020\000\030\000 \014(G\001\032\001k\022\301\211\344\370\001\206|\000\001rdsc\000\025\275[\333\\!q2\tG. ./\003\010\325\030\022\007\301\211\344\370\001\206|\032\006\301\211\345\367c\020\"\010\010\006\020\006\030\001 \000\"\010\010\005\020\005\030\002 \000\"\010\010\003\020\003\030\003 \000\"\010\010\002\020\002\030\004 \001(\0050v@\001" >
write batch:
Put: 0.000000000,0 /Local/Range/Table/57/1/92/99964/RangeDescriptor (0x016b12c189e4f801867c00017264736300): 1566509243.096461618,0 {Txn:id:f6919220-02b5-4b59-979b-df4d0dbc4af6 key:"\001k\022\301\211\344\370\001\206|\000\001rdsc" timestamp:<wall_time:1566509243096461618 > min_timestamp:<wall_time:1566509243096461618 > priority:189735 sequence:1  Timestamp:1566509243.096461618,0 Deleted:false KeyBytes:12 ValBytes:71 RawBytes:[] IntentHistory:[] MergeTimestamp:<nil>}
Put: 1566509243.096461618,0 /Local/Range/Table/57/1/92/99964/RangeDescriptor (0x016b12c189e4f801867c0001726473630015bd5bdb5c21713209): [/Table/57/1/92/99964, /Table/57/1/93/25360)
    Raw:r3157:/Table/57/1/9{2/99964-3/25360} [(n6,s6):1, (n5,s5):2, (n3,s3):3, (n2,s2):4LEARNER, next=5, gen=118]

This would indicate n5/5 never accepted any entries and that the only question is why this state wasn't blown away when n5/2 was destroyed and replaced by a range tombstone. I don't have an answer to this, but I'm also not sure it's relevant.

Fundamentally, it seems like we don't support learners voting before they have been sent a snapshot, but we also don't prevent this as far as I can tell. This would lead to these symptoms, right?

tbg commented 5 years ago

Sorry, I should've pointed this out more clearly above: I don't think anything is wrong here with that vote. What's wrong is the nonzero commit index. etcd/raft has the invariant that the leader will never send a Commit index above what the follower has acked in its log. For example, if the leader knows 100 is committed but it only knows that the follower has entry 90, it will make sure to only send Commit: 90.
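
In etcd/raft terms, the clamping on the leader's heartbeat path looks roughly like this (a paraphrased sketch; min is raft's internal uint64 helper, pr is the follower's Progress):

// Sketch: the Commit carried by a heartbeat is capped at Progress.Match, i.e.
// the highest log index the leader knows the follower has durably appended.
commit := min(pr.Match, r.raftLog.committed)
r.send(pb.Message{To: to, Type: pb.MsgHeartbeat, Commit: commit})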

I think what you're seeing is that the MsgVote contained a higher Commit index. Could that be the problem?

nvanbenschoten commented 5 years ago

I think what you're seeing is that the MsgVote contained a higher Commit index. Could that be the problem?

But the commit index in the hard state on n5 is exactly the index of the last entry in its log (which itself is still a mystery).

tbg commented 5 years ago

Oh, they match? I missed that. I think that is good news. Let's say it was indeed an MsgVote that triggered the crash (PS, MsgVote never has Commit set, I checked). The raft group was likely not initialized, due to the previous replicaGC (heck, nothing was initialized because the replica was supposedly completely erased). The MsgVote comes in and we create a replica and its raft group. I assume the RawNode first emits the current HardState as an update, even though it's already persisted (have to check). And boom we go, because now we see a non-zero Commit even though we have no state, which isn't supposed to happen. The problem then is that the replicaGC missed some data, not that the vote arrived. Any message would've done it, with the exception of MsgSnap perhaps.

tbg commented 5 years ago

Ah, the MsgVote did update the HardState -- via a new Term. So that explains why it's coming out. But the commit index comes from the raft log entries, which were not supposed to be there.
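
Roughly speaking (a paraphrase of how etcd/raft builds the HardState it hands back in a Ready), the emitted HardState always carries the node's current commit index, so the leftover on-disk log state leaks back out alongside the new Term and Vote:

// Sketch: Commit comes from the raft log, which here was reconstructed from
// entries that were supposed to have been removed by the replicaGC.
hs := pb.HardState{Term: r.Term, Vote: r.Vote, Commit: r.raftLog.committed}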

nvanbenschoten commented 5 years ago

Ah ok, I missed that we only hit this fatal if !raft.IsEmptyHardState(rd.HardState) and !r.IsInitialized() and rd.HardState.Commit != 0. Ok then I think we're on the same page. The question is why this HardState is still here.

tbg commented 5 years ago

The question is why this HardState is still here.

Yep, and log entries.

I wonder if we'd see this crash frequently if we added an assertion when the group is instantiated. A replica in that state seems very likely to catch a crash when it becomes part of the range again, so I'm not sure we would.

nvanbenschoten commented 5 years ago

Here's a snippet from the Range's rangelog. Notice that n5/2 is removed at 2019-08-22 21:27:54.669235, n5/5 is added at 2019-08-23 18:30:39.152666, and n5/5 is removed at 2019-08-23 18:30:40.522182 (about a second later):

[
   {
      "Timestamp":"2019-08-22 21:27:54.669235+00:00",
      "UpdatedDesc":{
         "range_id":3157,
         "start_key":"wYnk+AGGfA==",
         "end_key":"wYnl92MQ",
         "internal_replicas":[
            {
               "node_id":6,
               "store_id":6,
               "replica_id":1,
               "type":0
            },
            {
               "node_id":2,
               "store_id":2,
               "replica_id":4,
               "type":0
            },
            {
               "node_id":3,
               "store_id":3,
               "replica_id":3,
               "type":0
            }
         ],
         "next_replica_id":5,
         "generation":120,
         "generation_comparable":true
      },
      "RemovedReplica":{
         "node_id":5,
         "store_id":5,
         "replica_id":2,
         "type":0
      },
      "Reason":"range over-replicated",
      "Details":"{\"Target\":\"s5, converges:0, balance:0, rangeCount:3913\"}"
   },
   {
      "Timestamp":"2019-08-23 18:30:39.152666+00:00",
      "UpdatedDesc":{
         "range_id":3157,
         "start_key":"wYnk+AGGfA==",
         "end_key":"wYnl92MQ",
         "internal_replicas":[
            {
               "node_id":6,
               "store_id":6,
               "replica_id":1,
               "type":0
            },
            {
               "node_id":2,
               "store_id":2,
               "replica_id":4,
               "type":0
            },
            {
               "node_id":3,
               "store_id":3,
               "replica_id":3,
               "type":0
            },
            {
               "node_id":5,
               "store_id":5,
               "replica_id":5,
               "type":1
            }
         ],
         "next_replica_id":6,
         "generation":121,
         "generation_comparable":true
      },
      "AddReplica":{
         "node_id":5,
         "store_id":5,
         "replica_id":5,
         "type":1
      },
      "Reason":"rebalance",
      "Details":"{\"Target\":\"s5, converges:1, balance:0, rangeCount:4743\",\"Existing\":\"[\\ns6, converges:1, balance:1, rangeCount:4527\\ns3, converges:1, balance:0, rangeCount:4743\\ns2, converges:0, balance:0, rangeCount:5004]\"}"
   },
   {
      "Timestamp":"2019-08-23 18:30:40.522182+00:00",
      "UpdatedDesc":{
         "range_id":3157,
         "start_key":"wYnk+AGGfA==",
         "end_key":"wYnl92MQ",
         "internal_replicas":[
            {
               "node_id":6,
               "store_id":6,
               "replica_id":1,
               "type":0
            },
            {
               "node_id":2,
               "store_id":2,
               "replica_id":4,
               "type":0
            },
            {
               "node_id":3,
               "store_id":3,
               "replica_id":3,
               "type":0
            }
         ],
         "next_replica_id":6,
         "generation":122,
         "generation_comparable":true
      },
      "RemovedReplica":{
         "node_id":5,
         "store_id":5,
         "replica_id":5,
         "type":1
      },
      "Reason":"rebalance",
      "Details":"{\"Target\":\"s5, converges:1, balance:0, rangeCount:4743\",\"Existing\":\"[\\ns6, converges:1, balance:1, rangeCount:4527\\ns3, converges:1, balance:0, rangeCount:4743\\ns2, converges:0, balance:0, rangeCount:5004]\"}"
   }
]

The node crashed at 190823 18:30:39.636004, so right after it was added back to the Range. This is all consistent with everything else that has been said.

nvanbenschoten commented 5 years ago

So going back to what must have gone wrong, it's very strange that we see the log lines

I190822 21:27:54.723498 883796922 storage/store.go:2593  [n5,replicaGC,s5,r3157/2:/Table/57/1/9{2/99964-3/25360}] removing replica r3157/2
I190822 21:27:54.724112 883796922 storage/replica_destroy.go:152  [n5,replicaGC,s5,r3157/2:/Table/57/1/9{2/99964-3/25360}] removed 25401 (25396+5) keys in 1ms [clear=0ms commit=0ms]

but we don't observe the Range ID local keys to be deleted. For reference, here are all of the keys we see that we would have expected to be deleted:

/Local/RangeID/3157/r/RangeLastGC
/Local/RangeID/3157/r/RangeAppliedState
/Local/RangeID/3157/r/RangeLease
/Local/RangeID/3157/u/RaftTombstone
/Local/RangeID/3157/u/RaftHardState
/Local/RangeID/3157/u/RaftLog/logIndex:{66855-66957}
/Local/RangeID/3157/u/RaftTruncatedState
/Local/RangeID/3157/u/RangeLastReplicaGCTimestamp

Interestingly, we don't see any Range local or global keys left in the Range's keyspace. This indicates that out of the following three key range deletions, only the first failed: https://github.com/cockroachdb/cockroach/blob/7be1e524888cebeafda94858c073cd796b13b429/pkg/storage/rditer/replica_data_iter.go#L46-L50

We know that this deletion was run with engine.ClearRangeWithHeuristic and that there are more than 64 raft log keys (clearRangeMinKeys), so it would have resulted in a RocksDB range deletion tombstone being written over all of these keys. Where did this range deletion tombstone go?
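
For reference, the heuristic amounts to something like this (a simplified sketch with assumed helper names, not the exact engine code):

// Sketch: past clearRangeMinKeys (64) keys, a single RocksDB range deletion
// tombstone (ClearRange) covers the span; below that, per-key deletes avoid
// piling up tombstones that are cheap to write but expensive to read past.
const clearRangeMinKeys = 64

if keyCount >= clearRangeMinKeys {
    err = batch.ClearRange(start, end) // one range deletion tombstone over the span
} else {
    err = clearPointKeys(batch, start, end) // assumed helper: delete keys one by one
}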

nvanbenschoten commented 5 years ago

I also see /Local/Store/suggestedCompaction/{/Table/57/1/92/99964-/Table/57/1/93/25360}, which indicates that we did get to: https://github.com/cockroachdb/cockroach/blob/7be1e524888cebeafda94858c073cd796b13b429/pkg/storage/replica_destroy.go#L93-L100

nvanbenschoten commented 5 years ago

So unless I'm missing something obvious, this all seems to point at a RangeDeletion tombstone being dropped somewhere after being added to a batch that was successfully committed. Unfortunately, it could have been any time between 2019-08-22 21:27:54 and 2019-08-23 18:30:39.

@ajkr do you have any suggestions about where an exploration into something like this would start? What tools do we have available at the RocksDB level for performing such an autopsy so late? ldb dump_wal doesn't show a single DELETE_RANGE in the WAL files that are still around, but I didn't have much hope that it would.

tbg commented 5 years ago

So unless I'm missing something obvious, this all seems to point at a RangeDeletion tombstone being dropped

I was worried that that's where we'd end up with this, the trail you've found looks pretty solid.

As far as repro-ing goes, I think we might get some mileage out of adding ./cockroach debug check-store into our roachtest harness and failing the test whenever it finds something. Unfortunately check-store isn't quite there yet, so we'd have to brush it up first, plus we'd only want to run this check when running the master or 19.2 binary.

bdarnell commented 5 years ago

I wonder if cockroachdb/rocksdb#43 missed a path through which the range deletion block could be read and cached. For example, if verify_checksums_before_ingest were true (it's not), we'd try to read the whole file to verify its checksums and wouldn't distinguish the range deletion block from other blocks (I haven't confirmed whether this would populate the cache or not).

Maybe the TableReader in GetIngestedFileInfo should be created without a block cache, so that nothing from the sst is cached before its global sequence number is assigned, instead of setting specific reads to bypass the cache.

(Or am I barking up the wrong tree here because the range tombstone here came from replica GC instead of from a snapshot ingestion?)

nvanbenschoten commented 5 years ago

Or am I barking up the wrong tree here because the range tombstone here came from replica GC instead of from a snapshot ingestion?

Yes, exactly. This range deletion tombstone wasn't ingested in an SST.

tbg commented 5 years ago

For reproing this, it's probably a good idea to have a loop that takes RocksDB checkpoints at intervals on top of keeping the WAL forever. If the problem is related to compactions, the WAL itself won't tell us anything except that it can confirm that the tombstones were put down and were lost. For the exact mechanics of how they were lost, we would need logs about the RocksDB compactions (I assume we can crank up the Rocks log level to get that) and ideally also checkpoints "sort of close" to the event.

petermattis commented 5 years ago

If this is a problem with a lost range deletion tombstone, @ajkr should be involved.

nvanbenschoten commented 5 years ago

We saw this again on cyan-06. This time, the new assertion we added in #40441 fired when creating the replica. Just like before, we see the replica (r2758/2) get destroyed and then the assertion fires 12 hours later.

cockroach.cockroach-cyan-06.root.2019-09-09T07_08_49Z.096265.log:I190909 07:14:16.632329 67681 storage/store.go:2613  [n6,replicaGC,s6,r2758/2:/Table/53/1/-785{8180…-0161…}] removing replica r2758/2
cockroach.cockroach-cyan-06.root.2019-09-09T07_08_49Z.096265.log:I190909 07:14:16.635354 67681 storage/replica_destroy.go:152  [n6,replicaGC,s6,r2758/2:/Table/53/1/-785{8180…-0161…}] removed 179837 (179831+6) keys in 3ms [clear=1ms commit=2ms]
cockroach.cockroach-cyan-06.root.2019-09-09T17_10_53Z.005024.log:F190909 19:16:37.053891 247 storage/store.go:4053  [n6,s6] found non-zero HardState.Commit on uninitialized replica [n6,s6,r2758/?:{-}]. HS={Term:27 Vote:5 Commit:121484 XXX_unrecognized:[]}

nvanbenschoten commented 5 years ago

This appears to have the same symptoms. The Range local and userspace keys have been deleted but the RangeID local keys remain.

@ajkr are you available to pair on debugging this further? Knowing where to look from here is hitting the limits of my knowledge.

ajwerner commented 5 years ago

Have we been fishing for a fast repro for this? If so, check out https://github.com/cockroachdb/cockroach/pull/40868; it seems to do it reliably.