cc @cockroachdb/replication
Everything got redacted away, naturally, but the stack trace shows that we're hitting this when a replica splits.
There is only a single event in the report, from an official release binary, on a 32-vCPU machine in a cluster with at least three nodes (the nodeID is 3). This looks like a legit deployment. I don't see any other events for this cluster, but I can't even find this particular event via search, so that's not really supporting evidence - it likely just reflects the fact that this report is nearing its three-month anniversary.
A precondition for calling into this code is that the range's start key has the node liveness prefix:
So one possibility would be that we accidentally split the node liveness range. We're not supposed to split it, and there are checks against this:
used by
which is called in
and this method is involved in every split, which I verified via this on a test cluster:
for i, k := range []roachpb.Key{
	keys.NodeLivenessPrefix.Next(),
	keys.NodeLivenessKey(123),
} {
	_, _, err := tc.SplitRange(k)
	require.Errorf(t, err, "expected error for idx %d at key %s", i, k)
}
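For illustration only, here's a minimal sketch of the kind of guard such a check amounts to. This is not the actual CockroachDB implementation; it just assumes a static list of protected spans (such as the node liveness span) that a proposed split key must not fall inside:

```go
// Hypothetical sketch, not the real API: reject split keys that land strictly
// inside a protected span, e.g. the node liveness span
// [NodeLivenessPrefix, NodeLivenessMax). Assumes the usual bytes/roachpb imports.
func isValidSplitKey(key roachpb.Key, protected []roachpb.Span) bool {
	for _, sp := range protected {
		if bytes.Compare(key, sp.Key) > 0 && bytes.Compare(key, sp.EndKey) < 0 {
			return false // splitting strictly inside a protected span is illegal
		}
	}
	return true // the span's own boundaries remain legal split keys
}
```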
Additionally, the liveness range is fully split out at cluster bootstrap
However, it looks like merges can affect the liveness range 😬
merge right neighbor:
client_merge_test.go:161: before: r2:/System/NodeLiveness{-Max} [(n1,s1):1, next=2, gen=0]
client_merge_test.go:165: adminmerge: <nil>
client_merge_test.go:168: after: r2:/System/{NodeLiveness-tsd} [(n1,s1):1, next=2, gen=1]
getting merged by left neighbor:
client_merge_test.go:160: before: r2:/System/NodeLiveness{-Max} [(n1,s1):1, next=2, gen=0]
client_merge_test.go:164: adminmerge: <nil>
client_merge_test.go:167: after: r1:/{Min-System/NodeLivenessMax} [(n1,s1):1, next=2, gen=1]
This is invoking AdminMerge
directly, but I don't see anything in the merge queue that would prevent this. Enqueuing r2 on a roachprod cluster I had lying around, I see that it won't merge:
kv/kvserver/merge_queue.go:295 [n3,s3,r2/5:/System/NodeLiveness{-Max}] skipping merge to avoid thrashing: merged range r0:/System/{NodeLiveness-tsd} [, next=0, gen=0] may split (estimated size: 571021)
because the queue realizes that upon the merge, a split would be required right away. If I played the interface golf correctly, the actual implementation starts here:
and the meat is here (the caller will fatal on error):
What's interesting here is that this is populated asynchronously and starts out empty. So it seems conceivable that if the rangefeed that provides the spanconfig updates is delayed, we could somehow get into a state where the merge queue entertains merging the node liveness range.
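To make the hazard concrete, here is a rough sketch of the shape of that decision (hypothetical names, not the actual merge queue code), assuming a span config reader that starts out empty and only catches up via the rangefeed:

```go
// Hypothetical names, not the actual merge queue code.
type splitKeyReader interface {
	// ComputeSplitKey returns the first mandated split key in (start, end),
	// or nil if it knows of none.
	ComputeSplitKey(start, end roachpb.RKey) roachpb.RKey
}

// mergeLooksSafe asks "would the merged range have to split again right away?".
// With a reader that hasn't been populated yet, the answer is nil even for a
// span covering the liveness range, so the merge appears safe.
func mergeLooksSafe(r splitKeyReader, lhsStart, rhsEnd roachpb.RKey) bool {
	return r.ComputeSplitKey(lhsStart, rhsEnd) == nil
}
```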
It seems to be a little worse on master
(which presumably matches 23.1 too), because shouldSplitRange
returns false
on errors as well:
The assumption in shouldSplitRange
is clearly that "doing nothing is safe". But in the merge caller, saying that you don't need to split can lead to "un-splitting" (merging) the liveness range, which is ¡no bueno!
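Sketched out with a hypothetical signature, the asymmetry looks roughly like this: the same "false" answer that is harmless to the split queue is actively dangerous to the merge queue.

```go
// Illustrative only, hypothetical signature: the "doing nothing is safe"
// assumption baked into a shouldSplitRange-style helper.
func shouldSplitRange(needsSplit func() (bool, error)) bool {
	ok, err := needsSplit()
	if err != nil {
		// Safe for the split queue: we simply retry later. But the merge queue
		// reads this same "false" as "the merged range won't have to re-split",
		// which can green-light un-splitting the liveness range.
		return false
	}
	return ok
}
```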
It's hard to say what exactly happened in the cluster - did an ill-advised direct AdminMerge
get directed at the liveness range or its left neighbor; was there some delayed spanconfig scenario which allowed the merge queue to merge away the liveness range; or something else entirely - what seems clear is that we should very directly protect static splits from being undone by merges.
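A minimal sketch of what such a protection could look like on the merge side (hypothetical names, not an actual implementation): refuse the merge whenever the RHS start key is one of the static split points.

```go
// Hypothetical sketch: a merge that would erase a static split boundary is
// rejected outright, regardless of what the span config reader currently says.
func mergeWouldUndoStaticSplit(rhsStart roachpb.RKey, staticSplits []roachpb.RKey) bool {
	for _, k := range staticSplits {
		if rhsStart.Equal(k) {
			return true // merging here would erase a boundary that must always exist
		}
	}
	return false
}
```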
Great analysis! However, I'm not clear how we split the range.
My understanding of what happened is something like:
Original
|----- Liveness ----- | ----- Next Range -----|
Post merge (root cause unknown, but not safely prevented). Likely due to either a manual AdminMerge
or a timing race in spanConfigStore.computeSplitKey
which returned an error.
|----- Liveness ----------- Next Range -----|
shouldSplit() == true
for this merged range now.
Attempt split. Why is this split key chosen?
|----- L1 -|--L2--------- Next Range -----|
At this point, both ranges will be "priority" ranges for the raft scheduler, and registering the second one causes the panic.
What I don't understand is how ComputeSplitKey can ever return a key within the liveness range. It looks at the staticSplits
first and should split at the end of the liveness range.
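For reference, a rough sketch of what that static-split lookup conceptually does (illustrative, not the actual spanconfig code): the first static split key strictly inside the merged range should be the liveness range's end key, never a key inside it.

```go
// Hypothetical sketch: staticSplits is assumed to be the sorted list of keys
// that must always be range boundaries (the liveness span's bounds among them).
func computeStaticSplitKey(start, end roachpb.RKey, staticSplits []roachpb.RKey) roachpb.RKey {
	for _, k := range staticSplits {
		if start.Less(k) && k.Less(end) {
			// First static boundary strictly inside (start, end); for a range
			// that swallowed the liveness range, this should be the liveness
			// end key, not a key within the liveness span.
			return k
		}
	}
	return nil
}
```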
The one way I can see this happening is if COCKROACH_DISABLE_SPAN_CONFIGS
was set. In this case, it would use the noopKVSubscriber
which returns a dummy key, but that also always returns false for NeedsSplit.
Regardless, I don't think it should be a bug that we merge the liveness range with its neighbor; it should be discouraged, since it will re-split immediately. It does seem to be a bug that we are splitting it right afterward.
As a side note, this would not have crashed in 23.2 once #101023 is merged.
My guess is that liveness was merged into the meta range, and then split back out because of the static split points. At that point, the old liveness range was already registered as a priority range in the scheduler, and adding the new one will panic (we never unregister these priority ranges since the liveness range never changes its ID).
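For context, a simplified sketch of the guard that fires here (approximate, not the exact scheduler source): the priority range ID is set once, never unregistered, and a later registration under a different range ID panics.

```go
// Approximate sketch of the scheduler-side guard; fields and message wording
// are simplified, not copied from the source.
type rangeIDQueue struct {
	priorityID roachpb.RangeID
}

func (q *rangeIDQueue) SetPriorityID(id roachpb.RangeID) {
	if q.priorityID != 0 && q.priorityID != id {
		panic(fmt.Sprintf("priority range ID already set: old=%d, new=%d",
			q.priorityID, id))
	}
	q.priorityID = id
}
```

On the re-split, the new right-hand replica registers itself under a fresh range ID while the old liveness range ID is still registered, which is the panic in the stack trace below.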
I have a prototype to disallow these splits in https://github.com/cockroachdb/cockroach/pull/101727. I think it works, but various tests need to be updated (since they do violate the static splits). This could be done with a testing knob. I'm not working on this any more, in order to focus on more urgent tasks until my extended leave.
This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.
Sentry link: https://sentry.io/organizations/cockroach-labs/issues/3910802942/?referrer=webhooks_plugin
Panic message:
Stacktrace (expand for inline code snippets):
GOROOT/src/runtime/panic.go#L883-L885 in runtime.gopanic
https://github.com/cockroachdb/cockroach/blob/cf1e7e6bc6ef9e55510b9ed13bd068bb3894cd92/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go#L126-L128 in pkg/kv/kvserver.(*rangeIDQueue).SetPriorityID
https://github.com/cockroachdb/cockroach/blob/cf1e7e6bc6ef9e55510b9ed13bd068bb3894cd92/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go#L242-L244 in pkg/kv/kvserver.(*raftScheduler).SetPriorityID
https://github.com/cockroachdb/cockroach/blob/cf1e7e6bc6ef9e55510b9ed13bd068bb3894cd92/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go#L370-L372 in pkg/kv/kvserver.(*Replica).setDescLockedRaftMuLocked
https://github.com/cockroachdb/cockroach/blob/cf1e7e6bc6ef9e55510b9ed13bd068bb3894cd92/pkg/kv/kvserver/pkg/kv/kvserver/replica_init.go#L207-L209 in pkg/kv/kvserver.(*Replica).loadRaftMuLockedReplicaMuLocked
https://github.com/cockroachdb/cockroach/blob/cf1e7e6bc6ef9e55510b9ed13bd068bb3894cd92/pkg/kv/kvserver/pkg/kv/kvserver/store_split.go#L246-L248 in pkg/kv/kvserver.prepareRightReplicaForSplit
https://github.com/cockroachdb/cockroach/blob/cf1e7e6bc6ef9e55510b9ed13bd068bb3894cd92/pkg/kv/kvserver/pkg/kv/kvserver/store_split.go#L162-L164 in pkg/kv/kvserver.splitPostApply
https://github.com/cockroachdb/cockroach/blob/cf1e7e6bc6ef9e55510b9ed13bd068bb3894cd92/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_result.go#L250-L252 in pkg/kv/kvserver.(*Replica).handleSplitResult
https://github.com/cockroachdb/cockroach/blob/cf1e7e6bc6ef9e55510b9ed13bd068bb3894cd92/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go#L1356-L1358 in pkg/kv/kvserver.(*replicaStateMachine).handleNonTrivialReplicatedEvalResult
https://github.com/cockroachdb/cockroach/blob/cf1e7e6bc6ef9e55510b9ed13bd068bb3894cd92/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go#L1231-L1233 in pkg/kv/kvserver.(*replicaStateMachine).ApplySideEffects
https://github.com/cockroachdb/cockroach/blob/cf1e7e6bc6ef9e55510b9ed13bd068bb3894cd92/pkg/kv/kvserver/apply/cmd.go#L205-L207 in pkg/kv/kvserver/apply.mapCheckedCmdIter
https://github.com/cockroachdb/cockroach/blob/cf1e7e6bc6ef9e55510b9ed13bd068bb3894cd92/pkg/kv/kvserver/apply/task.go#L289-L291 in pkg/kv/kvserver/apply.(*Task).applyOneBatch
https://github.com/cockroachdb/cockroach/blob/cf1e7e6bc6ef9e55510b9ed13bd068bb3894cd92/pkg/kv/kvserver/apply/task.go#L245-L247 in pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries
https://github.com/cockroachdb/cockroach/blob/cf1e7e6bc6ef9e55510b9ed13bd068bb3894cd92/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go#L1044-L1046 in pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
https://github.com/cockroachdb/cockroach/blob/cf1e7e6bc6ef9e55510b9ed13bd068bb3894cd92/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go#L663-L665 in pkg/kv/kvserver.(*Replica).handleRaftReady
https://github.com/cockroachdb/cockroach/blob/cf1e7e6bc6ef9e55510b9ed13bd068bb3894cd92/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go#L640-L642 in pkg/kv/kvserver.(*Store).processReady
https://github.com/cockroachdb/cockroach/blob/cf1e7e6bc6ef9e55510b9ed13bd068bb3894cd92/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go#L307-L309 in pkg/kv/kvserver.(*raftScheduler).worker
https://github.com/cockroachdb/cockroach/blob/cf1e7e6bc6ef9e55510b9ed13bd068bb3894cd92/pkg/util/stop/stopper.go#L488-L490 in pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
GOROOT/src/runtime/asm_amd64.s#L1593-L1595 in runtime.goexit

v22.2.1
Jira issue: CRDB-24089