cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

raft: PANIC inflights buffer is never resized beyond the initial size #135223

Closed kvoli closed 2 weeks ago

kvoli commented 2 weeks ago

Seen in this OOB panic:


I241114 21:22:22.551442 12898 kv/kvserver/replica_raftstorage.go:521 ⋮ [T1,Vsystem,n2,s3,r35/9:‹/Table/3{2-3}›] 291  applied snapshot df45e38e from (n9,s17):7 at applied index 74 ‹as write ›(total=1ms data=2.2 KiB excise=true ingestion=6@1ms)
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292  a panic has occurred!
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +runtime error: index out of range [128] with length 128
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +(1) attached stack trace
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  -- stack trace:
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | runtime.gopanic
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       GOROOT/src/runtime/panic.go:770
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | runtime.goPanicIndex
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       GOROOT/src/runtime/panic.go:114
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | github.com/cockroachdb/cockroach/pkg/raft/tracker.(*Inflights).Add
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       pkg/raft/tracker/inflights.go:84
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | github.com/cockroachdb/cockroach/pkg/raft/tracker.(*Progress).SentEntries
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       pkg/raft/tracker/progress.go:191
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | github.com/cockroachdb/cockroach/pkg/raft.(*raft).prepareMsgApp
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       pkg/raft/raft.go:625
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | github.com/cockroachdb/cockroach/pkg/raft.(*raft).maybePrepareMsgApp
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       pkg/raft/raft.go:654
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | github.com/cockroachdb/cockroach/pkg/raft.(*RawNode).SendMsgApp
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       pkg/raft/rawnode.go:210
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver/kvflowcontrol/replica_rac2.raftNodeForRACv2.SendMsgAppRaftMuLocked
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       pkg/kv/kvserver/kvflowcontrol/replica_rac2/raft_node.go:74
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver/kvflowcontrol/rac2.(*replicaSendStream).handleReadyEntriesRaftMuAndStreamLocked
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       pkg/kv/kvserver/kvflowcontrol/rac2/range_controller.go:2651
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver/kvflowcontrol/rac2.(*replicaState).handleReadyEntriesRaftMuLocked
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       pkg/kv/kvserver/kvflowcontrol/rac2/range_controller.go:2295
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver/kvflowcontrol/rac2.(*rangeController).HandleRaftEventRaftMuLocked
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       pkg/kv/kvserver/kvflowcontrol/rac2/range_controller.go:1172
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver/kvflowcontrol/replica_rac2.(*processorImpl).HandleRaftReadyRaftMuLocked
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       pkg/kv/kvserver/kvflowcontrol/replica_rac2/processor.go:784
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       pkg/kv/kvserver/replica_raft.go:1017
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       pkg/kv/kvserver/replica_raft.go:836
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       pkg/kv/kvserver/store_raft.go:682
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       pkg/kv/kvserver/scheduler.go:419
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       pkg/kv/kvserver/scheduler.go:319
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       pkg/util/stop/stopper.go:498
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  | runtime.goexit
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +  |       src/runtime/asm_amd64.s:1695
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +Wraps: (2) runtime error: index out of range [128] with length 128
E241114 21:22:34.036920 406 1@util/log/logcrash/crash_reporting.go:192 ⋮ [T1,Vsystem,n2] 292 +Error types: (1) *withstack.withStack (2) runtime.boundsError

This is caused by the grow function never actually growing the buffer beyond Inflights.size:

https://github.com/cockroachdb/cockroach/blob/ef26d1534507f6089aa909e0bd43a0efb71dd6c7/pkg/raft/tracker/inflights.go#L96-L98
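
For readers who don't want to follow the link, here is a minimal sketch of the failure mode. It is not the CockroachDB code: `ring`, `capped`, and `addN` are hypothetical names, and wraparound and freeing are left out. It assumes only what the issue describes, a buffer that doubles on demand but is capped at `size`, and a caller that adds one entry more than `size`.

```go
package main

import "fmt"

// ring is a hypothetical, simplified stand-in for an in-flight tracker.
// It only appends; wraparound and freeing are deliberately omitted.
type ring struct {
	count  int      // number of entries currently stored
	size   int      // configured limit on in-flight entries
	buffer []uint64 // grows on demand
	capped bool     // if true, grow never allocates more than size slots
}

// grow doubles the buffer. With capped set, the new length is clamped to
// size, so once len(buffer) == size a further call changes nothing.
func (r *ring) grow() {
	newSize := len(r.buffer) * 2
	if newSize == 0 {
		newSize = 1
	}
	if r.capped && newSize > r.size {
		newSize = r.size
	}
	newBuffer := make([]uint64, newSize)
	copy(newBuffer, r.buffer)
	r.buffer = newBuffer
}

// add stores index in the next free slot, growing the buffer first if needed.
func (r *ring) add(index uint64) {
	next := r.count
	if next >= len(r.buffer) {
		r.grow()
	}
	r.buffer[next] = index // out of range if grow was clamped
	r.count++
}

// addN adds n entries and converts a panic into an error for inspection.
func addN(r *ring, n int) (err error) {
	defer func() {
		if p := recover(); p != nil {
			err = fmt.Errorf("%v", p)
		}
	}()
	for i := 0; i < n; i++ {
		r.add(uint64(i))
	}
	return nil
}

func main() {
	// Capped growth: the 129th add indexes slot 128 of a 128-slot buffer.
	fmt.Println(addN(&ring{size: 128, capped: true}, 129))
	// Uncapped growth: the buffer doubles to 256 and the add succeeds.
	fmt.Println(addN(&ring{size: 128, capped: false}, 129))
}
```

In this sketch the capped variant fails with the same kind of bounds error as the log above, while the uncapped variant does not. Dropping the clamp is just one way to avoid the panic in the sketch and is not necessarily what the linked PRs do.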

This affects master and v24.3. So far it has only reproduced in perturbation/full/backfill, with kvadmission.flow_control.mode set to apply_to_all and the elastic disk bandwidth limiter enabled.

Tentatively marking as a GA blocker. The fix appears simple.

Jira issue: CRDB-44405

blathers-crl[bot] commented 2 weeks ago

Based on the specified backports for linked PR #135237, I applied the following new label(s) to this issue: branch-release-24.3.0-rc. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

kvoli commented 2 weeks ago

Will be closed by https://github.com/cockroachdb/cockroach/pull/135279 and https://github.com/cockroachdb/cockroach/pull/135277.