cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.77k stars 3.76k forks source link

kvserver: v22.1.4: nil pointer in poisonInflightLatches #86547

Open cockroach-teamcity opened 2 years ago

cockroach-teamcity commented 2 years ago

This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.

Sentry link: https://sentry.io/organizations/cockroach-labs/issues/3521479942/?referrer=webhooks_plugin

Panic message:

panic.go:1038: runtime error: invalid memory address or nil pointer dereference (1) attached stack trace -- stack trace: | runtime.gopanic | GOROOT/src/runtime/panic.go:1038 | runtime.panicmem | GOROOT/src/runtime/panic.go:221 | runtime.sigpanic | GOROOT/src/runtime/signal_unix.go:735 | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(Replica).poisonInflightLatches | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1384 | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(replicaCircuitBreaker).asyncProbe.func1 | github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:218 | github.com/cockroachdb/cockroach/pkg/util/stop.(Stopper).RunAsyncTaskEx.func2 | github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:494 | runtime.goexit | GOROOT/src/runtime/asm_amd64.s:1581 Wraps: (2) runtime error: invalid memory address or nil pointer dereference Error types: (1) withstack.withStack (2) runtime.errorString -- report composition: runtime.errorString panic.go:1038: *withstack.withStack (top exception)

Stacktrace (expand for inline code snippets): GOROOT/src/runtime/panic.go#L1037-L1039 in runtime.gopanic GOROOT/src/runtime/panic.go#L220-L222 in runtime.panicmem GOROOT/src/runtime/signal_unix.go#L734-L736 in runtime.sigpanic https://github.com/cockroachdb/cockroach/blob/3c6c8933f578a7fd140e24a603d6ec64c6b7a834/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go#L1383-L1385 in pkg/kv/kvserver.(*Replica).poisonInflightLatches https://github.com/cockroachdb/cockroach/blob/3c6c8933f578a7fd140e24a603d6ec64c6b7a834/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go#L217-L219 in pkg/kv/kvserver.(*replicaCircuitBreaker).asyncProbe.func1 https://github.com/cockroachdb/cockroach/blob/3c6c8933f578a7fd140e24a603d6ec64c6b7a834/pkg/util/stop/stopper.go#L493-L495 in pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2 GOROOT/src/runtime/asm_amd64.s#L1580-L1582 in runtime.goexit
GOROOT/src/runtime/panic.go in runtime.gopanic at line 1038
GOROOT/src/runtime/panic.go in runtime.panicmem at line 221
GOROOT/src/runtime/signal_unix.go in runtime.sigpanic at line 735
pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go in pkg/kv/kvserver.(*Replica).poisonInflightLatches at line 1384
pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go in pkg/kv/kvserver.(*replicaCircuitBreaker).asyncProbe.func1 at line 218
pkg/util/stop/stopper.go in pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2 at line 494
GOROOT/src/runtime/asm_amd64.s in runtime.goexit at line 1581
Tag Value
Cockroach Release v22.1.4
Cockroach SHA: 3c6c8933f578a7fd140e24a603d6ec64c6b7a834
Platform linux amd64
Distribution CCL
Environment v22.1.4
Command server
Go Version ``
# of CPUs
# of Goroutines

Jira issue: CRDB-18809

Epic CRDB-39898

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/replication

erikgrinaker commented 1 year ago

Can you have a look @tbg?

tbg commented 1 year ago

This panics here:

https://github.com/cockroachdb/cockroach/blob/d10bde7a099716fcc4c3270c67996899fb121a01/pkg/kv/kvserver/replica_raft.go#L1384

At this point we know that p.ec is not nil. Since .Req.PoisonPolicy all sits on *Guard, it must be the case that p.ec.g is nil:

https://github.com/cockroachdb/cockroach/blob/0a2396a5b86399bc4b2e301348354ed896be8f32/pkg/kv/kvserver/replica_send.go#L1159-L1163

I poked around a bit and I'm not aware that we should ever have a proposal with a nil *Guard in the proposals map. It is possible for SequenceReq to return a nil *Guard, but this is the case in which it directly returns a request, and such proposals ought not to end up below raft.

@nvanbenschoten do you expect to see a nil Guard here?

tbg commented 1 year ago

@nvanbenschoten via DM

I did find that in endCmds.poison, we have an // Already cleared. branch. That indicates to me that it is possible to have a proposal with a nil *Guard in the proposals map.

What's implicit here is that Nathan does not expect endCmds to be created and put into the map with a nil Guard, it must've been zeroed out later, probably as part of ec.done().

tbg commented 1 year ago

Looked at this again a bit. Still haven't sussed it all out but I think it's reasonable to assume that some proposals in the map can be finished (so their endCmds are cleared) without strict sync with the proposals map. It's cleared only from p.finishApplication but that method has a few callers including from the proposal buf, which also has to deal with reproposals, etc. Will scrutinize all of this code and probably send a PR that makes it a bit clearer on master. On the older releases (22.1 etc) we should just backport the nil check for p.ec.g.

msbutler commented 1 year ago

a heads up that this failure surfaced in a nightly run of backupccl's TestDataDriven/restore-schema-only-multiregion here. Unfortunately, we were not quick enough to triage before artifact deletion. Stressing this test on master has lead to a different failure I've outlined here related to server startup.

stevendanna commented 1 year ago

We've seen a similar looking failure again on TestDataDriven. We have aftifacts this time if they are helpful: https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_StressBazel/8179790:id/tmp/_tmp/95e138e66d69292427dfb9528cf06d04/logTestDataDriven4271694788/backupccltest.log

E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627  a panic has occurred!
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +runtime error: invalid memory address or nil pointer dereference
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +(1) attached stack trace
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  -- stack trace:
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | runtime.gopanic
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  |   GOROOT/src/runtime/panic.go:884
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | runtime.panicmem
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  |   GOROOT/src/runtime/panic.go:260
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | runtime.sigpanic
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  |   GOROOT/src/runtime/signal_unix.go:835
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).poisonInflightLatches
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1363
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaCircuitBreaker).asyncProbe.func1
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  |   github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:200
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  |   github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  | runtime.goexit
E230104 08:37:44.513251 500410339 1@util/log/logcrash/crash_reporting.go:188 ⋮ [n1,s1,r49/1:‹/Table/4{7-8}›] 62627 +  |   GOROOT/src/runtime/asm_amd64.s:1594
tbg commented 1 year ago

I think it will be easy to "fix" the bug - just add a nil check - but I think there is something to be understood yet. I am still not seeing how we can end up with a proposal in r.mu.proposals that is finished.

I went through all of the callers to endCmds.done below and verified that they all make sure that the proposal is not in the map by the time .done() is called.

image

Most of them "obviously" do (for some of them it's clear that the proposal is never added to the map in the first place). There are two exceptions:

AckOutcomeAndFinish

At the beginning of command application, a replicaAppBatch will be created and will grab^2 the proposals from the map (removing them). However, it doesn't do this for all proposals, only for those that aren't superseded by a reproposal with a higher LeaseAppliedIndex[^1]. The superseded ones could be of interested - they are still in the map during application, so perhaps we're signalling them? There are in fact potential problems here.

The first is in the code below, which will signal all proposals, including superseded ones, meaning they will be signaled & stay in the map:

https://github.com/cockroachdb/cockroach/blob/edf6976bfe9ed23b047575d65f2597005757dec3/pkg/kv/kvserver/apply/task.go#L260-L270

The one error we expect to see from applyOneBatch (thus triggering this code path) is ErrRemoved. It's unlikely that we saw this one in this instance, since the system was basically defunct for over a minute and so replicaGC is not too likely, plus we don't see log messages that should have occurred prior to the crash. But I'll make a note that something is wrong here.

I also considered whether we might be signaling proposals that are in the map in the error path here:

https://github.com/cockroachdb/cockroach/blob/54e3708c8ccc850e5edd7a77c74bdb70aacde3a5/pkg/kv/kvserver/replica_application_result.go#L111-L120

but I had recently convinced myself[^1] that a proposal that hits this path was necessarily removed from the map already, in retrieveLocalProposals^2.

refreshProposalsLocked

This one is more interesting. Even though that method holds the appropriate locks throughout and does remove commands from the map prior to finishing them (in the few cases in which it does do that), what it does do in the common case is hand proposals that are in the map back to the proposal buffer (without finishing them):

https://github.com/cockroachdb/cockroach/blob/ac3e633279c7333fe6544d1376593f83f57a5add/pkg/kv/kvserver/replica_raft.go#L1345-L1347

This is possibly problematic - we now have a (yet unfinished, presumably) proposal in both the map and the proposal buffer. Could the proposal now be applied, finished, removed from the map but then be reinserted^3 due to being present in the proposal buffer? I'm actually not convinced this can happen, because the proposal buffer is flushed right at the beginning of raft processing (which includes entry application), meaning that by the time we might be applying the proposal the proposal buffer has already been emptied out^4 and won't re-insert into the map.

https://github.com/cockroachdb/cockroach/blob/ac3e633279c7333fe6544d1376593f83f57a5add/pkg/kv/kvserver/replica_raft.go#L718-L719


Next steps

The lifecycle of a proposal is pretty ad-hoc but there is clearly an invariant I am assuming should hold - that there isn't ever a finished proposal in the proposal map. We are not checking this invariant, but we should. We add to the map in a single place only^3 and should verify this here. If we then still see the crash but not the assertion - we know the proposal was finished while remaining in the map, still in violation of the invariant.

As outlined above, we know of at least one invariant violation (the snippet in the AckOutcomeAndFinish section) but maybe I missed another one. Actually, revisiting the log of the failing test above, maybe it is replicaGC-related after all, since this is a multi-region test and I see the replicaGCQueue repeatedly run into the circuit breaker. But I'm still unsure why there isn't any logging^5 then.

[^1]: better docs on this stuff incoming in https://github.com/cockroachdb/cockroach/pull/94633

msbutler commented 1 year ago

@tbg, this unrelated test has been frequently failing due to this error. If you think the review process for #94825 will take some time, I may skip the test to avoid CI flakiness.

tbg commented 1 year ago

Unassigning since I'm no longer working on this and we've made the code path resilient to avoid the crash. It's likely that Epic CRDB-25287 will solve the underlying problem, which likely had to do with a finished proposal being re-inserted into the proposals map, which is something we know is possible at the moment. (This is not thought to cause double-application but can cause crashes in code that assumes it's not possible).

tbg commented 1 year ago

The reproposal path has been simplified quite a bit as a result of CRDB-25287 and the problem is more or less understood:

https://github.com/cockroachdb/cockroach/pull/106750/files#diff-58a18041db5281b4abe232561907fa9683d5c58c6704679e82dc8964aab4ecfaR1304-R1316

The work-around is still in place above and below:

https://github.com/cockroachdb/cockroach/blob/fe1c4a052eef247fb932933b67dd50129e86c711/pkg/kv/kvserver/replica_raft.go#L1496-L1512

but armed with this comment

https://github.com/cockroachdb/cockroach/pull/106750/files#diff-488a090afc4b6eaf56cd6d13b347bac67cb3313ce11c49df9ee8cd95fd73b3e8R57-R73

we should feel comfortable going through another round of assertions (best using must^1). Either way, no user should currently be impacted by this. I'll keep it open, though.