Closed spaskob closed 1 year ago
This feels like a thing @andreimatei would know about. The Ctx()
management is detailed here:
https://github.com/cockroachdb/cockroach/blob/84d8304554982797d8abfbbade346f142ea670fb/pkg/sql/conn_executor.go#L209-L228
Still not obvious to me why this would ever return nil when transitioning through commit.
Some interesting logs just before the crash:
17:23:44.513108 190357 server/authentication.go:399 Web session error: http: named cookie not present
17:23:45.613124 190467 server/authentication.go:399 Web session error: http: named cookie not present
17:23:46.697268 190586 server/authentication.go:399 Web session error: http: named cookie not present
17:23:47.801236 190678 server/authentication.go:399 Web session error: http: named cookie not present
17:23:48.966350 190802 server/authentication.go:399 Web session error: http: named cookie not present
17:23:51.142537 190965 server/authentication.go:399 Web session error: http: named cookie not present
17:23:52.683448 185249 sql/conn_executor.go:796 [n1,client=XXXXXX,hostssl,user=XXXX] a SQL panic has occurred while executing "COMMIT TRANSACTION": runtime error: invalid memory address or nil pointer dereference
Here are some logs from a previous crash:
vendor/google.golang.org/grpc/clientconn.go:1206 grpc: addrConn.createTransport failed to connect to {XXXXXX-node1:26258 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp XXXX:XXXXXX: connect: connection refused". Reconnecting...
E200710 14:24:49.221823 29951270 sql/conn_executor.go:796 [n1,client=....] a SQL panic has occurred while executing "COMMIT TRANSACTION": runtime error: invalid memory address or nil pointer dereference
I've found a case where I think we mess up the context, but it doesn't look like that's the case this report is hitting.
The case in which I think we're leaving a nil ctx is here:
https://github.com/cockroachdb/cockroach/blob/3639ec53c03c9606f187577d684712c6e0bc9427/pkg/sql/conn_fsm.go#L368-L369
This event is generated here when the connExecutor
is closing (i.e. when the pgwire connection is terminated). This event leaves the connExecutor
in stateAborted
but also clears txnState.Ctx
(through cleanupAndFinishOnError()
. That's not kosher, since connEx.Ctx()
expects there to be a non-nil txnState.Ctx
whenever the state is not stateNoTxn
. Even though the executor is closing, I don't think this is benign. I think we might try to use that nil ctx and crash.
However, in this report the connExecutor
is not closing. And so I don't know how ex.Ctx
returns nil. Unless someone else figures it out, I think we need more instrumentation - like an assertion in connEx.Ctx()
that logs the current ex.machine.CurState()
if it ever wants to return nil. I've ran a bunch of tests with this assertion and didn't hit anything.
Spun out the connection close issue in #51467
Hey @andreimatei here is the new failing message with the instrumented binary that you provided:
F200717 06:20:30.156151 4050 sql/conn_executor.go:2167 nil ctx; see https://github.com/cockroachdb/cockroach/issues/51460
input event: sql.eventNonRetriableErr{IsCommit:true}, old state: Aborted, new state: Aborted, txn event: txnRollback, advance: skipBatch, stmt: COMMIT TRANSACTION, connCtx: 0xc004d6afc0, hijackedCtx: %!p(<nil>), txn ctx: %!p(<nil>)
goroutine 4050 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x7526c00, 0xed6a33a2e, 0x0, 0x12)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8
github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0x75238c0, 0xc000000004, 0x6a6cbec, 0x14, 0x877, 0xc005fa2640, 0x12a)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:210 +0xa29
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x4bd0120, 0xc0000f0090, 0x4, 0x2, 0x440fe37, 0xc2, 0xc0064994b0, 0x9, 0x9)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2c9
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x4bd0120, 0xc0000f0090, 0x1, 0x4, 0x440fe37, 0xc2, 0xc0064994b0, 0x9, 0x9)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:44 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:155
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).txnStateTransitionsApplyWrapper(0xc003fd7000, 0x4b4d220, 0xc00a91e390, 0x3edc4e0, 0xc00a91e3a0, 0x7f109d4f8ea8, 0xc0087ac680, 0x3b, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:2167 +0x8e0
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd(0xc003fd7000, 0x4bd00e0, 0xc005c4dc00, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1497 +0x952
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc003fd7000, 0x4bd00e0, 0xc004d6afc0, 0xc000100100, 0x5400, 0x15000, 0xc000100198, 0xc00acbbb30, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1297 +0x1f2
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc000632680, 0x4bd00e0, 0xc004d6afc0, 0xc003fd7000, 0x5400, 0x15000, 0xc000100198, 0xc00acbbb30, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:476 +0x104
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1(0xc00581ed69, 0xc003cce600, 0x4bd00e0, 0xc004d6afc0, 0xc00acbbb30, 0xc000632680, 0xc0087ac000, 0x4bf3de0, 0xc0011c2c30, 0xc004e4b4a0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:591 +0x2e8
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync
/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:523 +0x17e
The full logs are at the zendesk ticket: https://cockroachdb.zendesk.com/agent/tickets/5806
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
Describe the problem
This is a stack trace from the node crash:
The nil pointer dereference is due to a
nil
context being threaded though the call stack and originally passed fromgithub.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).resetExtraTxnState(0xc006e79000, 0x0, 0x0, 0xc00032a9c0, 0x3, 0x40f82c0, 0xc006e79090)
.This is the actual function call from
conn_executor.go
:To Reproduce Unable to create a clean repro - the bug was reported by a customer.
Expected behavior We should probably check if
ex.Ctx() != nil
before callingresetExtraTxnState
.Environment:
Additional context Node crashes.
Jira issue: CRDB-4043