cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.11k stars 3.81k forks source link

Node crashes on txn commit error due to nil context when attempting to log #51460

Closed spaskob closed 1 year ago

spaskob commented 4 years ago

Describe the problem

This is a stack trace from the node crash:

panic: panic while executing 1 statements: COMMIT TRANSACTION; caused by runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xea9c66]

goroutine 40589045 [running]:
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).closeWrapper(0xc006e79000, 0x4bcfb20, 0xc00364d140, 0x3dec3a0, 0x6e50eb0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:810 +0x3f9
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn.func1(0xc006e79000, 0x4bcfb20, 0xc00364d140)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:474 +0x61
panic(0x3dec3a0, 0x6e50eb0)
/usr/local/go/src/runtime/panic.go:679 +0x1b2
github.com/cockroachdb/cockroach/vendor/github.com/opentracing/opentracing-go.SpanFromContext(0x0, 0x0, 0xc007e71740, 0x0)
/go/src/github.com/cockroachdb/cockroach/vendor/github.com/opentracing/opentracing-go/gocontext.go:22 +0x26
github.com/cockroachdb/cockroach/pkg/util/log.getSpanOrEventLog(0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/trace.go:91 +0x39
github.com/cockroachdb/cockroach/pkg/util/log.eventInternal(0x0, 0x0, 0x100, 0x432c3b1, 0x13, 0xc00dc0d2a8, 0x1, 0x1)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/trace.go:106 +0x4d
github.com/cockroachdb/cockroach/pkg/util/log.vEventf(0x0, 0x0, 0x0, 0x1, 0x2, 0x432c3b1, 0x13, 0xc00dc0d2a8, 0x1, 0x1)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/trace.go:175 +0x142
github.com/cockroachdb/cockroach/pkg/util/log.VEventf(...)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/trace.go:190
github.com/cockroachdb/cockroach/pkg/sql.(*TableCollection).releaseLeases(0xc006e79228, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/table.go:516 +0xd3
github.com/cockroachdb/cockroach/pkg/sql.(*TableCollection).releaseTables(0xc006e79228, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/table.go:528 +0x43
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).resetExtraTxnState(0xc006e79000, 0x0, 0x0, 0xc00032a9c0, 0x3, 0x40f82c0, 0xc006e79090)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1165 +0x83
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).txnStateTransitionsApplyWrapper(0xc006e79000, 0x4b4cc60, 0xc011958840, 0x3edc1a0, 0xc011958850, 0x7fab830953c8, 0xc0072c4180, 0x2aa, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:2165 +0x54f
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd(0xc006e79000, 0x4bcfb20, 0xc009fb4340, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1497 +0x952
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc006e79000, 0x4bcfb20, 0xc00364d140, 0xc00001fc00, 0x5400, 0x15000, 0xc00001fc98, 0xc0058db350, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1297 +0x1f2
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc000c77600, 0x4bcfb20, 0xc00364d140, 0xc006e79000, 0x5400, 0x15000, 0xc00001fc98, 0xc0058db350, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:476 +0x104
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1(0xc009538dbd, 0xc017428ba0, 0x4bcfb20, 0xc00364d140, 0xc0058db350, 0xc000c77600, 0xc0072c3b00, 0x4bf3820, 0xc01c9a6060, 0xc00a292ea0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:591 +0x2e8
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync
/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:523 +0x17e

The nil pointer dereference is due to a nil context being threaded though the call stack and originally passed from github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).resetExtraTxnState(0xc006e79000, 0x0, 0x0, 0xc00032a9c0, 0x3, 0x40f82c0, 0xc006e79090).

This is the actual function call from conn_executor.go:

    case txnRestart, txnRollback:
        if err := ex.resetExtraTxnState(ex.Ctx(), ex.server.dbCache, advInfo.txnEvent); err != nil {
            return advanceInfo{}, err
        }

To Reproduce Unable to create a clean repro - the bug was reported by a customer.

Expected behavior We should probably check if ex.Ctx() != nil before calling resetExtraTxnState.

Environment:

Additional context Node crashes.

Jira issue: CRDB-4043

ajwerner commented 4 years ago

This feels like a thing @andreimatei would know about. The Ctx() management is detailed here: https://github.com/cockroachdb/cockroach/blob/84d8304554982797d8abfbbade346f142ea670fb/pkg/sql/conn_executor.go#L209-L228

Still not obvious to me why this would ever return nil when transitioning through commit.

spaskob commented 4 years ago

Some interesting logs just before the crash:

17:23:44.513108 190357 server/authentication.go:399  Web session error: http: named cookie not present
17:23:45.613124 190467 server/authentication.go:399  Web session error: http: named cookie not present
17:23:46.697268 190586 server/authentication.go:399  Web session error: http: named cookie not present
17:23:47.801236 190678 server/authentication.go:399  Web session error: http: named cookie not present
17:23:48.966350 190802 server/authentication.go:399  Web session error: http: named cookie not present
17:23:51.142537 190965 server/authentication.go:399  Web session error: http: named cookie not present
17:23:52.683448 185249 sql/conn_executor.go:796  [n1,client=XXXXXX,hostssl,user=XXXX] a SQL panic has occurred while executing "COMMIT TRANSACTION": runtime error: invalid memory address or nil pointer dereference
spaskob commented 4 years ago

Here are some logs from a previous crash:

vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {XXXXXX-node1:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp XXXX:XXXXXX: connect: connection refused". Reconnecting...
E200710 14:24:49.221823 29951270 sql/conn_executor.go:796  [n1,client=....] a SQL panic has occurred while executing "COMMIT TRANSACTION": runtime error: invalid memory address or nil pointer dereference
andreimatei commented 4 years ago

I've found a case where I think we mess up the context, but it doesn't look like that's the case this report is hitting.

The case in which I think we're leaving a nil ctx is here: https://github.com/cockroachdb/cockroach/blob/3639ec53c03c9606f187577d684712c6e0bc9427/pkg/sql/conn_fsm.go#L368-L369 This event is generated here when the connExecutor is closing (i.e. when the pgwire connection is terminated). This event leaves the connExecutor in stateAborted but also clears txnState.Ctx (through cleanupAndFinishOnError(). That's not kosher, since connEx.Ctx() expects there to be a non-nil txnState.Ctx whenever the state is not stateNoTxn. Even though the executor is closing, I don't think this is benign. I think we might try to use that nil ctx and crash.

However, in this report the connExecutor is not closing. And so I don't know how ex.Ctx returns nil. Unless someone else figures it out, I think we need more instrumentation - like an assertion in connEx.Ctx() that logs the current ex.machine.CurState() if it ever wants to return nil. I've ran a bunch of tests with this assertion and didn't hit anything.

andreimatei commented 4 years ago

Spun out the connection close issue in #51467

spaskob commented 4 years ago

Hey @andreimatei here is the new failing message with the instrumented binary that you provided:

F200717 06:20:30.156151 4050 sql/conn_executor.go:2167 nil ctx; see https://github.com/cockroachdb/cockroach/issues/51460
input event: sql.eventNonRetriableErr{IsCommit:true}, old state: Aborted, new state: Aborted, txn event: txnRollback, advance: skipBatch, stmt: COMMIT TRANSACTION, connCtx: 0xc004d6afc0, hijackedCtx: %!p(<nil>), txn ctx: %!p(<nil>)
goroutine 4050 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x7526c00, 0xed6a33a2e, 0x0, 0x12)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8
github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0x75238c0, 0xc000000004, 0x6a6cbec, 0x14, 0x877, 0xc005fa2640, 0x12a)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:210 +0xa29
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x4bd0120, 0xc0000f0090, 0x4, 0x2, 0x440fe37, 0xc2, 0xc0064994b0, 0x9, 0x9)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2c9
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x4bd0120, 0xc0000f0090, 0x1, 0x4, 0x440fe37, 0xc2, 0xc0064994b0, 0x9, 0x9)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:44 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(...)
/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:155
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).txnStateTransitionsApplyWrapper(0xc003fd7000, 0x4b4d220, 0xc00a91e390, 0x3edc4e0, 0xc00a91e3a0, 0x7f109d4f8ea8, 0xc0087ac680, 0x3b, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:2167 +0x8e0
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd(0xc003fd7000, 0x4bd00e0, 0xc005c4dc00, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1497 +0x952
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc003fd7000, 0x4bd00e0, 0xc004d6afc0, 0xc000100100, 0x5400, 0x15000, 0xc000100198, 0xc00acbbb30, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1297 +0x1f2
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc000632680, 0x4bd00e0, 0xc004d6afc0, 0xc003fd7000, 0x5400, 0x15000, 0xc000100198, 0xc00acbbb30, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:476 +0x104
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1(0xc00581ed69, 0xc003cce600, 0x4bd00e0, 0xc004d6afc0, 0xc00acbbb30, 0xc000632680, 0xc0087ac000, 0x4bf3de0, 0xc0011c2c30, 0xc004e4b4a0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:591 +0x2e8
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync
/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:523 +0x17e
spaskob commented 4 years ago

The full logs are at the zendesk ticket: https://cockroachdb.zendesk.com/agent/tickets/5806

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!