cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

server: NPE in span_stats_server when calling spanStatsFanOut #132130

Closed kyle-a-wong closed 3 days ago

kyle-a-wong commented 1 week ago

Describe the problem

Please describe the issue you observed, and any steps we can take to reproduce it:

In v23.2.10, when operating on an unfinalized cluster, nodes restart because of an NPE (stack trace below):

github.com/cockroachdb/cockroach/pkg/server.(*systemStatusServer).spanStatsFanOut.func3(0x3b9bc28?, {0x60d27e0?, 0xc1cc24ae60?})
        github.com/cockroachdb/cockroach/pkg/server/span_stats_server.go:116 +0x114
github.com/cockroachdb/cockroach/pkg/server.(*statusServer).iterateNodes(0xc020471780, {0x7a7a7e8, 0xc033fa4930}, {0x640986c, 0x1e}, 0xdf8475800, 0xc0a72bb650, 0xc0a72bb668, 0xc003b9bde0, 0xc003b9bdf8)
        github.com/cockroachdb/cockroach/pkg/server/status.go:3141 +0x557
github.com/cockroachdb/cockroach/pkg/server.(*systemStatusServer).spanStatsFanOut(0xc02040a8c0, {0x7a7a7e8?, 0xc033fa4930}, 0xc3423e83c0)
        github.com/cockroachdb/cockroach/pkg/server/span_stats_server.go:138 +0x41b
github.com/cockroachdb/cockroach/pkg/server.(*systemStatusServer).getSpanStatsInternal(0x594c2a0?, {0x7a7a7e8, 0xc033fa4930}, 0xc3423e83c0)
        github.com/cockroachdb/cockroach/pkg/server/span_stats_server.go:287 +0x38
github.com/cockroachdb/cockroach/pkg/server.batchedSpanStats({0x7a7a7e8, 0xc033fa4930}, 0xc3423e83c0, 0xc003b9c108, 0x3e8)
        github.com/cockroachdb/cockroach/pkg/server/span_stats_server.go:464 +0x2ce
github.com/cockroachdb/cockroach/pkg/server.(*systemStatusServer).SpanStats(0xc02040a8c0, {0x7a7a7e8?, 0xc033fa47e0?}, 0x7a7a7e8?)
        github.com/cockroachdb/cockroach/pkg/server/status.go:3673 +0x127
github.com/cockroachdb/cockroach/pkg/sql.(*planner).SpanStats(0xc1997b6670, {0x7a7a7e8, 0xc033fa47e0}, {0xc0076f0000, 0x1d684, 0x21155})
        github.com/cockroachdb/cockroach/pkg/sql/planner.go:943 +0xb1
github.com/cockroachdb/cockroach/pkg/sql/sem/builtins.(*spanStatsValueGenerator).Start(0xc300bcf0a0, {0x7a7a7e8?, 0xc033fa47e0?}, 0x61b8680?)
        github.com/cockroachdb/cockroach/pkg/sql/sem/builtins/generator_builtins.go:3473 +0x3c

This code hasn't been changed since v23.2.10, so it seems likely that this same NPE can occur in more recent versions as well.

To Reproduce

// TODO

Additional data / screenshots

The stack trace above points to this line in the code: https://github.com/cockroachdb/cockroach/blob/c68c559859be738efead9971f5e11f62a8c69d06/pkg/server/span_stats_server.go#L147

So the two suspected culprits for the NPE are res.SpanToStats[spanStr] and spanStats.TotalStats: either one could be nil when it is dereferenced on that line.
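To make the two suspects concrete, here is a minimal sketch of how either one could trigger a nil pointer dereference, along with a guarded variant. It is not the actual code in span_stats_server.go. The types Stats, SpanStats and SpanStatsResponse are simplified stand-ins, and the functions mergeUnsafe and mergeSafe are hypothetical names introduced only for illustration. The sketch assumes SpanToStats maps span strings to pointers and that TotalStats is a pointer field.

```go
package main

// Stats, SpanStats and SpanStatsResponse are simplified, hypothetical
// stand-ins for the real types; only the fields named in this issue are
// modeled, and the pointer-ness of each field is an assumption.
type Stats struct {
	KeyCount int64
}

type SpanStats struct {
	TotalStats *Stats // assumed to be a pointer, so it may be nil
	RangeCount int32
}

type SpanStatsResponse struct {
	SpanToStats map[string]*SpanStats
}

// mergeUnsafe illustrates the suspected failure mode: it dereferences the
// map entry and TotalStats without checking either for nil. A missing key in
// res.SpanToStats yields a nil *SpanStats, and a nil TotalStats on either
// side panics as well.
func mergeUnsafe(res, nodeResp *SpanStatsResponse) {
	for spanStr, spanStats := range nodeResp.SpanToStats {
		res.SpanToStats[spanStr].TotalStats.KeyCount += spanStats.TotalStats.KeyCount
	}
}

// mergeSafe guards every pointer before use. Entries from nodeResp that
// cannot be merged are skipped and their span strings are returned so the
// caller can log or report them.
func mergeSafe(res, nodeResp *SpanStatsResponse) (skipped []string) {
	if res.SpanToStats == nil {
		res.SpanToStats = make(map[string]*SpanStats)
	}
	for spanStr, spanStats := range nodeResp.SpanToStats {
		if spanStats == nil || spanStats.TotalStats == nil {
			skipped = append(skipped, spanStr)
			continue
		}
		dst := res.SpanToStats[spanStr]
		if dst == nil {
			dst = &SpanStats{}
			res.SpanToStats[spanStr] = dst
		}
		if dst.TotalStats == nil {
			dst.TotalStats = &Stats{}
		}
		dst.TotalStats.KeyCount += spanStats.TotalStats.KeyCount
		dst.RangeCount += spanStats.RangeCount
	}
	return skipped
}
```

Whether skipping, returning an error, or pre-populating the result map is the right behavior depends on how the fan-out is expected to handle responses from nodes running a different version; the sketch only shows where the nil checks would have to go.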

Environment: The issue was experienced in v23.2.10, but it is likely that this bug exists in the main branch as well.

Additional context: What was the impact? Nodes were restarting due to panics from the NPE.

Jira issue: CRDB-42842

blathers-crl[bot] commented 1 week ago

Hi @kyle-a-wong, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.