cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: splits/largerange/size=32GiB,nodes=6 failed #123642

Closed cockroach-teamcity closed 1 month ago

cockroach-teamcity commented 4 months ago

roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on release-23.2 @ a36883a03ce0b30af0c025ff2b0de2f3845ad8d7:

(monitor.go:153).Wait: monitor failure: bank table split over 359 ranges, expected at least 509
test artifacts and logs in: /artifacts/splits/largerange/size=32GiB_nodes=6/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-38447

andrewbaptist commented 4 months ago

The cockroach process was killed by the OOM killer, but I'm not exactly sure why.

From the test logs:

14:20:28 monitor.go:177: Monitor event: n6: cockroach process for system interface died (exit code 15)
14:20:28 test_impl.go:414: test failure #1: full stack retained in failure_1.log: (monitor.go:153).Wait: monitor failure: bank table split over 359 ranges, expected at least 509

From the n6 cockroach log, the last couple of lines are:

W240505 14:20:12.024960 749 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n6,rnode=4,raddr=‹10.142.2.6:29000›,class=system,rpc] 217  latency jump (prev avg 0.47ms, current 1277.86ms)
I240505 14:20:12.320505 260 kv/kvserver/replica_raft.go:1587 ⋮ [T1,Vsystem,n6,s6,r3/3:‹/System/{NodeLive…-tsd}›,raft] 218  slow non-blocking raft commit: commit-wait 1.568110763s sem 649ns
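To put the timeline in numbers, the sketch below parses the header of the last n6 log line and measures the gap to the monitor's death event at 14:20:28. It assumes the header layout seen in the lines above (severity letter, `yymmdd` date, microsecond time), and that the test log and node log share a date and clock. The helper name `parse_crdb_header` is hypothetical, and the sample line is abbreviated (structured tags dropped).

```python
import re
from datetime import datetime

# Header of a log line as it appears in this thread: severity letter,
# yymmdd date, then a microsecond-precision time,
# e.g. "W240505 14:20:12.024960".
CRDB_HEADER = re.compile(r"^([IWEF])(\d{6}) (\d{2}:\d{2}:\d{2}\.\d{6}) ")


def parse_crdb_header(line):
    """Return (severity, timestamp), or None if the line has no such header."""
    m = CRDB_HEADER.match(line)
    if m is None:
        return None
    severity, date, clock = m.groups()
    return severity, datetime.strptime(date + " " + clock, "%y%m%d %H:%M:%S.%f")


# Last line of the n6 log, abbreviated for the example.
last_line = (
    "I240505 14:20:12.320505 260 kv/kvserver/replica_raft.go:1587 "
    "slow non-blocking raft commit: commit-wait 1.568110763s sem 649ns"
)
severity, last_logged = parse_crdb_header(last_line)

# Assumption: the monitor's "14:20:28" is on the same date and clock as the node log.
died_at = datetime.strptime("240505 14:20:28", "%y%m%d %H:%M:%S")
gap_seconds = (died_at - last_logged).total_seconds()
print(f"{severity} {last_logged.isoformat()} gap={gap_seconds:.3f}s")
```

Under those assumptions the node logged nothing for roughly 15 to 16 seconds before the monitor saw it die.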

From the kernel log we see:

May 05 14:20:28 teamcity-15118476-1714888324-60-n6cpu4-0006 systemd[1]: cockroach-system.service: A process of this unit has been killed by the OOM killer.
May 05 14:20:28 teamcity-15118476-1714888324-60-n6cpu4-0006 kernel: cockroach invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
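When checking other nodes or runs for the same symptom, the kernel's `invoked oom-killer` line can be picked out mechanically. The following is a minimal sketch, assuming the journal line layout shown above; `parse_oom_line` is a hypothetical helper, not part of roachtest.

```python
import re

# Matches the kernel's "<comm> invoked oom-killer: key=value, key=value, ..." line.
OOM_INVOKED = re.compile(
    r"kernel: (?P<comm>\S+) invoked oom-killer: (?P<fields>.*)$"
)


def parse_oom_line(line):
    """Return the invoking command and its key=value fields, or None if no match."""
    m = OOM_INVOKED.search(line)
    if m is None:
        return None
    info = {"comm": m.group("comm")}
    for part in m.group("fields").split(", "):
        key, _, value = part.partition("=")
        info[key] = value
    return info


sample = (
    "May 05 14:20:28 teamcity-15118476-1714888324-60-n6cpu4-0006 kernel: "
    "cockroach invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, "
    "oom_score_adj=0"
)
print(parse_oom_line(sample))
```

Note that the process named in this line is the one whose allocation triggered the OOM killer, which is not necessarily the process that was killed. Here the systemd line confirms that a process in `cockroach-system.service` was the victim.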

I'm attaching memprof.2024-05-05T14_20_06.801.849512416.pprof.gz, which was taken before the crash. Almost all of the memory appears to be in the handling of adding snapshots:

[image: view of memprof.2024-05-05T14_20_06.801.849512416.pprof.gz]

We should look at addressing this as part of the throttling work for 24.2. I'm not sure why we had just this one failure.

cockroach-teamcity commented 2 months ago

roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on release-23.2 @ bd06d2ca3bf66b8924eeb4f34ac81d42ceffe072:

(sql_runner.go:260).Query: error executing 'SHOW RANGES FROM TABLE bank.bank': read tcp 172.17.0.3:48582 -> 35.190.182.21:29000: read: connection reset by peer
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/splits/largerange/size=32GiB_nodes=6/run_1

github-actions[bot] commented 1 month ago

We have marked this test failure issue as stale because it has been inactive for 1 month. If this failure is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the test failure queue tidy.