Closed cockroach-teamcity closed 1 month ago
The cockroach process was killed by the OOM killer, but I'm not exactly sure why.
From the test logs:
14:20:28 monitor.go:177: Monitor event: n6: cockroach process for system interface died (exit code 15)
14:20:28 test_impl.go:414: test failure #1: full stack retained in failure_1.log: (monitor.go:153).Wait: monitor failure: bank table split over 359 ranges, expected at least 509
From the n6 cockroach log, the last couple of lines are:
W240505 14:20:12.024960 749 2@rpc/clock_offset.go:291 ⋮ [T1,Vsystem,n6,rnode=4,raddr=‹10.142.2.6:29000›,class=system,rpc] 217 latency jump (prev avg 0.47ms, current 1277.86ms)
I240505 14:20:12.320505 260 kv/kvserver/replica_raft.go:1587 ⋮ [T1,Vsystem,n6,s6,r3/3:‹/System/{NodeLive…-tsd}›,raft] 218 slow non-blocking raft commit: commit-wait 1.568110763s sem 649ns
From the kernel log we see:
May 05 14:20:28 teamcity-15118476-1714888324-60-n6cpu4-0006 systemd[1]: cockroach-system.service: A process of this unit has been killed by the OOM killer.
May 05 14:20:28 teamcity-15118476-1714888324-60-n6cpu4-0006 kernel: cockroach invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
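For triaging similar failures, here is a minimal Go sketch that scans kernel log text for OOM-killer invocations like the one above. The names `oomEvent`, `oomRE`, and `findOOMEvents` are hypothetical helpers for illustration and are not part of roachtest. The regular expression assumes the journald-style `<month> <day> <time> <host> kernel: <process> invoked oom-killer:` layout seen in these lines, so other log formats would need a different pattern.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// oomEvent is a hypothetical record type used only in this sketch.
type oomEvent struct {
	Host    string // host name as printed in the log line
	Process string // process that invoked the OOM killer
	Line    string // full original log line
}

// oomRE assumes lines shaped like:
//   May 05 14:20:28 <host> kernel: <process> invoked oom-killer: ...
var oomRE = regexp.MustCompile(`^\S+ \d+ [\d:]+ (\S+) kernel: (\S+) invoked oom-killer:`)

// findOOMEvents returns one event per line that matches oomRE.
func findOOMEvents(log string) []oomEvent {
	var events []oomEvent
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		line := sc.Text()
		if m := oomRE.FindStringSubmatch(line); m != nil {
			events = append(events, oomEvent{Host: m[1], Process: m[2], Line: line})
		}
	}
	return events
}

func main() {
	// The two kernel log lines quoted in this issue.
	log := `May 05 14:20:28 teamcity-15118476-1714888324-60-n6cpu4-0006 systemd[1]: cockroach-system.service: A process of this unit has been killed by the OOM killer.
May 05 14:20:28 teamcity-15118476-1714888324-60-n6cpu4-0006 kernel: cockroach invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0`
	for _, e := range findOOMEvents(log) {
		fmt.Printf("host=%s process=%s\n", e.Host, e.Process)
	}
}
```

Only the `kernel:` line is matched. The `systemd` line reports the same kill but does not name the invoking process, so the sketch skips it.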
I'm attaching memprof.2024-05-05T14_20_06.801.849512416.pprof.gz, taken before the crash. It appears that almost all of the memory is used in the handling of adding snapshots.
We should look at addressing this as part of the throttling work for 24.2. I'm not sure why we had this one failure.
roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on release-23.2 @ bd06d2ca3bf66b8924eeb4f34ac81d42ceffe072:
(sql_runner.go:260).Query: error executing 'SHOW RANGES FROM TABLE bank.bank': read tcp 172.17.0.3:48582 -> 35.190.182.21:29000: read: connection reset by peer
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/splits/largerange/size=32GiB_nodes=6/run_1
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=4
ROACHTEST_encrypted=false
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=0
See: roachtest README
See: How To Investigate (internal)
See: Grafana
We have marked this test failure issue as stale because it has been inactive for 1 month. If this failure is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the test failure queue tidy.
roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on release-23.2 @ a36883a03ce0b30af0c025ff2b0de2f3845ad8d7:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=4
ROACHTEST_encrypted=false
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=0
/cc @cockroachdb/kv-triage
Jira issue: CRDB-38447