cockroachdb / cockroach

roachtest: splits/load/spanning/nodes=4/obj=cpu failed #103090

Closed cockroach-teamcity closed 1 year ago

cockroach-teamcity commented 1 year ago

roachtest.splits/load/spanning/nodes=4/obj=cpu failed with artifacts on master @ 992b8aa4eea4898c8b0ee83a1da289bc1933b91a:

test artifacts and logs in: /artifacts/splits/load/spanning/nodes=4/obj=cpu/run_1
(monitor.go:127).Wait: monitor failure: monitor task failed: kv.kv has 6 ranges, expected between 2 and 5 splits

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)

See: [How To Investigate (internal)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

/cc @cockroachdb/kv-triage

Jira issue: CRDB-27827

tbg commented 1 year ago

https://github.com/cockroachdb/cockroach/blob/fd33b0a3f5daeb6045e4c9ec925db8ef8ca38ca8/pkg/cmd/roachtest/tests/split.go#L280-L297

$ cockroach debug merge-logs logs/*.unredacted | grep 'initiating a split' | grep -v 'span config'
teamcity-10025419-1683782395-02-n4cpu4-0001> I230511 05:37:35.070249 199814 kv/kvserver/replica_command.go:412 ⋮ [T1,n1,split,s1,r64/1:‹/{Table/106-Max}›] 208 initiating a split of this range at key /Table/106/1/‹522779273917878259› [r65] (‹load at key /Table/106/1/522779273917878259 (cpu 1.2s, 1282.11 batches/sec, 51.08 raft mutations/sec)›)‹›
teamcity-10025419-1683782395-02-n4cpu4-0001> I230511 05:37:46.125156 299227 kv/kvserver/replica_command.go:412 ⋮ [T1,n1,split,s1,r65/1:‹/{Table/106/1/…-Max}›] 217 initiating a split of this range at key /Table/106/2 [r66] (‹load at key /Table/106/2 (cpu 750ms, 872.89 batches/sec, 27.15 raft mutations/sec)›)‹›
teamcity-10025419-1683782395-02-n4cpu4-0001> I230511 05:37:46.135101 299050 kv/kvserver/replica_command.go:412 ⋮ [T1,n1,split,s1,r64/1:‹/Table/106{-/1/5227…}›] 218 initiating a split of this range at key /Table/106/1/‹-1908700725658152825› [r67] (‹load at key /Table/106/1/-1908700725658152825 (cpu 894ms, 959.78 batches/sec, 28.96 raft mutations/sec)›)‹›
teamcity-10025419-1683782395-02-n4cpu4-0003> I230511 05:38:08.866451 229216 kv/kvserver/replica_command.go:412 ⋮ [T1,n3,split,s3,r64/4:‹/Table/106{-/1/-190…}›] 139 initiating a split of this range at key /Table/106/1/‹-6202310547831847912› [r84] (‹load at key /Table/106/1/-6202310547831847912 (cpu 1.2s, 1610.52 batches/sec, 28.25 raft mutations/sec)›)‹›
teamcity-10025419-1683782395-02-n4cpu4-0002> I230511 05:38:20.871923 265695 kv/kvserver/replica_command.go:412 ⋮ [T1,n2,split,s2,r67/2:‹/Table/106/1/{-19087…-522779…}›] 178 initiating a split of this range at key /Table/106/1/‹-133697430637084912› [r94] (‹load at key /Table/106/1/-133697430637084912 (cpu 487ms, 1355.69 batches/sec, 7.09 raft mutations/sec)›)‹›
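
For anyone who wants to do the same count programmatically, here is a rough Go sketch of the filter the grep pipeline above applies: keep lines containing `initiating a split`, drop those mentioning `span config`, and count what is left. The function name `countLoadSplits` and the input strings are hypothetical placeholders, not real log output.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countLoadSplits counts lines that report a split being initiated,
// skipping any line that mentions span configs. This mirrors the two
// grep filters in the pipeline above.
func countLoadSplits(log string) int {
	n := 0
	sc := bufio.NewScanner(strings.NewReader(log))
	// Merged log lines can be long, so raise the line limit to 1 MiB.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, "initiating a split") &&
			!strings.Contains(line, "span config") {
			n++
		}
	}
	return n
}

func main() {
	// Synthetic placeholder lines, for illustration only.
	log := strings.Join([]string{
		"a initiating a split b (load)",
		"c initiating a split d (span config)",
		"e unrelated message",
	}, "\n")
	fmt.Println(countLoadSplits(log)) // prints 1
}
```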

@kvoli could you take a look at what's expected here? My intuition is that there's enough randomness in this test to occasionally see a couple more splits than hard-coded in the test.

kvoli commented 1 year ago

I derived the expected split bounds experimentally, over a hundred runs of this test, and then widened them slightly beyond the max/min of those samples.
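
A minimal sketch of that approach, assuming the bounds are simply the observed min and max padded by a fixed amount. The helper `boundsWithSlack`, the slack value, and the sample counts are all hypothetical, not the numbers from those runs.

```go
package main

import "fmt"

// boundsWithSlack returns the min and max of samples, widened by slack on
// each side. The lower bound is clamped at zero because a split count
// cannot be negative. It panics on empty input.
func boundsWithSlack(samples []int, slack int) (lo, hi int) {
	lo, hi = samples[0], samples[0]
	for _, s := range samples[1:] {
		if s < lo {
			lo = s
		}
		if s > hi {
			hi = s
		}
	}
	lo -= slack
	if lo < 0 {
		lo = 0
	}
	return lo, hi + slack
}

func main() {
	// Hypothetical split counts from repeated runs; not real data.
	samples := []int{3, 5, 4, 6, 4}
	lo, hi := boundsWithSlack(samples, 1)
	fmt.Println(lo, hi) // prints 2 7
}
```

Bounds derived this way only cover what the sampled runs happened to show, so an occasional run just outside them is not surprising.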

You're right, there is probably enough randomness for more splits to occur, between the workload runner selecting start keys, the weighted split finder's reservoir sampling, and CPU usage differing for the same span request.

The goal of the test is to guard against regressions that produce an outlandish (amplifying) number of splits, or no splits at all.

In this case there were 5 splits, which seems fine and is also within the bounds stated in the logged error message:

> has 6 ranges, expected between 2 and 5 splits

I'll open a PR to bump the expected range higher and fix the error message.
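
Roughly, the message fix amounts to comparing like with like. A sketch, assuming the table starts as a single range so that every split adds exactly one range; `checkSplits` and its message wording are hypothetical, not the code in `split.go`.

```go
package main

import "fmt"

// checkSplits converts a range count into a split count and compares it
// against inclusive bounds. It assumes the table started as one range, so
// splits = ranges - 1.
func checkSplits(ranges, minSplits, maxSplits int) error {
	splits := ranges - 1
	if splits < minSplits || splits > maxSplits {
		return fmt.Errorf(
			"has %d splits (%d ranges), expected between %d and %d splits",
			splits, ranges, minSplits, maxSplits)
	}
	return nil
}

func main() {
	// 6 ranges is 5 splits, which is inside [2, 5].
	fmt.Println(checkSplits(6, 2, 5)) // prints <nil>
	// 8 ranges is 7 splits, which is outside [2, 5].
	fmt.Println(checkSplits(8, 2, 5))
}
```

Reporting both numbers in the message avoids the ranges-versus-splits ambiguity when reading a failure.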