roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 8d34ef1ea15850ee1c70470610b6652df4c317de:
| 664.0s 0 1.0 18.3 13421.8 13421.8 13421.8 13421.8 stockLevel
| 665.0s 0 0.0 18.4 0.0 0.0 0.0 0.0 delivery
| 665.0s 0 4.0 184.8 17179.9 19327.4 19327.4 19327.4 newOrder
| 665.0s 0 0.0 18.4 0.0 0.0 0.0 0.0 orderStatus
| 665.0s 0 0.0 182.9 0.0 0.0 0.0 0.0 payment
| 665.0s 0 0.0 18.3 0.0 0.0 0.0 0.0 stockLevel
| 666.0s 0 0.0 18.3 0.0 0.0 0.0 0.0 delivery
| 666.0s 0 0.0 184.5 0.0 0.0 0.0 0.0 newOrder
| 666.0s 0 0.0 18.4 0.0 0.0 0.0 0.0 orderStatus
| 666.0s 0 0.0 182.6 0.0 0.0 0.0 0.0 payment
| 666.0s 0 0.0 18.3 0.0 0.0 0.0 0.0 stockLevel
Wraps: (8) COMMAND_PROBLEM
Wraps: (9) Node 5. Command with error:
| ``````
| ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json --ramp=5m0s --duration=2h0m0s --prometheus-port=2112 --pprofport=33333 {pgurl:1-4}
| ``````
Wraps: (10) exit status 1
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) errors.Cmd (9) *hintdetail.withDetail (10) *exec.ExitError
mixed_version_jobs.go:73,versionupgrade.go:178,tpcc.go:427,test_runner.go:897: monitor failure: monitor task failed: t.Fatal() was called
(1) attached stack trace
-- stack trace:
| main.(*monitorImpl).WaitE
| main/pkg/cmd/roachtest/monitor.go:115
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*backgroundStepper).wait
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/mixed_version_jobs.go:69
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*versionUpgradeTest).run
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/versionupgrade.go:178
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func2
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:427
| main.(*testRunner).runTest.func2
| main/pkg/cmd/roachtest/test_runner.go:897
Wraps: (2) monitor failure
Wraps: (3) attached stack trace
-- stack trace:
| main.(*monitorImpl).wait.func2
| main/pkg/cmd/roachtest/monitor.go:171
Wraps: (4) monitor task failed
Wraps: (5) attached stack trace
-- stack trace:
| main.init
| main/pkg/cmd/roachtest/monitor.go:80
| runtime.doInit
| GOROOT/src/runtime/proc.go:6498
| runtime.main
| GOROOT/src/runtime/proc.go:238
| runtime.goexit
| GOROOT/src/runtime/asm_amd64.s:1581
Wraps: (6) t.Fatal() was called
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=16, ROACHTEST_ssd=0
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)
- #74892 roachtest: tpcc/mixed-headroom/n5cpu16 failed [OOM during import while running 21.2] [C-test-failure O-roachtest O-robot T-bulkio branch-release-21.2]
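The tail of the workload output in this report shows throughput dropping to zero before the command exits. A minimal sketch for spotting such seconds in a saved log follows. It assumes the column order from the header row the workload prints elsewhere in this thread (`_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)`). The helper names and the definition of a "stall" (every operation type reporting 0.0 instantaneous ops/sec for that second) are illustrative assumptions, not part of roachtest.

```python
from collections import defaultdict

# Rows copied from the first failure report in this thread.
SAMPLE = """\
| 664.0s 0 1.0 18.3 13421.8 13421.8 13421.8 13421.8 stockLevel
| 665.0s 0 0.0 18.4 0.0 0.0 0.0 0.0 delivery
| 665.0s 0 4.0 184.8 17179.9 19327.4 19327.4 19327.4 newOrder
| 665.0s 0 0.0 18.4 0.0 0.0 0.0 0.0 orderStatus
| 665.0s 0 0.0 182.9 0.0 0.0 0.0 0.0 payment
| 665.0s 0 0.0 18.3 0.0 0.0 0.0 0.0 stockLevel
| 666.0s 0 0.0 18.3 0.0 0.0 0.0 0.0 delivery
| 666.0s 0 0.0 184.5 0.0 0.0 0.0 0.0 newOrder
| 666.0s 0 0.0 18.4 0.0 0.0 0.0 0.0 orderStatus
| 666.0s 0 0.0 182.6 0.0 0.0 0.0 0.0 payment
| 666.0s 0 0.0 18.3 0.0 0.0 0.0 0.0 stockLevel
"""


def parse_row(line):
    """Return (elapsed_seconds, op_name, inst_ops_per_sec), or None for non-data lines."""
    parts = line.lstrip("| ").split()
    # A data row has 8 numeric columns followed by the operation name.
    if len(parts) != 9 or not parts[0].endswith("s"):
        return None
    return float(parts[0][:-1]), parts[8], float(parts[2])


def stalled_seconds(text):
    """Return the elapsed times at which every reported operation had zero throughput."""
    rates_by_second = defaultdict(list)
    for line in text.splitlines():
        row = parse_row(line)
        if row is not None:
            elapsed, _op, inst = row
            rates_by_second[elapsed].append(inst)
    return sorted(t for t, rates in rates_by_second.items() if all(r == 0.0 for r in rates))


print(stalled_seconds(SAMPLE))
```

In this excerpt only the final second qualifies. At 665.0s `newOrder` still completes a few operations, with a p50 latency above 17 seconds.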
Node 3 OOMed (node 2 in the second failure):
[ 7780.081918] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cockroach.service,task=cockroach,pid=2867347,uid=1000
[ 7780.082013] Out of memory: Killed process 2867347 (cockroach) total-vm:21357384kB, anon-rss:13339668kB, file-rss:1236kB, shmem-rss:0kB, UID:1000 pgtables:38656kB oom_score_adj:0
[ 7780.734170] oom_reaper: reaped process 2867347 (cockroach), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
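To put the kernel's numbers in more familiar units, here is a small sketch that extracts the `kB` fields from the "Out of memory" line above and converts them to GiB. The regular expression and the function name are illustrative assumptions. The only input is the log line quoted in this comment.

```python
import re

# Kernel log line copied from the report above.
OOM_LINE = (
    "[ 7780.082013] Out of memory: Killed process 2867347 (cockroach) "
    "total-vm:21357384kB, anon-rss:13339668kB, file-rss:1236kB, "
    "shmem-rss:0kB, UID:1000 pgtables:38656kB oom_score_adj:0"
)

# Matches fields of the form "name:<digits>kB", e.g. "anon-rss:13339668kB".
FIELD_RE = re.compile(r"([a-z-]+):(\d+)kB")


def parse_oom_kill(line):
    """Return a dict mapping each kB-valued field name to its size in GiB."""
    return {name: int(kb) / (1024 * 1024) for name, kb in FIELD_RE.findall(line)}


fields = parse_oom_kill(OOM_LINE)
for name, gib in fields.items():
    print(f"{name}: {gib:.2f} GiB")
```

For this line, `anon-rss` works out to roughly 12.7 GiB of anonymous memory held by the cockroach process when it was killed.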
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 13cb2f6c40e3146fed8d931f65f89da9b42ce2c1:
| 17.0s 0 0.0 8.2 0.0 0.0 0.0 0.0 stockLevel
| 18.0s 0 3.0 6.9 8053.1 10737.4 10737.4 10737.4 delivery
| 18.0s 0 82.1 65.6 10737.4 13958.6 15569.3 18253.6 newOrder
| 18.0s 0 5.0 8.0 6442.5 7247.8 7247.8 7247.8 orderStatus
| 18.0s 0 48.1 74.9 10737.4 11811.2 12884.9 12884.9 payment
| 18.0s 0 0.0 7.8 0.0 0.0 0.0 0.0 stockLevel
| 19.0s 0 5.0 6.8 10200.5 10200.5 10200.5 10200.5 delivery
| 19.0s 0 22.0 63.3 11811.2 14495.5 15032.4 15032.4 newOrder
| 19.0s 0 0.0 7.6 0.0 0.0 0.0 0.0 orderStatus
| 19.0s 0 52.0 73.7 10737.4 11811.2 11811.2 12348.0 payment
| 19.0s 0 0.0 7.4 0.0 0.0 0.0 0.0 stockLevel
Wraps: (8) COMMAND_PROBLEM
Wraps: (9) Node 5. Command with error:
| ``````
| ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json --ramp=5m0s --duration=2h0m0s --prometheus-port=2112 --pprofport=33333 {pgurl:1-4}
| ``````
Wraps: (10) exit status 1
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) errors.Cmd (9) *hintdetail.withDetail (10) *exec.ExitError
mixed_version_jobs.go:73,versionupgrade.go:178,tpcc.go:427,test_runner.go:896: monitor failure: monitor task failed: t.Fatal() was called
(1) attached stack trace
-- stack trace:
| main.(*monitorImpl).WaitE
| main/pkg/cmd/roachtest/monitor.go:115
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*backgroundStepper).wait
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/mixed_version_jobs.go:69
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*versionUpgradeTest).run
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/versionupgrade.go:178
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func2
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:427
| main.(*testRunner).runTest.func2
| main/pkg/cmd/roachtest/test_runner.go:896
Wraps: (2) monitor failure
Wraps: (3) attached stack trace
-- stack trace:
| main.(*monitorImpl).wait.func2
| main/pkg/cmd/roachtest/monitor.go:171
Wraps: (4) monitor task failed
Wraps: (5) attached stack trace
-- stack trace:
| main.init
| main/pkg/cmd/roachtest/monitor.go:80
| runtime.doInit
| GOROOT/src/runtime/proc.go:6498
| runtime.main
| GOROOT/src/runtime/proc.go:238
| runtime.goexit
| GOROOT/src/runtime/asm_amd64.s:1581
Wraps: (6) t.Fatal() was called
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=16, ROACHTEST_ssd=0
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 457d724622e4fa2e62d6f4e7926509dbc7d18511:
| 785.0s 0 0.0 18.9 0.0 0.0 0.0 0.0 stockLevel
| 786.0s 0 0.0 18.8 0.0 0.0 0.0 0.0 delivery
| 786.0s 0 0.0 189.1 0.0 0.0 0.0 0.0 newOrder
| 786.0s 0 0.0 18.9 0.0 0.0 0.0 0.0 orderStatus
| 786.0s 0 0.0 188.5 0.0 0.0 0.0 0.0 payment
| 786.0s 0 0.0 18.9 0.0 0.0 0.0 0.0 stockLevel
| 787.0s 0 0.0 18.8 0.0 0.0 0.0 0.0 delivery
| 787.0s 0 0.0 188.8 0.0 0.0 0.0 0.0 newOrder
| 787.0s 0 0.0 18.9 0.0 0.0 0.0 0.0 orderStatus
| 787.0s 0 0.0 188.2 0.0 0.0 0.0 0.0 payment
| 787.0s 0 0.0 18.9 0.0 0.0 0.0 0.0 stockLevel
Wraps: (8) COMMAND_PROBLEM
Wraps: (9) Node 5. Command with error:
| ``````
| ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json --ramp=5m0s --duration=2h0m0s --prometheus-port=0 --pprofport=33333 {pgurl:1-4}
| ``````
Wraps: (10) exit status 1
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) errors.Cmd (9) *hintdetail.withDetail (10) *exec.ExitError
mixed_version_jobs.go:73,versionupgrade.go:188,tpcc.go:433,test_runner.go:896: monitor failure: monitor task failed: t.Fatal() was called
(1) attached stack trace
-- stack trace:
| main.(*monitorImpl).WaitE
| main/pkg/cmd/roachtest/monitor.go:115
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*backgroundStepper).wait
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/mixed_version_jobs.go:69
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*versionUpgradeTest).run
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/versionupgrade.go:188
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerTPCC.func2
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/tpcc.go:433
| main.(*testRunner).runTest.func2
| main/pkg/cmd/roachtest/test_runner.go:896
Wraps: (2) monitor failure
Wraps: (3) attached stack trace
-- stack trace:
| main.(*monitorImpl).wait.func2
| main/pkg/cmd/roachtest/monitor.go:171
Wraps: (4) monitor task failed
Wraps: (5) attached stack trace
-- stack trace:
| main.init
| main/pkg/cmd/roachtest/monitor.go:80
| runtime.doInit
| GOROOT/src/runtime/proc.go:6498
| runtime.main
| GOROOT/src/runtime/proc.go:238
| runtime.goexit
| GOROOT/src/runtime/asm_amd64.s:1581
Wraps: (6) t.Fatal() was called
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=16, ROACHTEST_ssd=0
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 773f7d4445ce3e0e806b7a182adba70a0f270f19:
| 298.0s 0 168.0 93.8 31.5 44.0 48.2 60.8 newOrder
| 298.0s 0 13.0 10.1 6.0 9.4 10.5 10.5 orderStatus
| 298.0s 0 201.0 100.4 18.9 32.5 50.3 56.6 payment
| 298.0s 0 15.0 10.0 27.3 65.0 92.3 92.3 stockLevel
| 299.0s 0 12.0 10.0 58.7 62.9 62.9 62.9 delivery
| 299.0s 0 203.8 94.2 33.6 65.0 79.7 83.9 newOrder
| 299.0s 0 27.0 10.2 6.8 8.1 13.6 13.6 orderStatus
| 299.0s 0 214.8 100.8 21.0 50.3 67.1 75.5 payment
| 299.0s 0 24.0 10.0 33.6 62.9 88.1 88.1 stockLevel
| 300.0s 0 18.0 10.0 60.8 92.3 100.7 100.7 delivery
| 300.0s 0 195.1 94.5 32.5 41.9 50.3 52.4 newOrder
| 300.0s 0 20.0 10.2 6.0 7.1 7.1 7.1 orderStatus
| 300.0s 0 174.1 101.0 21.0 32.5 39.8 46.1 payment
| 300.0s 0 15.0 10.1 26.2 37.7 50.3 50.3 stockLevel
| _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
| 1.0s 0 20.1 20.1 60.8 79.7 109.1 109.1 delivery
| 1.0s 0 211.3 211.4 35.7 52.4 62.9 67.1 newOrder
| 1.0s 0 26.2 26.2 6.6 8.9 8.9 8.9 orderStatus
| 1.0s 0 166.0 166.1 21.0 33.6 41.9 48.2 payment
| 1.0s 0 15.1 15.1 26.2 44.0 46.1 46.1 stockLevel
| 2.0s 0 20.0 20.1 58.7 67.1 79.7 79.7 delivery
| 2.0s 0 166.0 188.6 31.5 44.0 46.1 54.5 newOrder
| 2.0s 0 14.0 20.1 6.0 7.6 8.9 8.9 orderStatus
| 2.0s 0 214.0 190.1 19.9 28.3 33.6 33.6 payment
| 2.0s 0 17.0 16.1 32.5 48.2 71.3 71.3 stockLevel
| 3.0s 0 23.0 21.0 58.7 83.9 83.9 83.9 delivery
| 3.0s 0 175.1 184.1 32.5 39.8 46.1 46.1 newOrder
| 3.0s 0 12.0 17.4 6.3 7.6 8.1 8.1 orderStatus
| 3.0s 0 214.1 198.1 21.0 29.4 46.1 58.7 payment
| 3.0s 0 19.0 17.0 31.5 48.2 54.5 54.5 stockLevel
| 4.0s 0 14.0 19.3 60.8 88.1 92.3 92.3 delivery
| 4.0s 0 220.8 193.3 33.6 52.4 65.0 83.9 newOrder
| 4.0s 0 20.0 18.0 6.6 8.9 10.5 10.5 orderStatus
| 4.0s 0 168.9 190.8 22.0 39.8 50.3 62.9 payment
| 4.0s 0 22.0 18.3 33.6 46.1 62.9 62.9 stockLevel
| _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
| 5.0s 0 15.0 18.4 79.7 96.5 117.4 117.4 delivery
| 5.0s 0 196.1 193.9 35.7 52.4 71.3 83.9 newOrder
| 5.0s 0 11.0 16.6 7.6 11.0 11.5 11.5 orderStatus
| 5.0s 0 164.1 185.4 23.1 46.1 65.0 65.0 payment
| 5.0s 0 10.0 16.6 41.9 50.3 50.3 50.3 stockLevel
Wraps: (8) COMMAND_PROBLEM
Wraps: (9) Node 5. Command with error:
| ``````
| ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json --ramp=5m0s --duration=2h0m0s --prometheus-port=0 --pprofport=33333 {pgurl:1-4}
| ``````
Wraps: (10) exit status 1
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) errors.Cmd (9) *hintdetail.withDetail (10) *exec.ExitError
versionupgrade.go:502,versionupgrade.go:188,tpcc.go:433,test_runner.go:896: context canceled
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=16, ROACHTEST_ssd=0
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ f4042d47fa8062a612c38d4696eb6bee9cee7c21:
| 255.0s 0 150.9 87.0 19.9 31.5 41.9 46.1 payment
| 255.0s 0 19.0 8.8 25.2 48.2 52.4 52.4 stockLevel
| 256.0s 0 19.0 8.6 65.0 83.9 83.9 83.9 delivery
| 256.0s 0 196.1 80.1 35.7 46.1 50.3 56.6 newOrder
| 256.0s 0 18.0 8.9 7.1 11.0 15.7 15.7 orderStatus
| 256.0s 0 158.1 87.3 21.0 31.5 35.7 41.9 payment
| 256.0s 0 20.0 8.8 26.2 41.9 56.6 56.6 stockLevel
| _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
| 257.0s 0 21.0 8.6 56.6 65.0 67.1 67.1 delivery
| 257.0s 0 159.0 80.4 33.6 44.0 44.0 48.2 newOrder
| 257.0s 0 17.0 8.9 7.6 8.9 9.4 9.4 orderStatus
| 257.0s 0 177.0 87.7 22.0 32.5 35.7 35.7 payment
| 257.0s 0 14.0 8.9 26.2 41.9 46.1 46.1 stockLevel
| 258.0s 0 15.0 8.6 58.7 65.0 75.5 75.5 delivery
| 258.0s 0 159.0 80.7 33.6 44.0 50.3 56.6 newOrder
| 258.0s 0 18.0 8.9 6.8 8.9 9.4 9.4 orderStatus
| 258.0s 0 163.0 88.0 19.9 28.3 33.6 35.7 payment
| 258.0s 0 17.0 8.9 25.2 46.1 46.1 46.1 stockLevel
| 259.0s 0 14.0 8.7 58.7 65.0 71.3 71.3 delivery
| 259.0s 0 154.8 81.0 33.6 41.9 46.1 54.5 newOrder
| 259.0s 0 18.0 9.0 6.0 10.5 13.1 13.1 orderStatus
| 259.0s 0 156.8 88.2 21.0 28.3 30.4 44.0 payment
| 259.0s 0 15.0 8.9 33.6 52.4 62.9 62.9 stockLevel
| 260.0s 0 15.0 8.7 62.9 79.7 151.0 151.0 delivery
| 260.0s 0 163.1 81.3 35.7 44.0 48.2 52.4 newOrder
| 260.0s 0 15.0 9.0 6.6 8.9 10.0 10.0 orderStatus
| 260.0s 0 164.1 88.5 21.0 29.4 33.6 35.7 payment
| 260.0s 0 21.0 9.0 31.5 48.2 60.8 60.8 stockLevel
| _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
| 261.0s 0 12.0 8.7 58.7 65.0 71.3 71.3 delivery
| 261.0s 0 165.9 81.6 35.7 44.0 54.5 58.7 newOrder
| 261.0s 0 20.0 9.0 7.3 8.1 12.1 12.1 orderStatus
| 261.0s 0 184.9 88.9 22.0 30.4 37.7 50.3 payment
| 261.0s 0 30.0 9.0 26.2 46.1 48.2 48.2 stockLevel
| 262.0s 0 22.0 8.8 58.7 79.7 83.9 83.9 delivery
| 262.0s 0 159.1 81.9 35.7 48.2 54.5 54.5 newOrder
| 262.0s 0 19.0 9.1 6.8 8.1 8.4 8.4 orderStatus
| 262.0s 0 162.1 89.2 21.0 27.3 29.4 35.7 payment
| 262.0s 0 13.0 9.1 24.1 39.8 48.2 48.2 stockLevel
Wraps: (8) secondary error attachment
| UNCLASSIFIED_PROBLEM: context canceled
| (1) UNCLASSIFIED_PROBLEM
| Wraps: (2) Node 5. Command with error:
| | ``````
| | ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json --ramp=5m0s --duration=2h0m0s --prometheus-port=0 --pprofport=33333 {pgurl:1-4}
| | ``````
| Wraps: (3) context canceled
| Error types: (1) errors.Unclassified (2) *hintdetail.withDetail (3) *errors.errorString
Wraps: (9) context canceled
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) *secondary.withSecondaryError (9) *errors.errorString
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=16, ROACHTEST_ssd=0
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ a0d8839aa6164af81a9ebb140147d3baf5321287:
| 73.0s 0 4.0 3.2 8.9 10.0 10.0 10.0 orderStatus
| 73.0s 0 54.0 31.2 18.9 50.3 52.4 52.4 payment
| 73.0s 0 5.0 3.2 33.6 48.2 48.2 48.2 stockLevel
| 74.0s 0 4.0 3.5 71.3 88.1 88.1 88.1 delivery
| 74.0s 0 40.0 20.2 32.5 52.4 60.8 60.8 newOrder
| 74.0s 0 5.0 3.2 8.9 10.5 10.5 10.5 orderStatus
| 74.0s 0 48.0 31.4 16.3 30.4 37.7 37.7 payment
| 74.0s 0 5.0 3.2 37.7 41.9 41.9 41.9 stockLevel
| 75.0s 0 5.0 3.5 96.5 167.8 167.8 167.8 delivery
| 75.0s 0 46.0 20.6 48.2 92.3 109.1 109.1 newOrder
| 75.0s 0 4.0 3.2 9.4 10.5 10.5 10.5 orderStatus
| 75.0s 0 58.0 31.7 24.1 60.8 67.1 67.1 payment
| 75.0s 0 8.0 3.3 33.6 79.7 79.7 79.7 stockLevel
| 76.0s 0 5.0 3.5 83.9 121.6 121.6 121.6 delivery
| 76.0s 0 39.0 20.8 52.4 75.5 92.3 92.3 newOrder
| 76.0s 0 7.0 3.2 7.9 24.1 24.1 24.1 orderStatus
| 76.0s 0 61.0 32.1 18.9 60.8 62.9 62.9 payment
| 76.0s 0 5.0 3.3 33.6 41.9 41.9 41.9 stockLevel
| _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
| 77.0s 0 5.0 3.5 71.3 104.9 104.9 104.9 delivery
| 77.0s 0 46.0 21.1 79.7 134.2 167.8 167.8 newOrder
| 77.0s 0 2.0 3.2 7.6 9.4 9.4 9.4 orderStatus
| 77.0s 0 55.0 32.4 37.7 75.5 88.1 109.1 payment
| 77.0s 0 7.0 3.4 32.5 54.5 54.5 54.5 stockLevel
| 78.0s 0 4.0 3.6 71.3 134.2 134.2 134.2 delivery
| 78.0s 0 45.0 21.4 75.5 117.4 121.6 121.6 newOrder
| 78.0s 0 9.0 3.3 7.6 11.0 11.0 11.0 orderStatus
| 78.0s 0 50.0 32.7 25.2 79.7 104.9 104.9 payment
| 78.0s 0 3.0 3.4 30.4 54.5 54.5 54.5 stockLevel
| 79.0s 0 8.0 3.6 117.4 121.6 121.6 121.6 delivery
| 79.0s 0 45.0 21.7 75.5 121.6 159.4 159.4 newOrder
| 79.0s 0 5.0 3.3 8.9 11.0 11.0 11.0 orderStatus
| 79.0s 0 44.0 32.8 21.0 62.9 88.1 88.1 payment
| 79.0s 0 10.0 3.5 27.3 35.7 35.7 35.7 stockLevel
| 80.0s 0 6.0 3.6 104.9 117.4 117.4 117.4 delivery
| 80.0s 0 55.0 22.2 67.1 121.6 130.0 159.4 newOrder
| 80.0s 0 4.0 3.3 7.3 12.1 12.1 12.1 orderStatus
| 80.0s 0 64.0 33.2 18.9 79.7 113.2 113.2 payment
| 80.0s 0 5.0 3.5 44.0 58.7 58.7 58.7 stockLevel
Wraps: (8) secondary error attachment
| UNCLASSIFIED_PROBLEM: context canceled
| (1) UNCLASSIFIED_PROBLEM
| Wraps: (2) Node 5. Command with error:
| | ``````
| | ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json --ramp=5m0s --duration=2h0m0s --prometheus-port=0 --pprofport=33333 {pgurl:1-4}
| | ``````
| Wraps: (3) context canceled
| Error types: (1) errors.Unclassified (2) *hintdetail.withDetail (3) *errors.errorString
Wraps: (9) context canceled
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) *secondary.withSecondaryError (9) *errors.errorString
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=16, ROACHTEST_ssd=0
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ aaf50e920ceff3c2863ab96b9e3614b8434b70a8:
| 285.0s 0 12.0 9.7 7.3 7.9 16.8 16.8 orderStatus
| 285.0s 0 181.8 96.5 19.9 41.9 48.2 65.0 payment
| 285.0s 0 25.0 9.6 24.1 48.2 60.8 60.8 stockLevel
| 286.0s 0 11.0 9.6 60.8 79.7 92.3 92.3 delivery
| 286.0s 0 196.2 90.0 32.5 48.2 56.6 67.1 newOrder
| 286.0s 0 25.0 9.7 7.1 9.4 9.4 9.4 orderStatus
| 286.0s 0 192.2 96.8 19.9 30.4 41.9 52.4 payment
| 286.0s 0 24.0 9.7 24.1 50.3 54.5 54.5 stockLevel
| 287.0s 0 20.0 9.6 58.7 71.3 92.3 92.3 delivery
| 287.0s 0 172.9 90.3 32.5 52.4 58.7 92.3 newOrder
| 287.0s 0 18.0 9.7 6.3 7.9 8.4 8.4 orderStatus
| 287.0s 0 171.9 97.1 18.9 27.3 37.7 39.8 payment
| 287.0s 0 21.0 9.7 24.1 54.5 58.7 58.7 stockLevel
| 288.0s 0 14.0 9.6 56.6 71.3 75.5 75.5 delivery
| 288.0s 0 193.0 90.7 32.5 46.1 56.6 60.8 newOrder
| 288.0s 0 18.0 9.8 6.8 8.9 9.4 9.4 orderStatus
| 288.0s 0 193.0 97.4 18.9 26.2 28.3 30.4 payment
| 288.0s 0 18.0 9.7 25.2 44.0 44.0 44.0 stockLevel
| _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
| 289.0s 0 16.0 9.6 65.0 71.3 71.3 71.3 delivery
| 289.0s 0 173.0 91.0 31.5 41.9 50.3 79.7 newOrder
| 289.0s 0 17.0 9.8 7.6 9.4 10.5 10.5 orderStatus
| 289.0s 0 213.0 97.8 19.9 31.5 44.0 50.3 payment
| 289.0s 0 22.0 9.8 25.2 48.2 48.2 48.2 stockLevel
| 290.0s 0 18.0 9.7 60.8 71.3 75.5 75.5 delivery
| 290.0s 0 183.0 91.3 32.5 50.3 60.8 65.0 newOrder
| 290.0s 0 20.0 9.8 6.3 8.1 8.1 8.1 orderStatus
| 290.0s 0 195.0 98.1 19.9 32.5 46.1 52.4 payment
| 290.0s 0 10.0 9.8 27.3 44.0 44.0 44.0 stockLevel
| 291.0s 0 20.0 9.7 60.8 83.9 96.5 96.5 delivery
| 291.0s 0 172.0 91.6 31.5 46.1 54.5 58.7 newOrder
| 291.0s 0 22.0 9.9 7.3 12.1 12.6 12.6 orderStatus
| 291.0s 0 173.0 98.4 19.9 28.3 37.7 41.9 payment
| 291.0s 0 17.0 9.8 21.0 56.6 60.8 60.8 stockLevel
| 292.0s 0 20.0 9.7 60.8 75.5 109.1 109.1 delivery
| 292.0s 0 186.9 91.9 31.5 44.0 50.3 50.3 newOrder
| 292.0s 0 14.0 9.9 6.3 8.4 12.1 12.1 orderStatus
| 292.0s 0 188.9 98.7 19.9 29.4 33.6 37.7 payment
| 292.0s 0 23.0 9.8 22.0 48.2 48.2 48.2 stockLevel
Wraps: (8) secondary error attachment
| UNCLASSIFIED_PROBLEM: context canceled
| (1) UNCLASSIFIED_PROBLEM
| Wraps: (2) Node 5. Command with error:
| | ``````
| | ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json --ramp=5m0s --duration=2h0m0s --prometheus-port=0 --pprofport=33333 {pgurl:1-4}
| | ``````
| Wraps: (3) context canceled
| Error types: (1) errors.Unclassified (2) *hintdetail.withDetail (3) *errors.errorString
Wraps: (9) context canceled
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) *secondary.withSecondaryError (9) *errors.errorString
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=16, ROACHTEST_ssd=0
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 80c274877a917580af62be6eb0cd48c8c7ae9c08:
| 94.0s 0 169.8 184.4 16.8 23.1 26.2 27.3 payment
| 94.0s 0 15.0 18.3 18.9 35.7 37.7 37.7 stockLevel
| 95.0s 0 16.0 18.8 56.6 62.9 65.0 65.0 delivery
| 95.0s 0 202.0 194.8 28.3 44.0 48.2 56.6 newOrder
| 95.0s 0 32.0 18.6 6.8 9.4 10.0 10.0 orderStatus
| 95.0s 0 214.0 184.7 16.3 24.1 32.5 35.7 payment
| 95.0s 0 22.0 18.4 19.9 35.7 37.7 37.7 stockLevel
| 96.0s 0 31.0 18.9 54.5 201.3 209.7 209.7 delivery
| 96.0s 0 200.2 194.9 32.5 151.0 184.5 192.9 newOrder
| 96.0s 0 21.0 18.6 6.6 10.5 10.5 10.5 orderStatus
| 96.0s 0 186.2 184.8 18.9 104.9 130.0 176.2 payment
| 96.0s 0 20.0 18.4 15.7 41.9 46.1 46.1 stockLevel
| _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
| 97.0s 0 25.0 19.0 52.4 62.9 83.9 83.9 delivery
| 97.0s 0 214.0 195.1 29.4 39.8 44.0 48.2 newOrder
| 97.0s 0 15.0 18.6 6.6 7.9 8.9 8.9 orderStatus
| 97.0s 0 190.0 184.8 17.8 26.2 28.3 31.5 payment
| 97.0s 0 22.0 18.4 18.9 44.0 54.5 54.5 stockLevel
| 98.0s 0 18.0 19.0 52.4 56.6 58.7 58.7 delivery
| 98.0s 0 190.9 195.0 29.4 37.7 44.0 46.1 newOrder
| 98.0s 0 13.0 18.5 6.0 7.9 7.9 7.9 orderStatus
| 98.0s 0 192.9 184.9 16.3 23.1 25.2 27.3 payment
| 98.0s 0 17.0 18.4 23.1 37.7 41.9 41.9 stockLevel
| 99.0s 0 17.0 19.0 54.5 56.6 96.5 96.5 delivery
| 99.0s 0 187.1 194.9 27.3 37.7 50.3 52.4 newOrder
| 99.0s 0 30.0 18.7 6.3 9.4 10.0 10.0 orderStatus
| 99.0s 0 205.1 185.1 16.3 23.1 30.4 35.7 payment
| 99.0s 0 13.0 18.4 19.9 37.7 44.0 44.0 stockLevel
| 100.0s 0 32.0 19.1 54.5 65.0 67.1 67.1 delivery
| 100.0s 0 196.0 195.0 29.4 44.0 58.7 58.7 newOrder
| 100.0s 0 16.0 18.6 5.5 10.0 12.1 12.1 orderStatus
| 100.0s 0 184.0 185.1 17.8 24.1 39.8 44.0 payment
| 100.0s 0 19.0 18.4 17.8 27.3 37.7 37.7 stockLevel
| _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
| 101.0s 0 21.0 19.1 54.5 71.3 79.7 79.7 delivery
| 101.0s 0 183.0 194.8 29.4 44.0 52.4 56.6 newOrder
| 101.0s 0 21.0 18.7 6.3 8.1 8.4 8.4 orderStatus
| 101.0s 0 174.0 185.0 17.8 29.4 35.7 39.8 payment
| 101.0s 0 15.0 18.3 16.8 22.0 39.8 39.8 stockLevel
Wraps: (8) secondary error attachment
| UNCLASSIFIED_PROBLEM: context canceled
| (1) UNCLASSIFIED_PROBLEM
| Wraps: (2) Node 5. Command with error:
| | ``````
| | ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json --ramp=5m0s --duration=2h0m0s --prometheus-port=0 --pprofport=33333 {pgurl:1-4}
| | ``````
| Wraps: (3) context canceled
| Error types: (1) errors.Unclassified (2) *hintdetail.withDetail (3) *errors.errorString
Wraps: (9) context canceled
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) *secondary.withSecondaryError (9) *errors.errorString
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=16, ROACHTEST_ssd=0
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 524fd14da3fefcd849f44a835cc5f88f5dbdadcc:
| 286.0s 0 184.0 97.3 21.0 35.7 39.8 48.2 payment
| 286.0s 0 20.0 9.5 29.4 52.4 65.0 65.0 stockLevel
| 287.0s 0 13.0 9.6 60.8 62.9 65.0 65.0 delivery
| 287.0s 0 179.0 89.9 32.5 44.0 50.3 58.7 newOrder
| 287.0s 0 16.0 9.6 7.6 8.4 10.0 10.0 orderStatus
| 287.0s 0 183.0 97.6 19.9 30.4 37.7 41.9 payment
| 287.0s 0 15.0 9.6 26.2 41.9 52.4 52.4 stockLevel
| 288.0s 0 9.0 9.5 67.1 79.7 79.7 79.7 delivery
| 288.0s 0 174.0 90.2 35.7 60.8 75.5 83.9 newOrder
| 288.0s 0 18.0 9.7 6.3 8.9 39.8 39.8 orderStatus
| 288.0s 0 183.0 97.9 21.0 41.9 60.8 67.1 payment
| 288.0s 0 24.0 9.6 24.1 41.9 92.3 92.3 stockLevel
| _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
| 289.0s 0 26.0 9.6 60.8 88.1 100.7 100.7 delivery
| 289.0s 0 170.0 90.5 32.5 44.0 56.6 62.9 newOrder
| 289.0s 0 19.0 9.7 6.6 9.4 10.0 10.0 orderStatus
| 289.0s 0 186.0 98.2 21.0 30.4 37.7 39.8 payment
| 289.0s 0 25.0 9.7 24.1 50.3 75.5 75.5 stockLevel
| 290.0s 0 22.0 9.6 60.8 67.1 75.5 75.5 delivery
| 290.0s 0 186.0 90.8 32.5 52.4 58.7 60.8 newOrder
| 290.0s 0 12.0 9.7 6.0 8.9 11.0 11.0 orderStatus
| 290.0s 0 191.0 98.5 19.9 28.3 39.8 41.9 payment
| 290.0s 0 26.0 9.7 30.4 48.2 79.7 79.7 stockLevel
| 291.0s 0 21.0 9.7 62.9 92.3 96.5 96.5 delivery
| 291.0s 0 214.8 91.2 35.7 50.3 71.3 75.5 newOrder
| 291.0s 0 22.0 9.7 6.6 10.0 10.5 10.5 orderStatus
| 291.0s 0 172.8 98.8 22.0 32.5 44.0 60.8 payment
| 291.0s 0 20.0 9.8 27.3 48.2 52.4 52.4 stockLevel
| 292.0s 0 16.0 9.7 60.8 88.1 92.3 92.3 delivery
| 292.0s 0 189.0 91.6 31.5 44.0 58.7 60.8 newOrder
| 292.0s 0 17.0 9.8 6.8 10.5 16.3 16.3 orderStatus
| 292.0s 0 158.0 99.0 19.9 35.7 41.9 46.1 payment
| 292.0s 0 16.0 9.8 27.3 50.3 54.5 54.5 stockLevel
| _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
| 293.0s 0 27.0 9.8 60.8 88.1 96.5 96.5 delivery
| 293.0s 0 193.1 91.9 33.6 50.3 71.3 113.2 newOrder
| 293.0s 0 23.0 9.8 6.3 10.5 13.6 13.6 orderStatus
| 293.0s 0 198.1 99.3 22.0 39.8 56.6 65.0 payment
| 293.0s 0 18.0 9.8 23.1 56.6 65.0 65.0 stockLevel
Wraps: (8) secondary error attachment
| UNCLASSIFIED_PROBLEM: context canceled
| (1) UNCLASSIFIED_PROBLEM
| Wraps: (2) Node 5. Command with error:
| | ``````
| | ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json --ramp=5m0s --duration=2h0m0s --prometheus-port=0 --pprofport=33333 {pgurl:1-4}
| | ``````
| Wraps: (3) context canceled
| Error types: (1) errors.Unclassified (2) *hintdetail.withDetail (3) *errors.errorString
Wraps: (9) context canceled
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) *secondary.withSecondaryError (9) *errors.errorString
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=16, ROACHTEST_ssd=0
Artifacts are missing from all but the last failure. In that failure, node 1 exited with status code 1 ("unspecified failure"). I can't find anything else in the logs that explains why the process exited. It doesn't appear to be OOM-related, but maybe I'm missing the signs. The last entry in the log is:
I220825 15:35:46.188812 48480 upgrade/upgradecluster/cluster.go:118 ⋮ [n1,intExec=‹×›,migration-mgr] 826 executing bump-cluster-version=22.1-48 on nodes n{1,2,3,4}
I'll try to reproduce using:
GCE_PROJECT=andrei-jepsen ./pkg/cmd/roachtest/roachstress.sh -c10 -u 'tpcc/mixed-headroom/n5cpu16' -- --cpu-quota=1280
5 of those 10 runs failed, so this is reproducible. At least two failed due to an OOM.
The OOM occurred during the bank import step of the roachtest. At that time, the node which OOMed was seeing many slow raft ready iterations and appears to have been overloaded.
However, the last heap profile doesn't show anything particularly interesting:
(pprof) top
Showing nodes accounting for 779.35MB, 90.53% of 860.82MB total
Dropped 497 nodes (cum <= 4.30MB)
Showing top 10 nodes out of 140
flat flat% sum% cum cum%
173.50MB 20.16% 20.16% 173.50MB 20.16% github.com/cockroachdb/cockroach/pkg/col/coldata.(*element).setNonInlined
142.38MB 16.54% 36.70% 142.38MB 16.54% go.etcd.io/etcd/raft/v3/raftpb.(*Entry).Unmarshal
137.63MB 15.99% 52.68% 137.63MB 15.99% github.com/cockroachdb/cockroach/pkg/kv/kvserver/kvserverpb.(*ReplicatedEvalResult_AddSSTable).Unmarshal
128.14MB 14.89% 67.57% 128.14MB 14.89% github.com/cockroachdb/cockroach/pkg/kv/bulk.(*kvBuf).fits
97.50MB 11.33% 78.90% 97.50MB 11.33% github.com/cockroachdb/cockroach/pkg/roachpb.(*Value).ensureRawBytes
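As a quick cross-check of the listing above, the following sketch sums the `flat` column of the rows shown and compares the result with the 860.82MB total that pprof reports. The parsing regex and function name are illustrative assumptions. The rows and the total are copied from the output above.

```python
import re

# Rows copied from the `top` output above.
PPROF_TOP = """\
173.50MB 20.16% 20.16% 173.50MB 20.16% github.com/cockroachdb/cockroach/pkg/col/coldata.(*element).setNonInlined
142.38MB 16.54% 36.70% 142.38MB 16.54% go.etcd.io/etcd/raft/v3/raftpb.(*Entry).Unmarshal
137.63MB 15.99% 52.68% 137.63MB 15.99% github.com/cockroachdb/cockroach/pkg/kv/kvserver/kvserverpb.(*ReplicatedEvalResult_AddSSTable).Unmarshal
128.14MB 14.89% 67.57% 128.14MB 14.89% github.com/cockroachdb/cockroach/pkg/kv/bulk.(*kvBuf).fits
97.50MB 11.33% 78.90% 97.50MB 11.33% github.com/cockroachdb/cockroach/pkg/roachpb.(*Value).ensureRawBytes
"""
TOTAL_MB = 860.82  # "of 860.82MB total" in the pprof header

# flat, flat%, sum%, cum, cum%, then the function name.
ROW_RE = re.compile(r"\s*([\d.]+)MB\s+\S+\s+\S+\s+\S+\s+\S+\s+(\S+)")


def flat_sizes(text):
    """Return a list of (flat_mb, function_name) for each data row."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            rows.append((float(m.group(1)), m.group(2)))
    return rows


rows = flat_sizes(PPROF_TOP)
shown_mb = sum(size for size, _ in rows)
share = shown_mb / TOTAL_MB
print(f"{shown_mb:.2f}MB shown, {share:.2%} of {TOTAL_MB}MB")
```

The five rows shown add up to about 679MB, which agrees with the 78.90% in the `sum%` column of the last row.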
Could be #73376, which keeps popping up. Unfortunately we may not get around to addressing it for 23.1, but we're considering bumping the priority.
I was thinking along the same lines, but I also notice a clear inflection point in the rate of failures here, so something regressed about 17 days ago. I'm going to see if a bisect will lead to greater clarity.
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ e39111b2e714375faa0facc05e51e8f619a55b21:
| 283.0s 0 186.1 96.4 13.6 25.2 32.5 41.9 payment
| 283.0s 0 16.0 9.5 18.9 26.2 29.4 29.4 stockLevel
| 284.0s 0 14.0 9.5 50.3 62.9 65.0 65.0 delivery
| 284.0s 0 164.8 89.5 24.1 30.4 33.6 39.8 newOrder
| 284.0s 0 13.0 9.6 7.1 8.9 10.0 10.0 orderStatus
| 284.0s 0 182.8 96.7 13.1 16.3 22.0 24.1 payment
| 284.0s 0 14.0 9.5 17.8 23.1 28.3 28.3 stockLevel
| _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
| 285.0s 0 22.0 9.6 52.4 67.1 67.1 67.1 delivery
| 285.0s 0 189.1 89.8 24.1 31.5 41.9 71.3 newOrder
| 285.0s 0 16.0 9.6 6.3 7.9 8.9 8.9 orderStatus
| 285.0s 0 193.1 97.0 13.6 21.0 41.9 62.9 payment
| 285.0s 0 19.0 9.6 17.8 22.0 25.2 25.2 stockLevel
| 286.0s 0 16.0 9.6 54.5 83.9 88.1 88.1 delivery
| 286.0s 0 170.1 90.1 25.2 52.4 60.8 67.1 newOrder
| 286.0s 0 14.0 9.6 6.6 8.1 14.2 14.2 orderStatus
| 286.0s 0 193.1 97.3 13.6 29.4 44.0 58.7 payment
| 286.0s 0 18.0 9.6 15.7 23.1 25.2 25.2 stockLevel
| 287.0s 0 11.0 9.6 54.5 62.9 65.0 65.0 delivery
| 287.0s 0 192.0 90.5 24.1 28.3 33.6 37.7 newOrder
| 287.0s 0 19.0 9.7 6.8 8.1 8.4 8.4 orderStatus
| 287.0s 0 176.0 97.6 13.1 15.7 21.0 31.5 payment
| 287.0s 0 15.0 9.6 16.8 23.1 26.2 26.2 stockLevel
| 288.0s 0 20.0 9.7 54.5 67.1 75.5 75.5 delivery
| 288.0s 0 181.1 90.8 24.1 30.4 33.6 37.7 newOrder
| 288.0s 0 19.0 9.7 6.8 8.9 11.0 11.0 orderStatus
| 288.0s 0 176.1 97.9 13.6 17.8 22.0 24.1 payment
| 288.0s 0 25.0 9.7 18.9 24.1 28.3 28.3 stockLevel
| _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
| 289.0s 0 18.0 9.7 56.6 67.1 71.3 71.3 delivery
| 289.0s 0 173.0 91.1 25.2 37.7 58.7 65.0 newOrder
| 289.0s 0 17.0 9.7 6.0 8.9 13.1 13.1 orderStatus
| 289.0s 0 189.0 98.2 14.2 32.5 46.1 54.5 payment
| 289.0s 0 7.0 9.7 21.0 24.1 24.1 24.1 stockLevel
| 290.0s 0 21.0 9.7 56.6 71.3 75.5 75.5 delivery
| 290.0s 0 210.9 91.5 27.3 46.1 52.4 54.5 newOrder
| 290.0s 0 10.0 9.7 6.6 9.4 9.4 9.4 orderStatus
| 290.0s 0 207.9 98.6 15.2 35.7 44.0 52.4 payment
| 290.0s 0 19.0 9.7 19.9 27.3 27.3 27.3 stockLevel
Wraps: (8) secondary error attachment
| UNCLASSIFIED_PROBLEM: context canceled
| (1) UNCLASSIFIED_PROBLEM
| Wraps: (2) Node 5. Command with error:
| | ``````
| | ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json --ramp=5m0s --duration=2h0m0s --prometheus-port=0 --pprofport=33333 {pgurl:1-4}
| | ``````
| Wraps: (3) context canceled
| Error types: (1) errors.Unclassified (2) *hintdetail.withDetail (3) *errors.errorString
Wraps: (9) context canceled
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) *secondary.withSecondaryError (9) *errors.errorString
Parameters: `ROACHTEST_cloud=gce`, `ROACHTEST_cpu=16`, `ROACHTEST_ssd=0`
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)
- #74892 roachtest: tpcc/mixed-headroom/n5cpu16 failed [OOM during import while running 21.2] [C-test-failure O-roachtest O-robot T-bulkio branch-release-21.2]
This has not failed with the original failure mode. However, it failed at the same time as a number of other mixed-version tests 2 days ago. Moving that investigation to Test Eng.
The most recent failure seems unrelated to the other mixed-version failures, namely `version/mixed/nodes=3` and `version/mixed/nodes=5`. (Both failed because of the recent change requiring `COCKROACH_UPGRADE_TO_DEV_VERSION` [1].)
Also, this failure doesn't indicate any issue with the upgrade FSM. It appears to be a transient (network) error that causes the background (tpcc) workload to fail, thereby failing the test. Thus, I am removing the `xxx-blocker` labels. Full analysis is below.
[1] https://github.com/cockroachdb/cockroach/issues/87687#issuecomment-1243866806
From `teardown.log`, we can see that the background tpcc workload fails after ~5 minutes:
I220908 17:54:41.085738 1 workload/cli/run.go:427 [-] 1 creating load generator...
I220908 17:54:41.282881 1 workload/cli/run.go:458 [-] 2 creating load generator... done (took 197.141856ms)
I220908 17:59:31.796588 23519 workload/pgx_helpers.go:79 [-] 4 pgx logger [error]: Exec logParams=map[args:[] err:read tcp 10.142.0.10:54240 -> 10.142.0.41:26257: read: connection reset by peer pid:3623803 sql:begin time:143.851154ms]
Note that `10.142.0.41` maps to `n3`. Both `n1` and `n3` appear to experience transient network availability issues:
for i in `seq 1 4`; do echo "n${i}"; grep "failed to connect to n" logs/$i.unredacted/cockroach.log |tail -1;done
n1
I220908 17:54:10.201923 16855 kv/kvserver/closedts/sidetransport/sender.go:795 ⋮ [n1,ctstream=4] 507 side-transport failed to connect to n4: failed to connect to n4 at ‹10.142.0.21:26257›: ‹initial connection heartbeat failed›: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.142.0.21:26257: connect: connection refused"›
n2
W220908 17:59:33.250317 669937 2@rpc/nodedialer/nodedialer.go:192 ⋮ [n2] 787 unable to connect to n1: failed to connect to n1 at ‹10.142.0.33:26257›: ‹initial connection heartbeat failed›: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.142.0.33:26257: connect: connection refused"›
n3
I220908 17:54:10.196929 13357 kv/kvserver/closedts/sidetransport/sender.go:795 ⋮ [n3,ctstream=4] 577 side-transport failed to connect to n4: unable to dial n4: ‹breaker open›
n4
I220908 17:59:33.521021 16084 kv/kvserver/closedts/sidetransport/sender.go:795 ⋮ [n4,ctstream=1] 199 side-transport failed to connect to n1: unable to dial n1: ‹breaker open›
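The shell loop above can also be expressed as a small Python helper, which may be handier when the artifacts are post-processed by a script. This is only a sketch: the function name `last_match_per_node` and the in-memory `logs` mapping are illustrative assumptions, not part of roachtest.

```python
def last_match_per_node(logs, needle):
    """Return {node: last log line containing `needle`, or None}.

    `logs` maps a node name (e.g. "n1") to the full text of that node's
    cockroach.log. This mirrors `grep needle file | tail -1` per node.
    """
    result = {}
    for node, text in logs.items():
        # Keep only the lines that contain the search string.
        matches = [line for line in text.splitlines() if needle in line]
        result[node] = matches[-1] if matches else None
    return result
```

Reading each `logs/<i>.unredacted/cockroach.log` into the mapping and calling `last_match_per_node(logs, "failed to connect to n")` should give the same per-node lines as the loop.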
At the time of the workload failure (`17:59:31`), all the nodes are in the mixed-version state, executing migration jobs. (In the test harness, this is essentially the final step, `tpccBackgroundStepper.wait` [1].) From the node logs, we can see that the active cluster version is `1000022.1-48` on `n2` and `n4`, and `1000022.1-47` on `n1` and `n3`:
for i in `seq 1 4`; do echo "n${i}"; grep "active cluster version setting" logs/$i.unredacted/cockroach.log |tail -1;done
n1
I220908 17:59:30.780410 576281 server/migration.go:149 ⋮ [n1,bump-cluster-version] 1138 active cluster version setting is now ‹1000022.1-47(fence)› (up from ‹1000022.1-46›)
n2
I220908 17:59:30.993236 666404 server/migration.go:149 ⋮ [n2,bump-cluster-version] 755 active cluster version setting is now ‹1000022.1-48› (up from ‹1000022.1-47(fence)›)
n3
I220908 17:59:30.780334 716309 server/migration.go:149 ⋮ [n3,bump-cluster-version] 732 active cluster version setting is now ‹1000022.1-47(fence)› (up from ‹1000022.1-46›)
n4
I220908 17:59:31.189051 430132 server/migration.go:149 ⋮ [n4,bump-cluster-version] 159 active cluster version setting is now ‹1000022.1-48› (up from ‹1000022.1-47(fence)›)
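To compare versions across nodes without eyeballing the output, the `active cluster version setting` lines can be parsed and grouped. A rough sketch follows; the regular expression is written against the four lines quoted above only, and the helper names `active_versions` and `group_by_version` are made up for illustration.

```python
import re

# Matches the message text of the server/migration.go lines quoted above.
VERSION_RE = re.compile(
    r"\[(n\d+),bump-cluster-version\].*"
    r"active cluster version setting is now ‹([^›]+)› \(up from ‹([^›]+)›\)"
)


def active_versions(lines):
    """Map node -> most recent active cluster version found in `lines`."""
    versions = {}
    for line in lines:
        m = VERSION_RE.search(line)
        if m:
            node, new_version, _old_version = m.groups()
            versions[node] = new_version  # later lines overwrite earlier ones
    return versions


def group_by_version(versions):
    """Map version -> sorted list of nodes, so stragglers are easy to spot."""
    groups = {}
    for node, version in sorted(versions.items()):
        groups.setdefault(version, []).append(node)
    return groups
```

Fed the four lines above, this groups `n1` and `n3` under `1000022.1-47(fence)` and `n2` and `n4` under `1000022.1-48`.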
The workload failure induced the test failure by invoking `t.Fatal` [2] after the monitor detects an error (via `WaitE`). As every roachtest failure induces `collectClusterArtifacts`, we attempt to grab the logs from every node. However, as can be seen in `teardown.log`, some of the logs could not be transferred successfully. Upon closer examination, it appears that errors are swallowed inside `cluster.Get` [3] (`l.File` is non-nil when invoked from roachtest and one of the lines contains an error message).
teardown: 17:59:35 cluster.go:1118: failed to fetch logs: cluster.Get: get logs failed
Thus, it's technically possible that some of the logs may have been truncated. However, it's highly unlikely that both `n1`'s and `n3`'s `cockroach.log` got truncated. According to `journalctl`, both nodes exit with status `1`, at `17:59:31` and `17:59:32`:
Sep 08 17:59:31 teamcity-6383257-1662614354-100-n5cpu16-0003 systemd[1]: cockroach.service: Main process exited, code=exited, status=1/FAILURE
Sep 08 17:59:32 teamcity-6383257-1662614354-100-n5cpu16-0001 systemd[1]: cockroach.service: Main process exited, code=exited, status=1/FAILURE
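For scanning many `journalctl` dumps for this signature, something along these lines might help. It is a sketch under two assumptions: the systemd message has exactly the shape shown above, and the numeric hostname suffix is the node index (e.g. `-0003` is `n3`), which matches this run. `parse_exits` is a hypothetical helper name.

```python
import re

# Timestamp, hostname, and exit status of the cockroach.service main process.
EXIT_RE = re.compile(
    r"(\w{3} \d{2} \d{2}:\d{2}:\d{2}) (\S+) systemd\[1\]: "
    r"cockroach\.service: Main process exited, code=exited, status=(\d+)/"
)


def parse_exits(lines):
    """Return a list of (timestamp, node_index, exit_status) tuples."""
    exits = []
    for line in lines:
        m = EXIT_RE.search(line)
        if not m:
            continue
        timestamp, host, status = m.groups()
        # Assumption: hostnames end in a zero-padded node index, e.g. "-0003".
        node_index = int(host.rsplit("-", 1)[1])
        exits.append((timestamp, node_index, int(status)))
    return exits
```

An unanchored search is used so that lines carrying a file-name prefix (as `grep` over several files would produce) still match.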
Note that neither process was killed, yet there is no trace of any `panic` in the logs. It appears that both nodes exited with `UnspecifiedError`. Oddly, the message `"Failed running %q\n"` [4] is not in any of the logs. These are the last few messages in `cockroach.log`:
tail -5 logs/1.unredacted/cockroach.log
I220908 17:59:30.787980 49890 upgrade/upgradecluster/cluster.go:118 ⋮ [n1,client=35.196.70.170:33426,user=root,migration-mgr] 1142 executing bump-cluster-version=1000022.1-48 on nodes n{1,2,3,4}
I220908 17:59:30.875387 573820 sql/gcjob/gc_job_utils.go:58 ⋮ [n1,job=794917503914573825] 1143 marked index 3 as GC'd
I220908 17:59:30.881949 573820 sql/gcjob/gc_job_utils.go:289 ⋮ [n1,job=794917503914573825] 1144 updated progress payload: ‹indexes:<index_id:3 status:CLEARED > ranges_unsplit_done:true›
I220908 17:59:30.886290 573820 sql/gcjob/gc_job_utils.go:296 ⋮ [n1,job=794917503914573825] 1145 updated running status: ‹waiting for GC TTL›
I220908 17:59:30.889058 573820 jobs/registry.go:1205 ⋮ [n1] 1146 SCHEMA CHANGE GC job 794917503914573825: stepping through state succeeded with error: <nil>
tail -5 logs/3.unredacted/cockroach.log
I220908 17:59:30.765627 716258 server/migration.go:149 ⋮ [n3,bump-cluster-version] 730 active cluster version setting is now ‹1000022.1-45(fence)› (up from ‹1000022.1-44›)
I220908 17:59:30.770121 716191 server/migration.go:149 ⋮ [n3,bump-cluster-version] 731 active cluster version setting is now ‹1000022.1-46› (up from ‹1000022.1-45(fence)›)
I220908 17:59:30.780334 716309 server/migration.go:149 ⋮ [n3,bump-cluster-version] 732 active cluster version setting is now ‹1000022.1-47(fence)› (up from ‹1000022.1-46›)
I220908 17:59:31.047900 44899 jobs/wait.go:152 ⋮ [n3,intExec=‹set-version›,migration-mgr] 733 waited for 1 [794916516998709249] queued jobs to complete 4m44.045664367s
I220908 17:59:31.049316 44899 upgrade/upgradecluster/cluster.go:118 ⋮ [n3,intExec=‹set-version›,migration-mgr] 734 executing bump-cluster-version=1000022.1-17(fence) on nodes n{1,2,3,4}
[1] https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/tests/tpcc.go#L431
[2] https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/tests/mixed_version_jobs.go#L73
[3] https://github.com/cockroachdb/cockroach/blob/master/pkg/roachprod/install/cluster_synced.go#L2007
[4] https://github.com/cockroachdb/cockroach/blob/master/pkg/cli/cli.go#L73
Examining both system and application metrics, nothing looks anomalous. All nodes have ample system resources. The graphs below corroborate that both `n1` and `n3` terminate at `17:59:31` while the other two nodes continue to execute.
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 726cf22b9f06b766d857b4617dec0df18d1e5cd0:
| 283.0s 0 203.1 95.9 21.0 29.4 37.7 44.0 payment
| 283.0s 0 19.0 9.6 29.4 39.8 46.1 46.1 stockLevel
| 284.0s 0 16.0 9.4 65.0 75.5 100.7 100.7 delivery
| 284.0s 0 182.0 89.4 32.5 46.1 52.4 62.9 newOrder
| 284.0s 0 11.0 9.6 6.0 6.8 8.4 8.4 orderStatus
| 284.0s 0 197.0 96.2 19.9 27.3 32.5 35.7 payment
| 284.0s 0 19.0 9.6 29.4 50.3 56.6 56.6 stockLevel
| _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
| 285.0s 0 15.0 9.4 60.8 71.3 75.5 75.5 delivery
| 285.0s 0 205.8 89.8 32.5 41.9 44.0 48.2 newOrder
| 285.0s 0 16.0 9.6 6.3 11.5 11.5 11.5 orderStatus
| 285.0s 0 197.8 96.6 21.0 31.5 46.1 46.1 payment
| 285.0s 0 16.0 9.7 23.1 35.7 52.4 52.4 stockLevel
| 286.0s 0 15.0 9.4 67.1 79.7 83.9 83.9 delivery
| 286.0s 0 183.1 90.1 31.5 39.8 50.3 54.5 newOrder
| 286.0s 0 15.0 9.6 7.3 8.9 10.5 10.5 orderStatus
| 286.0s 0 178.1 96.8 21.0 28.3 35.7 56.6 payment
| 286.0s 0 16.0 9.7 30.4 41.9 46.1 46.1 stockLevel
| 287.0s 0 16.0 9.4 56.6 62.9 65.0 65.0 delivery
| 287.0s 0 192.1 90.5 32.5 41.9 48.2 52.4 newOrder
| 287.0s 0 18.0 9.6 6.8 8.9 10.5 10.5 orderStatus
| 287.0s 0 189.1 97.2 21.0 28.3 31.5 32.5 payment
| 287.0s 0 18.0 9.7 25.2 46.1 50.3 50.3 stockLevel
| 288.0s 0 18.0 9.5 62.9 71.3 75.5 75.5 delivery
| 288.0s 0 193.0 90.8 33.6 44.0 56.6 62.9 newOrder
| 288.0s 0 20.0 9.7 6.6 8.9 10.0 10.0 orderStatus
| 288.0s 0 186.0 97.5 21.0 29.4 32.5 39.8 payment
| 288.0s 0 23.0 9.8 23.1 39.8 46.1 46.1 stockLevel
| _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
| 289.0s 0 16.0 9.5 65.0 83.9 104.9 104.9 delivery
| 289.0s 0 174.0 91.1 35.7 60.8 83.9 96.5 newOrder
| 289.0s 0 15.0 9.7 6.6 9.4 10.5 10.5 orderStatus
| 289.0s 0 174.0 97.7 21.0 41.9 67.1 71.3 payment
| 289.0s 0 18.0 9.8 23.1 35.7 52.4 52.4 stockLevel
| 290.0s 0 11.0 9.5 60.8 67.1 71.3 71.3 delivery
| 290.0s 0 182.9 91.4 33.6 46.1 50.3 71.3 newOrder
| 290.0s 0 19.0 9.7 6.8 9.4 10.0 10.0 orderStatus
| 290.0s 0 195.9 98.1 21.0 29.4 32.5 35.7 payment
| 290.0s 0 17.0 9.8 26.2 37.7 44.0 44.0 stockLevel
Wraps: (8) secondary error attachment
| UNCLASSIFIED_PROBLEM: context canceled
| (1) UNCLASSIFIED_PROBLEM
| Wraps: (2) Node 5. Command with error:
| | ``````
| | ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json --ramp=5m0s --duration=2h0m0s --prometheus-port=0 --pprofport=33333 {pgurl:1-4}
| | ``````
| Wraps: (3) context canceled
| Error types: (1) errors.Unclassified (2) *hintdetail.withDetail (3) *errors.errorString
Wraps: (9) context canceled
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) *secondary.withSecondaryError (9) *errors.errorString
Parameters: `ROACHTEST_cloud=gce`, `ROACHTEST_cpu=16`, `ROACHTEST_encrypted=true`, `ROACHTEST_fs=ext4`, `ROACHTEST_localSSD=true`, `ROACHTEST_ssd=0`
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)
- #74892 roachtest: tpcc/mixed-headroom/n5cpu16 failed [OOM during import while running 21.2] [C-test-failure O-roachtest O-robot T-disaster-recovery branch-release-21.2]
cc @cockroachdb/test-eng
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ a0bfa6dafcc206301d3a21887c374db63b377075:
| 65.0s 0 21.0 18.4 52.4 60.8 71.3 71.3 delivery
| 65.0s 0 196.1 194.2 26.2 32.5 39.8 46.1 newOrder
| 65.0s 0 11.0 18.1 7.6 8.9 9.4 9.4 orderStatus
| 65.0s 0 195.1 187.9 14.7 19.9 23.1 31.5 payment
| 65.0s 0 20.0 18.3 18.9 27.3 39.8 39.8 stockLevel
| 66.0s 0 19.0 18.5 54.5 88.1 88.1 88.1 delivery
| 66.0s 0 192.9 194.2 27.3 39.8 48.2 56.6 newOrder
| 66.0s 0 15.0 18.0 6.3 7.6 8.4 8.4 orderStatus
| 66.0s 0 197.9 188.0 16.3 23.1 25.2 32.5 payment
| 66.0s 0 16.0 18.3 18.9 39.8 48.2 48.2 stockLevel
| 67.0s 0 16.0 18.4 52.4 58.7 60.8 60.8 delivery
| 67.0s 0 196.6 194.3 27.3 37.7 46.1 52.4 newOrder
| 67.0s 0 19.0 18.0 5.8 7.3 7.6 7.6 orderStatus
| 67.0s 0 184.7 188.0 15.7 23.1 25.2 30.4 payment
| 67.0s 0 22.0 18.3 15.2 29.4 48.2 48.2 stockLevel
| 68.0s 0 16.0 18.4 54.5 65.0 65.0 65.0 delivery
| 68.0s 0 188.5 194.2 25.2 35.7 41.9 44.0 newOrder
| 68.0s 0 20.0 18.1 5.8 8.9 8.9 8.9 orderStatus
| 68.0s 0 191.5 188.0 14.7 21.0 27.3 31.5 payment
| 68.0s 0 15.0 18.3 18.9 27.3 41.9 41.9 stockLevel
| _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
| 69.0s 0 14.0 18.3 50.3 58.7 62.9 62.9 delivery
| 69.0s 0 213.7 194.5 26.2 37.7 52.4 54.5 newOrder
| 69.0s 0 19.0 18.1 5.5 7.6 7.9 7.9 orderStatus
| 69.0s 0 163.7 187.7 15.7 26.2 32.5 35.7 payment
| 69.0s 0 22.0 18.3 24.1 54.5 58.7 58.7 stockLevel
| 70.0s 0 17.0 18.3 52.4 65.0 65.0 65.0 delivery
| 70.0s 0 185.0 194.3 27.3 39.8 41.9 46.1 newOrder
| 70.0s 0 15.0 18.0 6.0 7.1 7.3 7.3 orderStatus
| 70.0s 0 164.0 187.3 15.7 24.1 29.4 32.5 payment
| 70.0s 0 18.0 18.3 16.3 31.5 46.1 46.1 stockLevel
| 71.0s 0 11.0 18.2 56.6 67.1 67.1 67.1 delivery
| 71.0s 0 218.3 194.7 29.4 56.6 79.7 79.7 newOrder
| 71.0s 0 22.0 18.1 6.6 12.1 14.2 14.2 orderStatus
| 71.0s 0 199.2 187.5 17.8 32.5 46.1 65.0 payment
| 71.0s 0 13.0 18.3 22.0 41.9 48.2 48.2 stockLevel
| 72.0s 0 7.0 18.0 56.6 67.1 67.1 67.1 delivery
| 72.0s 0 111.0 193.5 28.3 37.7 41.9 44.0 newOrder
| 72.0s 0 7.0 17.9 6.6 7.3 7.3 7.3 orderStatus
| 72.0s 0 93.0 186.2 17.8 23.1 28.3 32.5 payment
| 72.0s 0 6.0 18.1 19.9 52.4 52.4 52.4 stockLevel
Wraps: (8) COMMAND_PROBLEM
Wraps: (9) Node 5. Command with error:
| ``````
| ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json --ramp=5m0s --duration=2h0m0s --prometheus-port=0 --pprofport=33333 {pgurl:1-4}
| ``````
Wraps: (10) exit status 1
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) errors.Cmd (9) *hintdetail.withDetail (10) *exec.ExitError
versionupgrade.go:530,versionupgrade.go:197,tpcc.go:432,test_runner.go:928: context canceled
Parameters: `ROACHTEST_cloud=gce`, `ROACHTEST_cpu=16`, `ROACHTEST_encrypted=false`, `ROACHTEST_fs=ext4`, `ROACHTEST_localSSD=true`, `ROACHTEST_ssd=0`
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)
- #88668 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot blocks-22.2.0-beta.2 branch-release-22.2 release-blocker]
- #74892 roachtest: tpcc/mixed-headroom/n5cpu16 failed [OOM during import while running 21.2] [C-test-failure O-roachtest O-robot T-disaster-recovery branch-release-21.2]
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 84384b50c023dd4c05fff76af85a6975f5d2b0ab:
| 252.0s 0 162.0 79.1 25.2 35.7 46.1 54.5 newOrder
| 252.0s 0 19.0 8.8 7.3 8.4 8.9 8.9 orderStatus
| 252.0s 0 157.0 86.1 13.6 26.2 30.4 32.5 payment
| 252.0s 0 13.0 8.8 16.3 25.2 31.5 31.5 stockLevel
| _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
| 253.0s 0 10.0 8.5 54.5 83.9 83.9 83.9 delivery
| 253.0s 0 160.1 79.4 23.1 28.3 35.7 41.9 newOrder
| 253.0s 0 23.0 8.8 5.8 8.9 15.7 15.7 orderStatus
| 253.0s 0 143.1 86.3 13.1 24.1 37.7 39.8 payment
| 253.0s 0 12.0 8.8 15.7 31.5 31.5 31.5 stockLevel
| 254.0s 0 15.0 8.5 54.5 67.1 75.5 75.5 delivery
| 254.0s 0 139.0 79.6 25.2 35.7 46.1 75.5 newOrder
| 254.0s 0 10.0 8.9 6.8 8.9 8.9 8.9 orderStatus
| 254.0s 0 173.0 86.6 13.6 22.0 31.5 48.2 payment
| 254.0s 0 15.0 8.8 22.0 31.5 50.3 50.3 stockLevel
| 255.0s 0 7.0 8.5 54.5 56.6 56.6 56.6 delivery
| 255.0s 0 156.0 79.9 25.2 33.6 39.8 50.3 newOrder
| 255.0s 0 14.0 8.9 6.3 8.4 10.5 10.5 orderStatus
| 255.0s 0 181.0 87.0 13.6 25.2 28.3 33.6 payment
| 255.0s 0 24.0 8.9 13.1 28.3 31.5 31.5 stockLevel
| 256.0s 0 10.0 8.5 50.3 113.2 113.2 113.2 delivery
| 256.0s 0 140.0 80.2 25.2 44.0 48.2 54.5 newOrder
| 256.0s 0 12.0 8.9 6.0 7.9 8.9 8.9 orderStatus
| 256.0s 0 191.9 87.4 14.2 35.7 46.1 48.2 payment
| 256.0s 0 13.0 8.9 17.8 27.3 27.3 27.3 stockLevel
| _elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
| 257.0s 0 12.0 8.5 56.6 71.3 75.5 75.5 delivery
| 257.0s 0 184.1 80.6 26.2 32.5 39.8 41.9 newOrder
| 257.0s 0 9.0 8.9 6.8 8.9 8.9 8.9 orderStatus
| 257.0s 0 152.1 87.7 14.7 22.0 29.4 37.7 payment
| 257.0s 0 13.0 8.9 19.9 27.3 39.8 39.8 stockLevel
| 258.0s 0 24.0 8.6 56.6 71.3 83.9 83.9 delivery
| 258.0s 0 175.8 81.0 25.2 37.7 46.1 62.9 newOrder
| 258.0s 0 19.0 8.9 7.9 11.0 11.0 11.0 orderStatus
| 258.0s 0 165.8 88.0 14.2 23.1 39.8 41.9 payment
| 258.0s 0 15.0 8.9 18.9 24.1 41.9 41.9 stockLevel
| 259.0s 0 12.0 8.6 54.5 62.9 79.7 79.7 delivery
| 259.0s 0 137.0 81.2 25.2 33.6 41.9 46.1 newOrder
| 259.0s 0 17.0 9.0 7.9 10.5 10.5 10.5 orderStatus
| 259.0s 0 156.0 88.2 13.6 19.9 24.1 37.7 payment
| 259.0s 0 17.0 8.9 18.9 28.3 39.8 39.8 stockLevel
Wraps: (8) COMMAND_PROBLEM
Wraps: (9) Node 5. Command with error:
| ``````
| ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json --ramp=5m0s --duration=2h0m0s --prometheus-port=0 --pprofport=33333 {pgurl:1-4}
| ``````
Wraps: (10) exit status 1
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *cluster.WithCommandDetails (8) errors.Cmd (9) *hintdetail.withDetail (10) *exec.ExitError
versionupgrade.go:530,versionupgrade.go:197,tpcc.go:432,test_runner.go:928: pq: query execution canceled
Parameters: `ROACHTEST_cloud=gce`, `ROACHTEST_cpu=16`, `ROACHTEST_encrypted=true`, `ROACHTEST_fs=zfs`, `ROACHTEST_localSSD=true`, `ROACHTEST_ssd=0`
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)
- #88668 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-22.2]
- #74892 roachtest: tpcc/mixed-headroom/n5cpu16 failed [OOM during import while running 21.2] [C-test-failure O-roachtest O-robot T-disaster-recovery branch-release-21.2]
Latest failure has the same failure mode,
Oct 03 15:59:13 teamcity-6749797-1664774404-105-n5cpu16-0003 systemd[1]: cockroach.service: Main process exited, code=exited, status=1/FAILURE
Ongoing internal investigation: https://cockroachlabs.slack.com/archives/C01CDD4HRC5/p1664819770906019?thread_ts=1664295784.890119&cid=C01CDD4HRC5
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ e06d2286b011096526eda7f2d7f7bb7acea0ae84:
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
(versionupgrade.go:533).setClusterSettingVersionStep: pq: rpc error: code = Unavailable desc = error reading from server: read tcp 10.142.1.113:47054->10.142.1.79:26257: read: connection reset by peer
(monitor.go:127).Wait: monitor failure: monitor task failed: output in run_144818.024420287_n5_cockroach_workload_run_tpcc: ./cockroach workload run tpcc --warehouses=909 --histograms=perf/stats.json --ramp=5m0s --duration=2h0m0s --prometheus-port=0 --pprofport=33333 {pgurl:1-4} returned: context canceled
Parameters: `ROACHTEST_cloud=gce`, `ROACHTEST_cpu=16`, `ROACHTEST_encrypted=true`, `ROACHTEST_fs=ext4`, `ROACHTEST_localSSD=true`, `ROACHTEST_ssd=0`
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)
- #88668 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-22.2]
- #74892 roachtest: tpcc/mixed-headroom/n5cpu16 failed [OOM during import while running 21.2] [C-test-failure O-roachtest O-robot T-disaster-recovery branch-release-21.2]
Yet another example of a node doing `exit 1` without any stack trace.
In `test.log`:
14:48:18 tpcc.go:254: test worker status: running tpcc worker=0 warehouses=909 ramp=5m0s duration=2h0m0s on {pgurl:1-4} (<1m0s)
In `journalctl.txt`:
2.journalctl.txt:Oct 08 14:52:40 teamcity-6837129-1665206365-100-n5cpu16-0002 systemd[1]: cockroach.service: Main process exited, code=exited, status=1/FAILURE
In `cockroach-pebble`, the last upgraded format version is `008`:
I221008 14:48:10.546483 46770 3@pebble/event.go:645 ⋮ [n2,pebble,s2] 5555 upgraded to format version: ‹008›
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 7be0b20edbc336200c1510a9c6f1d76ae2f92c3a:
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
(monitor.go:127).Wait: monitor failure: monitor task failed: output in run_142544.008795915_n1_v2216cockroach_workload_fixtures_import_bank: v22.1.6/cockroach workload fixtures import bank --payload-bytes=10240 --rows=32552083 --seed=4 --db=bigbank returned: SSH_PROBLEM: exit status 255
(test_runner.go:1062).teardownTest: test timed out (0s)
Parameters: `ROACHTEST_cloud=gce`, `ROACHTEST_cpu=16`, `ROACHTEST_encrypted=false`, `ROACHTEST_fs=zfs`, `ROACHTEST_localSSD=true`, `ROACHTEST_ssd=0`
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)
- #89755 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-22.2.0 release-blocker]
- #88668 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-22.2]
- #74892 roachtest: tpcc/mixed-headroom/n5cpu16 failed [OOM during import while running 21.2] [C-test-failure O-roachtest O-robot T-disaster-recovery branch-release-21.2]
The last failure is an entirely different failure mode. The bank import step appears to run for hours until it's killed due to the test timeout.
The preceding step to import tpcc completes at `14:25`:
I221015 14:25:41.948258 1 ccl/workloadccl/fixture.go:326 [-] 11 imported 62 GiB bytes in 9 tables (took 4m59.344260633s, 212.78 MiB/s)
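As a quick sanity check on the reported rate, the figures in that line can be recomputed. The sketch below parses the Go-style duration and divides; `parse_go_duration` and `mib_per_sec` are hypothetical helper names, and since `62 GiB` is a rounded value the result only approximately matches the logged `212.78 MiB/s`.

```python
import re

# Seconds per unit for the Go duration units handled here.
UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}
# "ms" must be tried before "m" so that "197ms" is not read as minutes.
PART_RE = re.compile(r"(\d+(?:\.\d+)?)(h|ms|us|µs|ns|m|s)")
FULL_RE = re.compile(r"(?:\d+(?:\.\d+)?(?:h|ms|us|µs|ns|m|s))+")


def parse_go_duration(text):
    """Convert a Go-style duration such as "4m59.344260633s" to seconds."""
    if not FULL_RE.fullmatch(text):
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(float(value) * UNIT_SECONDS[unit] for value, unit in PART_RE.findall(text))


def mib_per_sec(gib, duration):
    """Average throughput in MiB/s for `gib` GiB moved in `duration`."""
    return gib * 1024 / parse_go_duration(duration)
```

`mib_per_sec(62, "4m59.344260633s")` comes out at roughly 212.1 MiB/s, in line with the logged figure given the rounding of the byte count.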
The bank import starts immediately after,
run_142544.008795915_n1_v2216cockroach_workload_fixtures_import_bank: 14:25:44 cluster.go:291: > v22.1.6/cockroach workload fixtures import bank --payload-bytes=10240 --rows=32552083 --seed=4 --db=bigbank
I221015 14:25:44.936592 1 ccl/workloadccl/fixture.go:318 [-] 1 starting import of 1 tables
All nodes appear to be live for the remaining ~10 hours.
@stevendanna Would you mind taking a look at the logs to see what could possibly have caused the import to run for ~10 hours? The last warning message concerning the import is at `15:35`:
logs/3.unredacted/cockroach.log:W221015 15:35:13.557144 86844 kv/bulk/sst_batcher.go:469 ⋮ [n3,f‹d1df2c12›,job=805350752177750017] 25254 ‹bank rows› failed to scatter : existing range size 10496962 exceeds specified limit 4194304
On `n2` we see these warnings every minute, starting at `15:08`, ~6 minutes after the split is initiated:
logs/2.unredacted/cockroach.log:I221015 15:02:47.223326 373620 kv/kvserver/pkg/kv/kvserver/replica_command.go:420 ⋮ [n2,s2,r6742/1:‹/Table/181/1/{284963…-325520…}›] 18370 initiating a split of this range at key ‹/Table/181/1/28503022› [r6746] (‹manual›)‹›
logs/2.unredacted/cockroach.log:I221015 15:02:47.346852 373677 kv/kvserver/pkg/kv/kvserver/replica_command.go:2260 ⋮ [n2,s2,r6746/1:‹/Table/181/1/{285030…-325520…}›] 18375 change replicas (add [(n4,s4):4LEARNER] remove []): existing descriptor r6746:‹/Table/181/1/{28503022-32552000}› [(n2,s2):1, (n1,s1):2, (n3,s3):3, next=4, gen=3665, sticky=1665846767.222826180,0]
logs/2.unredacted/cockroach.log:W221015 15:08:49.363284 680702 kv/kvserver/pkg/kv/kvserver/merge_queue.go:411 ⋮ [n2,merge,s2,r6746/1:‹/Table/181/1/{285030…-325520…}›] 19925 ‹kv/kvserver/pkg/kv/kvserver/replica_command.go›:810: merge failed: fetching current range descriptor value: context deadline exceeded
logs/2.unredacted/cockroach.log:W221015 15:09:49.364832 741327 kv/kvclient/kvcoord/dist_sender.go:1602 ⋮ [n2,merge,s2,r6746/1:‹/Table/181/1/{285030…-325520…}›] 20262 slow range RPC: have been waiting 60.00s (1 attempts) for RPC Get [‹/Local/Range/Table/181/1/28503022/RangeDescriptor›,‹/Min›), [txn: c5a14092], [can-forward-ts] to r6746:‹/Table/181/1/{28503022-32552000}› [(n2,s2):1, (n1,s1):2, (n3,s3):3, next=4, gen=3665, sticky=1665846767.222826180,0]; resp: ‹(err: context deadline exceeded: "merge" meta={id=c5a14092 key=/Local/Range/Table/181/1/28503022/RangeDescriptor pri=0.00562966 epo=0 ts=1665846529.364048180,0 min=1665846529.364048180,0 seq=0} lock=true stat=PENDING rts=1665846529.364048180,0 wto=false gul=1665846529.864048180,0)›
and persisting until the timeout at `00:19:50`:
W221016 00:19:50.341423 741327 kv/kvserver/pkg/kv/kvserver/merge_queue.go:411 ⋮ [n2,merge,s2,r6746/1:‹/Table/181/1/{285030…-325520…}›] 28905 ‹kv/kvserver/pkg/kv/kvserver/replica_command.go›:810: merge failed: fetching current range descriptor value: context deadline exceeded
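To put a number on how long the merge queue kept failing, the timestamps in the log headers can be subtracted. A minimal sketch, assuming the usual header layout seen in these excerpts (severity letter, then `yymmdd hh:mm:ss.micros`); `log_time` is a made-up helper name.

```python
import re
from datetime import datetime

# Severity letter followed by "yymmdd hh:mm:ss.micros", e.g. "W221015 15:08:49.363284".
HEADER_RE = re.compile(r"[IWEF](\d{6} \d{2}:\d{2}:\d{2}\.\d{6})")


def log_time(line):
    """Extract the timestamp from a CockroachDB log line as a datetime."""
    m = HEADER_RE.search(line)
    if m is None:
        raise ValueError("no log header found")
    return datetime.strptime(m.group(1), "%y%m%d %H:%M:%S.%f")


first = "logs/2.unredacted/cockroach.log:W221015 15:08:49.363284 680702 kv/kvserver/..."
last = "W221016 00:19:50.341423 741327 kv/kvserver/..."
elapsed = log_time(last) - log_time(first)
```

For the first and last `merge failed` warnings quoted here, `elapsed` works out to a little over 9 hours and 11 minutes.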
Was the panic provided in cockroachdb/pebble#2019 not making it into the logs? This appears to be the source of failures going back to Aug 29. Do we know why the panic didn't make it into the logs?
Indeed, it never made it to the logs, which is what made debugging this test failure difficult. We are looking into why the crash was never logged (internal discussion).
This (`exit 1`) should be fixed by #90406. Closing so that we get a new issue for any future failures.
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 8d34ef1ea15850ee1c70470610b6652df4c317de:
Parameters: `ROACHTEST_cloud=gce`, `ROACHTEST_cpu=16`, `ROACHTEST_ssd=0`
Help
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)
Same failure on other branches
- #74892 roachtest: tpcc/mixed-headroom/n5cpu16 failed [OOM during import while running 21.2] [C-test-failure O-roachtest O-robot T-bulkio branch-release-21.2]
/cc @cockroachdb/kv-triage
Jira issue: CRDB-16849
Epic CRDB-19172