Closed cockroach-teamcity closed 2 years ago
node 6 was OOM killed according to 6.dmesg.txt
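For anyone triaging similar artifacts, here is a minimal sketch of scanning a dmesg capture such as 6.dmesg.txt for OOM-killer entries. It assumes the common Linux kernel wording ("Out of memory: Killed process <pid> (<name>)", or "Kill process" on older kernels). The helper name `find_oom_kills` and the sample line are hypothetical and are not taken from the actual artifacts.

```python
import re

# Assumed kernel message shape; older kernels say "Kill process", newer "Killed process".
OOM_RE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(dmesg_text):
    """Return a list of (pid, process_name) for each OOM-kill line in the text."""
    kills = []
    for line in dmesg_text.splitlines():
        match = OOM_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


# Illustrative input only (not copied from 6.dmesg.txt).
sample = (
    "[ 1234.567890] Out of memory: Killed process 4242 (cockroach) total-vm:100kB\n"
    "[ 1234.600000] unrelated kernel message\n"
)
print(find_oom_kills(sample))
```

In practice one would read the file contents (for example with `open("6.dmesg.txt").read()`) and pass them to the helper instead of the sample string.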
This looks like https://github.com/cockroachdb/cockroach/issues/73376. cc: @tbg in case the artifacts help further the investigation.
Removing the release-blocker label here, since this is a known issue that pre-dates 22.1.
cc @cockroachdb/replication
roachtest.import/tpcc/warehouses=4000/geo failed with artifacts on master @ a82711442c65cf14489c55041b45b11a1e38415b:
Wraps: (2) output in run_100123.166589701_n1_cockroach_workload_fixtures_import_tpcc
Wraps: (3) ./cockroach workload fixtures import tpcc --warehouses=4000 --csv-server='http://localhost:8081' returned
| stderr:
| I220909 10:01:25.224016 1 ccl/workloadccl/fixture.go:318 [-] 1 starting import of 9 tables
| I220909 10:01:32.075472 79 ccl/workloadccl/fixture.go:481 [-] 2 imported 7.9 MiB in item table (100000 rows, 0 index entries, took 6.147239254s, 1.28 MiB/s)
| I220909 10:01:32.449515 74 ccl/workloadccl/fixture.go:481 [-] 3 imported 3.9 MiB in district table (40000 rows, 0 index entries, took 6.521467844s, 0.60 MiB/s)
| I220909 10:01:33.339839 73 ccl/workloadccl/fixture.go:481 [-] 4 imported 213 KiB in warehouse table (4000 rows, 0 index entries, took 7.411826759s, 0.03 MiB/s)
| I220909 10:02:29.302066 78 ccl/workloadccl/fixture.go:481 [-] 5 imported 546 MiB in new_order table (36000000 rows, 0 index entries, took 1m3.37385474s, 8.62 MiB/s)
| Error: importing fixture: importing table history: pq: job 795106626557018113: could not mark as reverting: job 795106626557018113: with status running: expected session "aef9d1829fda40ec8aed76104bc9c51d" but found NULL
|
| stdout:
Wraps: (4) COMMAND_PROBLEM
Wraps: (5) Node 1. Command with error:
| ``````
| ./cockroach workload fixtures import tpcc --warehouses=4000 --csv-server='http://localhost:8081'
| ``````
Wraps: (6) exit status 1
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError
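As a rough aid for reading the per-table import lines in the stderr output above, the following sketch parses one such line and recomputes the throughput from the reported size and duration. The regular expression is inferred only from the lines quoted in this report and is an assumption about the log format, not something taken from fixture.go. The names `parse_go_duration` and `parse_import_line` are hypothetical. Because the logged size is rounded, the recomputed rate can differ slightly from the logged one.

```python
import re

# Assumed line shape, inferred from the log lines quoted in this report.
LINE_RE = re.compile(
    r"imported ([\d.]+) (KiB|MiB|GiB) in (\w+) table "
    r"\((\d+) rows, (\d+) index entries, took ([^,]+), ([\d.]+) MiB/s\)"
)
UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
UNIT_MIB = {"KiB": 1.0 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_go_duration(text):
    """Convert a Go-style duration such as '1m3.37385474s' to seconds (h, m, s, ms only)."""
    parts = re.findall(r"(\d+(?:\.\d+)?)(h|ms|m|s)", text)
    return sum(float(value) * UNIT_SECONDS[unit] for value, unit in parts)


def parse_import_line(line):
    """Return a dict describing one 'imported ... table' line, or None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    size, unit, table, rows, _index_entries, took, reported = match.groups()
    size_mib = float(size) * UNIT_MIB[unit]
    seconds = parse_go_duration(took)
    return {
        "table": table,
        "rows": int(rows),
        "size_mib": size_mib,
        "seconds": seconds,
        "reported_mib_per_s": float(reported),
        "recomputed_mib_per_s": size_mib / seconds,
    }


line = (
    "imported 546 MiB in new_order table (36000000 rows, 0 index entries, "
    "took 1m3.37385474s, 8.62 MiB/s)"
)
print(parse_import_line(line))
```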
monitor.go:127,import.go:154,import.go:181,test_runner.go:906: monitor failure: monitor task failed: t.Fatal() was called
(1) attached stack trace
-- stack trace:
| main.(*monitorImpl).WaitE
| main/pkg/cmd/roachtest/monitor.go:115
| main.(*monitorImpl).Wait
| main/pkg/cmd/roachtest/monitor.go:123
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func1
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:154
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func3
| github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:181
| main.(*testRunner).runTest.func2
| main/pkg/cmd/roachtest/test_runner.go:906
Wraps: (2) monitor failure
Wraps: (3) attached stack trace
-- stack trace:
| main.(*monitorImpl).wait.func2
| main/pkg/cmd/roachtest/monitor.go:171
Wraps: (4) monitor task failed
Wraps: (5) attached stack trace
-- stack trace:
| main.init
| main/pkg/cmd/roachtest/monitor.go:80
| runtime.doInit
| GOROOT/src/runtime/proc.go:6340
| runtime.main
| GOROOT/src/runtime/proc.go:233
| runtime.goexit
| GOROOT/src/runtime/asm_amd64.s:1594
Wraps: (6) t.Fatal() was called
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=16, ROACHTEST_ssd=0
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)
- #81430 roachtest: import/tpcc/warehouses=4000/geo failed [C-test-failure O-roachtest O-robot T-bulkio branch-release-22.1]
- #76824 roachtest: import/tpcc/warehouses=4000/geo failed [raft sideload oom] [C-test-failure O-roachtest O-robot S-3 T-kv-replication X-nostale branch-release-21.2 no-test-failure-activity]
This seems to be a different failure mode than a raft sideload OOM. It's really unfortunate that TC linked this new failure mode to the raft failure issue (I'll follow up with test eng on this). What I see so far:
importing fixture: importing table history: pq: job 795106626557018113: could not mark as reverting: job 795106626557018113: with status running: expected session "aef9d1829fda40ec8aed76104bc9c51d" but found NULL
The job does not appear in crdb_internal.jobs.txt, but does show up in system.jobs.txt. I'll let current L2 further investigate.
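A quick way to cross-check which of the two job dumps mention a given job ID is sketched below. It assumes the artifacts are plain-text dumps in which the job ID appears as a standalone run of digits. The helper name `job_in_dump` and the sample rows are hypothetical and are not copied from the artifacts.

```python
import re


def job_in_dump(dump_text, job_id):
    """Return True if job_id appears in dump_text as a whole number, not inside a longer one."""
    pattern = re.compile(rf"(?<!\d){re.escape(str(job_id))}(?!\d)")
    return pattern.search(dump_text) is not None


JOB_ID = 795106626557018113  # job ID from the error message in this report

# Illustrative stand-ins for the contents of the two artifact files.
system_jobs = "795106626557018113\trunning\t...\n"
crdb_internal_jobs = "123456789012345678\tsucceeded\t...\n"

print("system.jobs.txt:", job_in_dump(system_jobs, JOB_ID))
print("crdb_internal.jobs.txt:", job_in_dump(crdb_internal_jobs, JOB_ID))
```

The digit lookarounds keep a longer ID that merely contains the same digits from counting as a match.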
Btw, another way to avoid roachtest reuse of this issue is to remove the O-roachtest label (but of course that is a lie: this issue did originate with roachtest).
If this is now tracking the most recent posted failure on it, the "job ID is missing" one, then I'm removing "release-blocker" from this since that smells like some jobs vs testing flake and we haven't seen it again.
roachtest.import/tpcc/warehouses=4000/geo failed with artifacts on master @ 1129fbc650fe3a037b03aea1e5f1d8078618cb1c:
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=16, ROACHTEST_ssd=0
Help
See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)
Same failure on other branches
- #81430 roachtest: import/tpcc/warehouses=4000/geo failed [C-test-failure O-roachtest O-robot T-bulkio branch-release-22.1]
- #76824 roachtest: import/tpcc/warehouses=4000/geo failed [raft sideload oom] [C-test-failure O-roachtest O-robot S-3 T-kv-replication X-nostale branch-release-21.2 no-test-failure-activity]
/cc @cockroachdb/bulk-io
Jira issue: CRDB-18177