cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: import/tpcc/warehouses=4000/geo failed (job session ID missing) #85310

Closed cockroach-teamcity closed 2 years ago

cockroach-teamcity commented 2 years ago

roachtest.import/tpcc/warehouses=4000/geo failed with artifacts on master @ 1129fbc650fe3a037b03aea1e5f1d8078618cb1c:

          | golang.org/x/sync/errgroup.(*Group).Go.func1
          |     golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:74
          | runtime.goexit
          |     GOROOT/src/runtime/asm_amd64.s:1571
        Wraps: (2) output in run_102950.598422633_n1_cockroach_workload_fixtures_import_tpcc
        Wraps: (3) ./cockroach workload fixtures import tpcc --warehouses=4000 --csv-server='http://localhost:8081' returned
          | stderr:
          | I220729 10:29:52.480591 1 ccl/workloadccl/fixture.go:318  [-] 1  starting import of 9 tables
          | I220729 10:29:54.287002 102 ccl/workloadccl/fixture.go:481  [-] 2  imported 213 KiB in warehouse table (4000 rows, 0 index entries, took 1.543459645s, 0.13 MiB/s)
          | I220729 10:29:54.287352 103 ccl/workloadccl/fixture.go:481  [-] 3  imported 3.9 MiB in district table (40000 rows, 0 index entries, took 1.543742349s, 2.55 MiB/s)
          | I220729 10:29:54.504497 108 ccl/workloadccl/fixture.go:481  [-] 4  imported 7.9 MiB in item table (100000 rows, 0 index entries, took 1.760760909s, 4.47 MiB/s)
          | I220729 10:31:45.316682 107 ccl/workloadccl/fixture.go:481  [-] 5  imported 546 MiB in new_order table (36000000 rows, 0 index entries, took 1m52.572980871s, 4.85 MiB/s)
          | I220729 10:41:09.656506 105 ccl/workloadccl/fixture.go:481  [-] 6  imported 8.6 GiB in history table (120000000 rows, 0 index entries, took 11m16.912815734s, 13.02 MiB/s)
          | I220729 10:42:34.548314 106 ccl/workloadccl/fixture.go:481  [-] 7  imported 6.5 GiB in order table (120000000 rows, 120000000 index entries, took 12m41.804603025s, 8.76 MiB/s)
          |
          | stdout:
        Wraps: (4) secondary error attachment
          | UNCLASSIFIED_PROBLEM: context canceled
          | (1) UNCLASSIFIED_PROBLEM
          | Wraps: (2) Node 1. Command with error:
          |   | ``````
          |   | ./cockroach workload fixtures import tpcc --warehouses=4000 --csv-server='http://localhost:8081'
          |   | ``````
          | Wraps: (3) context canceled
          | Error types: (1) errors.Unclassified (2) *hintdetail.withDetail (3) *errors.errorString
        Wraps: (5) context canceled
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

    monitor.go:127,import.go:154,import.go:181,test_runner.go:896: monitor failure: monitor command failure: unexpected node event: 6: dead (exit status 137)
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     main/pkg/cmd/roachtest/monitor.go:115
          | main.(*monitorImpl).Wait
          |     main/pkg/cmd/roachtest/monitor.go:123
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func1
          |     github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:154
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func3
          |     github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:181
          | [...repeated from below...]
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func3
          |     main/pkg/cmd/roachtest/monitor.go:202
          | runtime.goexit
          |     GOROOT/src/runtime/asm_amd64.s:1571
        Wraps: (4) monitor command failure
        Wraps: (5) unexpected node event: 6: dead (exit status 137)
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)

See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

Same failure on other branches

- #81430 roachtest: import/tpcc/warehouses=4000/geo failed [C-test-failure O-roachtest O-robot T-bulkio branch-release-22.1]
- #76824 roachtest: import/tpcc/warehouses=4000/geo failed [raft sideload oom] [C-test-failure O-roachtest O-robot S-3 T-kv-replication X-nostale branch-release-21.2 no-test-failure-activity]

/cc @cockroachdb/bulk-io


Jira issue: CRDB-18177

adityamaru commented 2 years ago

node 6 was OOM killed according to 6.dmesg.txt
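For readers following along: the `exit status 137` in the monitor failure is consistent with an OOM kill, because a process terminated by signal N is conventionally reported as exit status 128+N, and 137 - 128 = 9 (SIGKILL). The Go sketch below decodes that status and scans kernel-log text for OOM-killer lines. It is a minimal illustration, not part of roachtest: the function names and the sample log text are hypothetical, and the real contents of `6.dmesg.txt` are not reproduced in this thread.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// signalFromExitStatus applies the common convention that a process killed
// by signal N is reported with exit status 128+N. It returns false when the
// status does not fall in that range.
func signalFromExitStatus(status int) (int, bool) {
	if status > 128 && status < 256 {
		return status - 128, true
	}
	return 0, false
}

// oomLines returns the lines of a kernel log that look like OOM-killer
// activity. The substrings matched here are an assumption about typical
// Linux kernel messages, not an exhaustive list.
func oomLines(dmesg string) []string {
	var hits []string
	sc := bufio.NewScanner(strings.NewReader(dmesg))
	for sc.Scan() {
		line := sc.Text()
		lower := strings.ToLower(line)
		if strings.Contains(lower, "out of memory") || strings.Contains(lower, "oom-kill") {
			hits = append(hits, line)
		}
	}
	return hits
}

func main() {
	if sig, ok := signalFromExitStatus(137); ok {
		fmt.Printf("exit status 137 -> signal %d\n", sig)
	}

	// Hypothetical sample; the real 6.dmesg.txt is not shown in this issue.
	sample := "[ 100.0] usb 1-1: new device\n" +
		"[ 200.0] Out of memory: Killed process 1234 (cockroach)\n"
	for _, l := range oomLines(sample) {
		fmt.Println(l)
	}
}
```

A status outside the 129 to 255 range (for example the `exit status 1` in the later failure) is an ordinary non-zero exit rather than a signal, which is one quick way to tell the two failure modes in this issue apart.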

adityamaru commented 2 years ago
[Screenshot: Screen Shot 2022-07-29 at 10 45 04 AM]

This looks like https://github.com/cockroachdb/cockroach/issues/73376. cc: @tbg in case the artifacts help further the investigation.

erikgrinaker commented 2 years ago

Removing the release-blocker label here, since this is a known issue that pre-dates 22.1.

blathers-crl[bot] commented 2 years ago

cc @cockroachdb/replication

cockroach-teamcity commented 2 years ago

roachtest.import/tpcc/warehouses=4000/geo failed with artifacts on master @ a82711442c65cf14489c55041b45b11a1e38415b:

        Wraps: (2) output in run_100123.166589701_n1_cockroach_workload_fixtures_import_tpcc
        Wraps: (3) ./cockroach workload fixtures import tpcc --warehouses=4000 --csv-server='http://localhost:8081' returned
          | stderr:
          | I220909 10:01:25.224016 1 ccl/workloadccl/fixture.go:318  [-] 1  starting import of 9 tables
          | I220909 10:01:32.075472 79 ccl/workloadccl/fixture.go:481  [-] 2  imported 7.9 MiB in item table (100000 rows, 0 index entries, took 6.147239254s, 1.28 MiB/s)
          | I220909 10:01:32.449515 74 ccl/workloadccl/fixture.go:481  [-] 3  imported 3.9 MiB in district table (40000 rows, 0 index entries, took 6.521467844s, 0.60 MiB/s)
          | I220909 10:01:33.339839 73 ccl/workloadccl/fixture.go:481  [-] 4  imported 213 KiB in warehouse table (4000 rows, 0 index entries, took 7.411826759s, 0.03 MiB/s)
          | I220909 10:02:29.302066 78 ccl/workloadccl/fixture.go:481  [-] 5  imported 546 MiB in new_order table (36000000 rows, 0 index entries, took 1m3.37385474s, 8.62 MiB/s)
          | Error: importing fixture: importing table history: pq: job 795106626557018113: could not mark as reverting: job 795106626557018113: with status running: expected session "aef9d1829fda40ec8aed76104bc9c51d" but found NULL
          |
          | stdout:
        Wraps: (4) COMMAND_PROBLEM
        Wraps: (5) Node 1. Command with error:
          | ``````
          | ./cockroach workload fixtures import tpcc --warehouses=4000 --csv-server='http://localhost:8081'
          | ``````
        Wraps: (6) exit status 1
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.Cmd (5) *hintdetail.withDetail (6) *exec.ExitError

    monitor.go:127,import.go:154,import.go:181,test_runner.go:906: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     main/pkg/cmd/roachtest/monitor.go:115
          | main.(*monitorImpl).Wait
          |     main/pkg/cmd/roachtest/monitor.go:123
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func1
          |     github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:154
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerImportTPCC.func3
          |     github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/import.go:181
          | main.(*testRunner).runTest.func2
          |     main/pkg/cmd/roachtest/test_runner.go:906
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     main/pkg/cmd/roachtest/monitor.go:171
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     main/pkg/cmd/roachtest/monitor.go:80
          | runtime.doInit
          |     GOROOT/src/runtime/proc.go:6340
          | runtime.main
          |     GOROOT/src/runtime/proc.go:233
          | runtime.goexit
          |     GOROOT/src/runtime/asm_amd64.s:1594
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)

See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

Same failure on other branches

- #81430 roachtest: import/tpcc/warehouses=4000/geo failed [C-test-failure O-roachtest O-robot T-bulkio branch-release-22.1]
- #76824 roachtest: import/tpcc/warehouses=4000/geo failed [raft sideload oom] [C-test-failure O-roachtest O-robot S-3 T-kv-replication X-nostale branch-release-21.2 no-test-failure-activity]


msbutler commented 2 years ago

This seems to be a different failure mode than a raft sideload OOM. It's really unfortunate that TC linked this new failure mode to the raft failure issue (I'll follow up with test eng on this). What I see so far:

I'll let current L2 further investigate.

tbg commented 2 years ago

Thread with discussion about issue reuse

tbg commented 2 years ago

Btw, another way to avoid roachtest reuse of this issue is to remove the O-roachtest label (but of course that is a lie: this issue did originate with roachtest).

dt commented 2 years ago

If this is now tracking the most recent posted failure on it, the "job ID is missing" one, then I'm removing "release-blocker" from this since that smells like some jobs vs testing flake and we haven't seen it again.