cockroachdb / cockroach

roachtest: kv/splits/nodes=3/quiesce=true failed #88658

Closed cockroach-teamcity closed 2 years ago

cockroach-teamcity commented 2 years ago

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 89f4ad907a1756551bd6864c3e8516eeff6b0e0a:

          | r12    0x2e2
          | r13    0x1
          | r14    0xc001bd2b60
          | r15    0x1
          | rip    0x49a101
          | rflags 0x286
          | cs     0x33
          | fs     0x0
          | gs     0x0
          |
          | stdout:
        Wraps: (4) SSH_PROBLEM
        Wraps: (5) Node 4. Command with error:
          | ``````
          | ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3}
          | ``````
        Wraps: (6) exit status 255
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.SSH (5) *hintdetail.withDetail (6) *exec.ExitError

    monitor.go:127,kv.go:729,test_runner.go:928: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     main/pkg/cmd/roachtest/monitor.go:115
          | main.(*monitorImpl).Wait
          |     main/pkg/cmd/roachtest/monitor.go:123
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKVSplits.func1
          |     github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:729
          | main.(*testRunner).runTest.func2
          |     main/pkg/cmd/roachtest/test_runner.go:928
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     main/pkg/cmd/roachtest/monitor.go:171
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     main/pkg/cmd/roachtest/monitor.go:80
          | runtime.doInit
          |     GOROOT/src/runtime/proc.go:6340
          | runtime.main
          |     GOROOT/src/runtime/proc.go:233
          | runtime.goexit
          |     GOROOT/src/runtime/asm_amd64.s:1594
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

    test_runner.go:1059,test_runner.go:958: test timed out (2h0m0s)

Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md)

See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

/cc @cockroachdb/kv-triage

Jira issue: CRDB-19868

cockroach-teamcity commented 2 years ago

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 51c8aae748d338549400c047796c6c9b892527da:

          | r12    0xc000086a00
          | r13    0x1
          | r14    0xc0021021a0
          | r15    0xffffffffffffffff
          | rip    0x49a101
          | rflags 0x286
          | cs     0x33
          | fs     0x0
          | gs     0x0
          |
          | stdout:
        Wraps: (4) SSH_PROBLEM
        Wraps: (5) Node 4. Command with error:
          | ``````
          | ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3}
          | ``````
        Wraps: (6) exit status 255
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.SSH (5) *hintdetail.withDetail (6) *exec.ExitError

    monitor.go:127,kv.go:729,test_runner.go:928: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     main/pkg/cmd/roachtest/monitor.go:115
          | main.(*monitorImpl).Wait
          |     main/pkg/cmd/roachtest/monitor.go:123
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKVSplits.func1
          |     github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:729
          | main.(*testRunner).runTest.func2
          |     main/pkg/cmd/roachtest/test_runner.go:928
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     main/pkg/cmd/roachtest/monitor.go:171
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     main/pkg/cmd/roachtest/monitor.go:80
          | runtime.doInit
          |     GOROOT/src/runtime/proc.go:6340
          | runtime.main
          |     GOROOT/src/runtime/proc.go:233
          | runtime.goexit
          |     GOROOT/src/runtime/asm_amd64.s:1594
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

    test_runner.go:1059,test_runner.go:958: test timed out (2h0m0s)

Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_ssd=0

cockroach-teamcity commented 2 years ago

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ ff13325e9368c4e8dd9a4d5cf4aa2ad2f33e9ac0:

          | r12    0x358
          | r13    0x3
          | r14    0xc000102b60
          | r15    0x1
          | rip    0x49a101
          | rflags 0x286
          | cs     0x33
          | fs     0x0
          | gs     0x0
          |
          | stdout:
        Wraps: (4) SSH_PROBLEM
        Wraps: (5) Node 4. Command with error:
          | ``````
          | ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3}
          | ``````
        Wraps: (6) exit status 255
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.SSH (5) *hintdetail.withDetail (6) *exec.ExitError

    monitor.go:127,kv.go:729,test_runner.go:928: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     main/pkg/cmd/roachtest/monitor.go:115
          | main.(*monitorImpl).Wait
          |     main/pkg/cmd/roachtest/monitor.go:123
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKVSplits.func1
          |     github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:729
          | main.(*testRunner).runTest.func2
          |     main/pkg/cmd/roachtest/test_runner.go:928
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     main/pkg/cmd/roachtest/monitor.go:171
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     main/pkg/cmd/roachtest/monitor.go:80
          | runtime.doInit
          |     GOROOT/src/runtime/proc.go:6340
          | runtime.main
          |     GOROOT/src/runtime/proc.go:233
          | runtime.goexit
          |     GOROOT/src/runtime/asm_amd64.s:1594
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

    test_runner.go:1059,test_runner.go:958: test timed out (2h0m0s)

Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_ssd=0

nvanbenschoten commented 2 years ago

This looks similar to https://github.com/cockroachdb/cockroach/issues/88678. We should fold the initial investigation into that one.

cockroach-teamcity commented 2 years ago

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ ff13325e9368c4e8dd9a4d5cf4aa2ad2f33e9ac0:

          | r12    0x7f714c1fdc48
          | r13    0x3
          | r14    0xc000602340
          | r15    0x7f717881b5c0
          | rip    0x49a101
          | rflags 0x286
          | cs     0x33
          | fs     0x0
          | gs     0x0
          |
          | stdout:
        Wraps: (4) SSH_PROBLEM
        Wraps: (5) Node 4. Command with error:
          | ``````
          | ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3}
          | ``````
        Wraps: (6) exit status 255
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.SSH (5) *hintdetail.withDetail (6) *exec.ExitError

    monitor.go:127,kv.go:729,test_runner.go:928: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     main/pkg/cmd/roachtest/monitor.go:115
          | main.(*monitorImpl).Wait
          |     main/pkg/cmd/roachtest/monitor.go:123
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKVSplits.func1
          |     github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:729
          | main.(*testRunner).runTest.func2
          |     main/pkg/cmd/roachtest/test_runner.go:928
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     main/pkg/cmd/roachtest/monitor.go:171
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     main/pkg/cmd/roachtest/monitor.go:80
          | runtime.doInit
          |     GOROOT/src/runtime/proc.go:6340
          | runtime.main
          |     GOROOT/src/runtime/proc.go:233
          | runtime.goexit
          |     GOROOT/src/runtime/asm_amd64.s:1594
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

    test_runner.go:1059,test_runner.go:958: test timed out (2h0m0s)

Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_ssd=0

nvanbenschoten commented 2 years ago

5/5 failures on master (beb40b52b380bf4ee15349445dbc674acafa5046) when stressing this test:

GCE_PROJECT=andrei-jepsen ./pkg/cmd/roachtest/roachstress.sh -c5 'kv/splits/nodes=3/quiesce=true$$' -- --cpu-quota=1000

erikgrinaker commented 2 years ago

I got 4/5 failures on master (8107342458), but only 1/5 when reverting gRPC to 1.46. Doing another set to confirm.

Reverting gRPC to 1.46.0 in #88745 and #88749.

erikgrinaker commented 2 years ago

Second set had 3/5 failures on master and 1/5 with gRPC reverted. The failure modes also differ: the master runs tripped replica circuit breakers, while the gRPC-revert runs did not.

The gRPC reverts should address the proximate cause here; we should look into the other failures separately.
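
For reference, the arithmetic behind the two stress sets reported above (4/5 and 3/5 failures on master, 1/5 and 1/5 with gRPC reverted) can be pooled as in the sketch below. The `stressSet` type and `pooledRate` helper are hypothetical names used only for illustration and are not part of roachtest or roachstress. With only ten runs per side this is a rough indication, not a statistically firm result.

```go
package main

import "fmt"

// stressSet is a hypothetical record of one stress batch:
// how many runs failed out of how many were started.
type stressSet struct {
	failures, runs int
}

// pooledRate sums failures and runs across all sets and returns
// the overall failure fraction (0 if there were no runs).
func pooledRate(sets []stressSet) float64 {
	var failures, runs int
	for _, s := range sets {
		failures += s.failures
		runs += s.runs
	}
	if runs == 0 {
		return 0
	}
	return float64(failures) / float64(runs)
}

func main() {
	// Counts taken from the two sets reported in this thread.
	master := []stressSet{{4, 5}, {3, 5}}
	reverted := []stressSet{{1, 5}, {1, 5}}

	fmt.Printf("master:    %.0f%% of runs failed\n", 100*pooledRate(master))
	fmt.Printf("gRPC 1.46: %.0f%% of runs failed\n", 100*pooledRate(reverted))
}
```

Pooled, that is 7/10 failures on master versus 2/10 with the revert, which is consistent with the revert addressing the main cause while leaving a separate, less frequent failure.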

cockroach-teamcity commented 2 years ago

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ a0bfa6dafcc206301d3a21887c374db63b377075:

          | r12    0x425
          | r13    0x3
          | r14    0xc000102b60
          | r15    0x1
          | rip    0x49a101
          | rflags 0x286
          | cs     0x33
          | fs     0x0
          | gs     0x0
          |
          | stdout:
        Wraps: (4) SSH_PROBLEM
        Wraps: (5) Node 4. Command with error:
          | ``````
          | ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3}
          | ``````
        Wraps: (6) exit status 255
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) errors.SSH (5) *hintdetail.withDetail (6) *exec.ExitError

    monitor.go:127,kv.go:729,test_runner.go:928: monitor failure: monitor task failed: t.Fatal() was called
        (1) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).WaitE
          |     main/pkg/cmd/roachtest/monitor.go:115
          | main.(*monitorImpl).Wait
          |     main/pkg/cmd/roachtest/monitor.go:123
          | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKVSplits.func1
          |     github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:729
          | main.(*testRunner).runTest.func2
          |     main/pkg/cmd/roachtest/test_runner.go:928
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          -- stack trace:
          | main.(*monitorImpl).wait.func2
          |     main/pkg/cmd/roachtest/monitor.go:171
        Wraps: (4) monitor task failed
        Wraps: (5) attached stack trace
          -- stack trace:
          | main.init
          |     main/pkg/cmd/roachtest/monitor.go:80
          | runtime.doInit
          |     GOROOT/src/runtime/proc.go:6340
          | runtime.main
          |     GOROOT/src/runtime/proc.go:233
          | runtime.goexit
          |     GOROOT/src/runtime/asm_amd64.s:1594
        Wraps: (6) t.Fatal() was called
        Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

    test_runner.go:1059,test_runner.go:958: test timed out (2h0m0s)

Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_ssd=0
