cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: c2c/disconnect failed [ssh flake while starting grafana in latest failure] #130473

Closed cockroach-teamcity closed 44 minutes ago

cockroach-teamcity commented 1 week ago

roachtest.c2c/disconnect failed with artifacts on release-24.2 @ 90b634dc4a9c7da1d37b2d845272b19b3ff10f44:

(test_runner.go:1284).runTest: test timed out (20m0s)
test artifacts and logs in: /artifacts/c2c/disconnect/cpu_arch=arm64/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/disaster-recovery

Jira issue: CRDB-42071

cockroach-teamcity commented 3 days ago

roachtest.c2c/disconnect failed with artifacts on release-24.2 @ 7a32a78a1f7a691f32a131d79f6ae00a19e20e86:

                      |   |     github.com/cockroachdb/cockroach/pkg/roachprod/install/session.go:143
                      |   | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*remoteSession).CombinedOutput.func1
                      |   |     github.com/cockroachdb/cockroach/pkg/roachprod/install/session.go:156
                      |   | runtime.goexit
                      |   |     src/runtime/asm_amd64.s:1695
                      | Wraps: (3) _potential_ SSH flake (``ssh -vvv`` log retained in /artifacts/c2c/disconnect/cpu_arch=arm64/run_1/ssh/ssh_095013.962408842_n4_cd-nodeexporter-sudo.log)
                      | Wraps: (4) TRANSIENT_ERROR(ssh_problem)
                      | Wraps: (5) exit status 255
                      | Error types: (1) *hintdetail.withDetail (2) *withstack.withStack (3) *errutil.withPrefix (4) errors.TransientError (5) *exec.ExitError
                    Wraps: (7) secondary error attachment
                      | _potential_ SSH flake (``ssh -vvv`` log retained in /artifacts/c2c/disconnect/cpu_arch=arm64/run_1/ssh/ssh_094645.665410474_n4_cd-nodeexporter-sudo.log): TRANSIENT_ERROR(ssh_problem): exit status 255
                      | (1) Node 4. Command with error:
                      |   | ``````
                      |   | cd node_exporter &&
                      |   | sudo systemd-run --unit node_exporter --same-dir ./node_exporter
                      |   | ``````
                      |   | <no output>
                      | Wraps: (2) attached stack trace
                      |   -- stack trace:
                      |   | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*remoteSession).errWithDebug
                      |   |     github.com/cockroachdb/cockroach/pkg/roachprod/install/session.go:143
                      |   | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*remoteSession).CombinedOutput.func1
                      |   |     github.com/cockroachdb/cockroach/pkg/roachprod/install/session.go:156
                      |   | runtime.goexit
                      |   |     src/runtime/asm_amd64.s:1695
                      | Wraps: (3) _potential_ SSH flake (``ssh -vvv`` log retained in /artifacts/c2c/disconnect/cpu_arch=arm64/run_1/ssh/ssh_094645.665410474_n4_cd-nodeexporter-sudo.log)
                      | Wraps: (4) TRANSIENT_ERROR(ssh_problem)
                      | Wraps: (5) exit status 255
                      | Error types: (1) *hintdetail.withDetail (2) *withstack.withStack (3) *errutil.withPrefix (4) errors.TransientError (5) *exec.ExitError
                    Wraps: (8) Node 4. Command with error:
                      | ``````
                      | cd node_exporter &&
                      | sudo systemd-run --unit node_exporter --same-dir ./node_exporter
                      | ``````
                      | <no output>
                    Wraps: (9) attached stack trace
                      -- stack trace:
                      | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*remoteSession).errWithDebug
                      |     github.com/cockroachdb/cockroach/pkg/roachprod/install/session.go:143
                      | github.com/cockroachdb/cockroach/pkg/roachprod/install.(*remoteSession).CombinedOutput.func1
                      |     github.com/cockroachdb/cockroach/pkg/roachprod/install/session.go:156
                      | runtime.goexit
                      |     src/runtime/asm_amd64.s:1695
                    Wraps: (10) _potential_ SSH flake (``ssh -vvv`` log retained in /artifacts/c2c/disconnect/cpu_arch=arm64/run_1/ssh/ssh_094340.277696995_n4_cd-nodeexporter-sudo.log)
                    Wraps: (11) TRANSIENT_ERROR(ssh_problem)
                    Wraps: (12) exit status 255
                    Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *markers.withMark (4) *withstack.withStack (5) *errutil.withPrefix (6) *secondary.withSecondaryError (7) *secondary.withSecondaryError (8) *hintdetail.withDetail (9) *withstack.withStack (10) *errutil.withPrefix (11) errors.TransientError (12) *exec.ExitError
    Test:           c2c/disconnect
(require.go:1357).NoError: FailNow called
test artifacts and logs in: /artifacts/c2c/disconnect/cpu_arch=arm64/run_1

msbutler commented 1 day ago

hrm, in the failure from a week ago, the replication completed, but setting this cluster setting afterwards failed because the connection was reset during a read:

error executing query="ALTER TENANT $1 SET CLUSTER SETTING sql.zone_configs.allow_for_secondary_tenant.enabled=true" args=["destination-tenant"]: read tcp 172.17.0.3:48366 -> 34.71.101.190:26257: read: connection reset by peer
(1) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/testutils/sqlutils.(*SQLRunner).ExecWithMessage
  |   github.com/cockroachdb/cockroach/pkg/testutils/sqlutils/sql_runner.go:99
  | github.com/cockroachdb/cockroach/pkg/testutils/sqlutils.(*SQLRunner).Exec
  |   github.com/cockroachdb/cockroach/pkg/testutils/sqlutils/sql_runner.go:88
  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.deprecatedStartInMemoryTenant
  |   github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/multitenant_utils.go:366
  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.(*replicationDriver).main
  |   github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1025
  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerClusterReplicationDisconnect.func1.2
  |   github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1721
  | main.(*monitorImpl).Go.func1
  |   main/pkg/cmd/roachtest/monitor.go:120
  | golang.org/x/sync/errgroup.(*Group).Go.func1
  |   golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:78
  | runtime.goexit
  |   src/runtime/asm_amd64.s:1695
Wraps: (2) secondary error attachment
  | read tcp 172.17.0.3:48366 -> 34.71.101.190:26257: read: connection reset by peer
  | (1) read tcp 172.17.0.3:48366 -> 34.71.101.190:26257
  | Wraps: (2) read
  | Wraps: (3) connection reset by peer
  | Error types: (1) *net.OpError (2) *os.SyscallError (3) syscall.Errno
Wraps: (3) error executing query="ALTER TENANT $1 SET CLUSTER SETTING sql.zone_configs.allow_for_secondary_tenant.enabled=true" args=["destination-tenant"]: read tcp 172.17.0.3:48366 -> 34.71.101.190:26257: read: connection reset by peer
Error types: (1) *withstack.withStack (2) *secondary.withSecondaryError (3) *errutil.leafError
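
For anyone triaging similar failures: the innermost error in the chain above is a `syscall.Errno` wrapped in an `*os.SyscallError` and a `*net.OpError`. A minimal sketch of detecting that case with the standard library follows. The names `isConnReset` and `exampleResetErr` are hypothetical, and the real test wraps errors with `cockroachdb/errors` rather than `fmt.Errorf`, so treat this as an illustration of the unwrapping idea only.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
)

// isConnReset reports whether err, or any error it wraps, is ECONNRESET
// ("connection reset by peer"). Hypothetical helper, not roachtest code.
func isConnReset(err error) bool {
	return errors.Is(err, syscall.ECONNRESET)
}

// exampleResetErr builds an error with the same layering as the report:
// a query-level message wrapping *net.OpError -> *os.SyscallError -> Errno.
func exampleResetErr() error {
	opErr := &net.OpError{
		Op:  "read",
		Net: "tcp",
		Err: os.NewSyscallError("read", syscall.ECONNRESET),
	}
	return fmt.Errorf("error executing query: %w", opErr)
}
```

`errors.Is` walks the `Unwrap` chain, so the check does not depend on how many layers of context were added on top of the network error.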

msbutler commented 1 day ago

Oh no, this stream never replanned because stream_replication.lag_check_frequency doesn't actually do anything. I'll deal with this.

That lack of frequent replanning caused the lag to climb up to 10 minutes.
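
To illustrate what a working lag check could look like, here is a rough sketch of a loop that polls the replication lag on a fixed interval and asks for a replan when the lag is too high. Everything in it (`shouldReplan`, `pollLag`, the `maxLag` threshold, the callbacks) is hypothetical and is not how the stream ingestion job is implemented.

```go
package main

import (
	"context"
	"time"
)

// shouldReplan reports whether the observed lag exceeds the allowed maximum.
// Hypothetical policy: any lag strictly above maxLag triggers a replan.
func shouldReplan(lag, maxLag time.Duration) bool {
	return lag > maxLag
}

// pollLag samples the lag once per interval until ctx is cancelled and
// calls replan whenever the sample exceeds maxLag.
// The interval must be positive, otherwise time.NewTicker panics.
func pollLag(
	ctx context.Context,
	interval, maxLag time.Duration,
	getLag func() time.Duration,
	replan func(),
) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if shouldReplan(getLag(), maxLag) {
				replan()
			}
		}
	}
}
```

If the interval is never actually consulted, as with the setting described above, nothing bounds how far the lag can grow between replans.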

msbutler commented 1 day ago

The latest failure looks like an infra flake while starting grafana, so I'll send this to test eng. The failing call is at https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/tests/cluster_to_cluster.go#L619

2024/09/17 09:53:29 test_impl.go:420: test failure #1: full stack retained in failure_1.log: (assertions.go:363).Fail:
  Error Trace:  github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:619
                      github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cluster_to_cluster.go:1711
                      main/pkg/cmd/roachtest/test_runner.go:1255
                      src/runtime/asm_amd64.s:1695
  Error:        Received unexpected error:
                grafana-start currently cannot run on darwin: error persisted after 3 attempts: _potential_ SSH flake (`ssh -vvv` log retained in /artifacts/c2c/disconnect/cpu_arch=arm64/run_1/ssh/ssh_094340.277696995_n4_cd-nodeexporter-sudo.log): TRANSIENT_ERROR(ssh_problem): exit status 255
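
The "error persisted after 3 attempts" message above suggests a bounded retry around commands that fail with a transient SSH error. A minimal sketch of that pattern is below. `errTransient` and `retryTransient` are hypothetical names and this is not roachprod's retry code; it only shows the general shape of retrying errors marked as transient and failing fast on anything else.

```go
package main

import (
	"errors"
	"fmt"
)

// errTransient is a hypothetical sentinel marking errors worth retrying,
// standing in for the TRANSIENT_ERROR(ssh_problem) marker in the log.
var errTransient = errors.New("transient error")

// retryTransient runs op up to attempts times. It retries only while op
// returns an error wrapping errTransient; other errors are returned as-is.
func retryTransient(attempts int, op func() error) error {
	if attempts < 1 {
		attempts = 1 // always try at least once
	}
	var err error
	for i := 0; i < attempts; i++ {
		err = op()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errTransient) {
			return err // not transient: do not retry
		}
	}
	return fmt.Errorf("error persisted after %d attempts: %w", attempts, err)
}
```

Because the final error wraps the last failure with `%w`, callers can still use `errors.Is` to tell a persistent flake apart from a genuine test failure.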

blathers-crl[bot] commented 1 day ago

cc @cockroachdb/test-eng

renatolabs commented 44 minutes ago

Instance of #131094, closing.