
roachtest: c2c/shutdown/dest/coordinator failed #132730


cockroach-teamcity commented 1 month ago

roachtest.c2c/shutdown/dest/coordinator failed with artifacts on release-24.1 @ e5b1d125bf8cde9b4d47f4303b8a76ec735ca082:

(latency_verifier.go:198).assertValid: max latency was more than allowed: 2m0.290093216s vs 2m0s
test artifacts and logs in: /artifacts/c2c/shutdown/dest/coordinator/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

- #131165 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.2]
- #128742 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.1.3-rc]

/cc @cockroachdb/disaster-recovery


Jira issue: CRDB-43245

cockroach-teamcity commented 1 month ago

roachtest.c2c/shutdown/dest/coordinator failed with artifacts on release-24.1 @ 39bae3f4961c14c890d6140e9268d2fbf0ca324a:

(latency_verifier.go:198).assertValid: max latency was more than allowed: 2m0.852996135s vs 2m0s
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/shutdown/dest/coordinator/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

- #131165 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.2]
- #128742 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.1.3-rc]

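An aside on the `monitor failure` line in the report above: roachtest runs user tasks under a monitor, and a `t.Fatal()` inside a task surfaces through the monitor's `Wait`. Below is a toy sketch of that pattern; the names and wrapping logic are simplifications for illustration, not the actual roachtest API (the real implementation lives under pkg/cmd/roachtest).

```go
package main

import (
	"errors"
	"fmt"
)

// monitor is a toy stand-in for roachtest's monitor: it runs user tasks in
// goroutines, and Wait surfaces the first failure with the same wrapping
// seen in the report ("monitor failure: monitor user task failed: ...").
type monitor struct {
	errs chan error
	n    int
}

func newMonitor() *monitor { return &monitor{errs: make(chan error, 16)} }

// Go launches a user task and records its result.
func (m *monitor) Go(task func() error) {
	m.n++
	go func() {
		if err := task(); err != nil {
			m.errs <- fmt.Errorf("monitor user task failed: %w", err)
			return
		}
		m.errs <- nil
	}()
}

// Wait blocks until all tasks finish or one fails, wrapping the failure.
func (m *monitor) Wait() error {
	for i := 0; i < m.n; i++ {
		if err := <-m.errs; err != nil {
			return fmt.Errorf("monitor failure: %w", err)
		}
	}
	return nil
}

func main() {
	m := newMonitor()
	// In the real test, the latency verifier calls t.Fatal() when the 2m
	// threshold is exceeded; the monitor reports it as a task failure.
	m.Go(func() error { return errors.New("t.Fatal() was called") })
	fmt.Println(m.Wait())
	// Output: monitor failure: monitor user task failed: t.Fatal() was called
}
```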

msbutler commented 1 month ago

There may actually be smoke here. The max latency from this shutdown isn't actually just a hair over 2 minutes; rather, the latency verifier exits as soon as it sees latency above 2 minutes, so the reported value is only a lower bound on how far the stream actually fell behind.
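For intuition, here is a minimal, hypothetical sketch of that fail-fast shape (the names and structure are assumptions; the real check in latency_verifier.go is more involved). Because it returns at the first sample over the limit, the ~2m values in the reports above are first offending samples, not true peaks:

```go
package main

import (
	"fmt"
	"time"
)

// maxAllowedLatency mirrors the 2m threshold from the failures above
// (hypothetical constant; the real value comes from the test spec).
const maxAllowedLatency = 2 * time.Minute

// verifyLatency scans replication-lag samples and fails at the FIRST
// sample over the threshold, so the "max latency" it reports is just
// the first offending observation, not the true maximum.
func verifyLatency(samples []time.Duration) error {
	for _, lag := range samples {
		if lag > maxAllowedLatency {
			return fmt.Errorf("max latency was more than allowed: %s vs %s",
				lag, maxAllowedLatency)
		}
	}
	return nil
}

func main() {
	// The lag kept growing after the first over-threshold sample, but the
	// verifier never sees the later, larger values.
	samples := []time.Duration{
		90 * time.Second,
		2*time.Minute + 290*time.Millisecond, // first sample over 2m: reported
		5 * time.Minute,                      // hypothetical true max: never reported
	}
	fmt.Println(verifyLatency(samples))
}
```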

To construct a timeline of this test:

Given that cutover succeeded, the stream was able to catch up after the node shutdown.

Next steps, I think, are:

msbutler commented 1 month ago

Aha! The default session liveness TTL is 40 seconds, so after the node was SIGKILLed, its session could not be destroyed until after the second adoption loop! https://github.com/msbutler/cockroach/blob/butler-remove-deprecated-restore-checkpointing/pkg/sql/sqlliveness/slbase/slbase.go#L20
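To spell out the arithmetic, a back-of-the-envelope sketch: the 40s TTL comes from the linked slbase.go, while the 30s adoption-loop interval is an assumed value for illustration (in practice it is governed by a jobs cluster setting).

```go
package main

import (
	"fmt"
	"time"
)

// Assumed values: 40s session TTL per the linked slbase.go; the
// adoption-loop interval here is hypothetical.
const (
	sessionTTL    = 40 * time.Second
	adoptInterval = 30 * time.Second
)

func main() {
	kill := time.Duration(0) // t=0: coordinator node is SIGKILLed

	// The dead node's sqlliveness session cannot be declared expired until
	// its TTL elapses, so an adoption loop that fires before t=40s still
	// sees a "live" session and does not reclaim the replication job.
	sessionExpires := kill + sessionTTL

	for i := 1; ; i++ {
		loop := time.Duration(i) * adoptInterval
		if loop < sessionExpires {
			fmt.Printf("adoption loop %d at t=%s: session still live, job not adopted\n", i, loop)
			continue
		}
		fmt.Printf("adoption loop %d at t=%s: session expired, job adopted\n", i, loop)
		break
	}
}
```

Under these assumptions, the loop at t=30s still sees a live session and the job is only adopted at the second loop at t=60s, consistent with the second-adoption-loop observation above.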