roachtest: c2c/shutdown/dest/coordinator failed

cockroach-teamcity commented 1 month ago

roachtest.c2c/shutdown/dest/coordinator failed with artifacts on release-24.1 @ e5b1d125bf8cde9b4d47f4303b8a76ec735ca082:

(latency_verifier.go:198).assertValid: max latency was more than allowed: 2m0.290093216s vs 2m0s
test artifacts and logs in: /artifacts/c2c/shutdown/dest/coordinator/run_1

Parameters:

ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=8
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_runtimeAssertionsBuild=false
ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

- #131165 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.2] - #128742 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.1.3-rc]

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-43245

cockroach-teamcity commented 1 month ago

roachtest.c2c/shutdown/dest/coordinator failed with artifacts on release-24.1 @ 39bae3f4961c14c890d6140e9268d2fbf0ca324a:

(latency_verifier.go:198).assertValid: max latency was more than allowed: 2m0.852996135s vs 2m0s
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/shutdown/dest/coordinator/run_1

Parameters:

ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=8
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_runtimeAssertionsBuild=false
ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

- #131165 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.2] - #128742 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.1.3-rc]

This test on roachdash | Improve this report!

msbutler commented 1 month ago

There may actually be smoke here. the max latency from this shutdown isn't actually a hair over 2 minutes-- rather, the latency verifier exits as soon as it sees latency above 2 minutes....

To construct a timeline of this test:

6:47:47 node shutdown occurs on node 7

06:47:48 monitor.go:203: Monitor event: n7: error: no cockroach processes running

6:48:46 node 5 picks up the job, i.e. a minute after node 7 shut the job down. the base adoption interval is 30 seconds, so i'm not sure why a node didn't pick up the job after the first 30 seconds had passed.

5.unredacted/cockroach.log:I241016 06:48:46.980619 25353 ccl/streamingccl/streamingest/stream_ingestion_job.go:408 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=1012501106125406211›] 520  empty cutover time, no revert required
5.unredacted/cockroach.log:I241016 06:48:46.986365 25353 ccl/streamingccl/streamingest/stream_ingestion_job.go:93 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=1012501106125406211›] 521  resuming stream (producer job 1012501106433556484) from 1729061216.168387056,0

Furthermore, by the time node 5 picks up the job, the replication lag is already at 1m:47s, so the stream only has 13 seconds to complete it's catchups scans, before the latency verifier fails

06:48:43 latency_verifier.go:140: test status: watching stream-ingestion: end-to-end steady latency 1m47.788s; max steady latency so far 1m47.788s; highwater 2024-10-16 06:46:56.168387056 +0000 UTC

Given that cutover succeeded, the stream was able to catch up after the node shutdown.

Next steps i think are:

understand why the job wasn't adopted after 1 30s interval
consider bumping the max latency in the test or changing how the latency verifier works.

msbutler commented 1 month ago

aha! the default session liveness is 40 seconds, so after the node was sigkilled, it's session could not be destroyed until after the second adoption loop! https://github.com/msbutler/cockroach/blob/butler-remove-deprecated-restore-checkpointing/pkg/sql/sqlliveness/slbase/slbase.go#L20

cockroachdb / cockroach

roachtest: c2c/shutdown/dest/coordinator failed #132730