Open cockroach-teamcity opened 1 month ago
roachtest.c2c/shutdown/dest/coordinator failed with artifacts on release-24.1 @ 39bae3f4961c14c890d6140e9268d2fbf0ca324a:
(latency_verifier.go:198).assertValid: max latency was more than allowed: 2m0.852996135s vs 2m0s
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/shutdown/dest/coordinator/run_1
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=8
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_runtimeAssertionsBuild=false
ROACHTEST_ssd=0
See: roachtest README
See: How To Investigate (internal)
See: Grafana
- #131165 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.2] - #128742 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.1.3-rc]
There may actually be smoke here. the max latency from this shutdown isn't actually a hair over 2 minutes-- rather, the latency verifier exits as soon as it sees latency above 2 minutes....
To construct a timeline of this test:
6:47:47 node shutdown occurs on node 7
06:47:48 monitor.go:203: Monitor event: n7: error: no cockroach processes running
6:48:46 node 5 picks up the job, i.e. a minute after node 7 shut the job down. the base adoption interval is 30 seconds, so i'm not sure why a node didn't pick up the job after the first 30 seconds had passed.
5.unredacted/cockroach.log:I241016 06:48:46.980619 25353 ccl/streamingccl/streamingest/stream_ingestion_job.go:408 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=1012501106125406211›] 520 empty cutover time, no revert required
5.unredacted/cockroach.log:I241016 06:48:46.986365 25353 ccl/streamingccl/streamingest/stream_ingestion_job.go:93 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=1012501106125406211›] 521 resuming stream (producer job 1012501106433556484) from 1729061216.168387056,0
Furthermore, by the time node 5 picks up the job, the replication lag is already at 1m:47s, so the stream only has 13 seconds to complete it's catchups scans, before the latency verifier fails
06:48:43 latency_verifier.go:140: test status: watching stream-ingestion: end-to-end steady latency 1m47.788s; max steady latency so far 1m47.788s; highwater 2024-10-16 06:46:56.168387056 +0000 UTC
Given that cutover succeeded, the stream was able to catch up after the node shutdown.
Next steps i think are:
aha! the default session liveness is 40 seconds, so after the node was sigkilled, it's session could not be destroyed until after the second adoption loop! https://github.com/msbutler/cockroach/blob/butler-remove-deprecated-restore-checkpointing/pkg/sql/sqlliveness/slbase/slbase.go#L20
roachtest.c2c/shutdown/dest/coordinator failed with artifacts on release-24.1 @ e5b1d125bf8cde9b4d47f4303b8a76ec735ca082:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=8
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_runtimeAssertionsBuild=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
Same failure on other branches
- #131165 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.2] - #128742 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.1.3-rc]
/cc @cockroachdb/disaster-recovery
This test on roachdash | Improve this report!
Jira issue: CRDB-43245