Closed cockroach-teamcity closed 6 hours ago
roachtest.restore/pause/tpce/15GB/aws/nodes=4/cpus=8 failed with artifacts on master @ bcc993d796d03664604bf695e38fd5644d0bc952:
(soon.go:60).SucceedsWithin: condition failed to evaluate within 45s: from restore.go:206: expected status ``paused`` but found pause-requested
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/restore/pause/tpce/15GB/aws/nodes=4/cpus=8/run_1
Parameters:
arch=amd64
cloud=aws
coverageBuild=false
cpu=8
encrypted=false
fs=ext4
localSSD=false
runtimeAssertionsBuild=false
ssd=0
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for aws clusters
By the time the debug zip was taken, the job reached a paused state:
1024699141675843585 RESTORE RESTORE DATABASE tpce FROM '/2023/03/14-170004.31' IN 's3://cockroach-fixtures-us-east-2/backups/tpc-e/customers=1000/v22.2.1/inc-count=48?AUTH=implicit' AS OF SYSTEM TIME '2023-03-14T17:55:00Z' WITH OPTIONS (detached, unsafe_restore_incompatible_version) root {129,134,130,107,123,105,136,128,122,118,119,108,125,114,104,112,126,135,124,111,133,113,137,127,131,116,117,132,109,121,115,138,120,110,106} paused NULL 2024-11-28 08:49:20.609088+00 2024-11-28 08:49:40.843239+00 NULL 2024-11-28 08:49:43.253125+00 0.2629108130931854 NULL NULL 4590080351616260807 NULL NULL
I think we are seeing this flake because #135191 effectively changed how long the test waits for the job from 1 minute and 45 seconds to 45 seconds.
Reading through the job system code, I was confused as to why this is taking more than 45 seconds to reach the paused state. Pausing is handled by the pause/cancel loop which is supposed to run every 10 seconds.
https://github.com/cockroachdb/cockroach/blob/b499b3a5e08d3b10d4d542ddb2f100f35838566f/pkg/jobs/config.go#L36
I think what is happening is before the pause-requested is served, the job errors out trying to update its status. Here's the relevant log line:
E241128 08:49:46.951985 2761 jobs/adopt.go:414 ⋮ [T1,Vsystem,n3,job=1024699141675843585] 208 job 1024699141675843585: adoption completed with error ‹cannot update progress on pause-requested job (id 1024699141675843585)›
So this delays pausing until something attempts to adopt the job, which happens up to 30 seconds later.
roachtest.restore/pause/tpce/15GB/aws/nodes=4/cpus=8 failed with artifacts on master @ f2ce52ff7cc3b4680e0fec2ef844baad49042d6a:
Parameters:
arch=amd64
cloud=aws
coverageBuild=false
cpu=8
encrypted=false
fs=ext4
localSSD=false
runtimeAssertionsBuild=false
ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for aws clusters
/cc @cockroachdb/disaster-recoveryThis test on roachdash | Improve this report!
Jira issue: CRDB-44982