cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.22k stars 3.82k forks source link

roachtest: restore/pause/tpce/15GB/aws/nodes=4/cpus=8 failed #136348

Closed cockroach-teamcity closed 6 hours ago

cockroach-teamcity commented 6 days ago

roachtest.restore/pause/tpce/15GB/aws/nodes=4/cpus=8 failed with artifacts on master @ f2ce52ff7cc3b4680e0fec2ef844baad49042d6a:

(soon.go:60).SucceedsWithin: condition failed to evaluate within 45s: from restore.go:206: expected status ``paused`` but found pause-requested
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/restore/pause/tpce/15GB/aws/nodes=4/cpus=8/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-44982

cockroach-teamcity commented 4 days ago

roachtest.restore/pause/tpce/15GB/aws/nodes=4/cpus=8 failed with artifacts on master @ bcc993d796d03664604bf695e38fd5644d0bc952:

(soon.go:60).SucceedsWithin: condition failed to evaluate within 45s: from restore.go:206: expected status ``paused`` but found pause-requested
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/restore/pause/tpce/15GB/aws/nodes=4/cpus=8/run_1

Parameters:

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

This test on roachdash | Improve this report!

jeffswenson commented 1 day ago

By the time the debug zip was taken, the job reached a paused state:

1024699141675843585 RESTORE RESTORE DATABASE tpce FROM '/2023/03/14-170004.31' IN 's3://cockroach-fixtures-us-east-2/backups/tpc-e/customers=1000/v22.2.1/inc-count=48?AUTH=implicit' AS OF SYSTEM TIME '2023-03-14T17:55:00Z' WITH OPTIONS (detached, unsafe_restore_incompatible_version)     root    {129,134,130,107,123,105,136,128,122,118,119,108,125,114,104,112,126,135,124,111,133,113,137,127,131,116,117,132,109,121,115,138,120,110,106}   paused  NULL    2024-11-28 08:49:20.609088+00   2024-11-28 08:49:40.843239+00   NULL    2024-11-28 08:49:43.253125+00   0.2629108130931854  NULL        NULL    4590080351616260807 NULL    NULL

I think we are seeing this flake because #135191 effectively changed how long the test waits for the job from 1 minute and 45 seconds to 45 seconds.

jeffswenson commented 1 day ago

Reading through the job system code, I was confused as to why this is taking more than 45 seconds to reach the paused state. Pausing is handled by the pause/cancel loop which is supposed to run every 10 seconds.

https://github.com/cockroachdb/cockroach/blob/b499b3a5e08d3b10d4d542ddb2f100f35838566f/pkg/jobs/config.go#L36

I think what is happening is before the pause-requested is served, the job errors out trying to update its status. Here's the relevant log line:

E241128 08:49:46.951985 2761 jobs/adopt.go:414 ⋮ [T1,Vsystem,n3,job=1024699141675843585] 208  job 1024699141675843585: adoption completed with error ‹cannot update progress on pause-requested job (id 1024699141675843585)›

So this delays pausing until something attempts to adopt the job, which happens up to 30 seconds later.