cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

jobs: misleading error can return to user when job enters paused state #129588

Open msbutler opened 2 months ago

msbutler commented 2 months ago

The resumeContext passed to stepThroughStateMachine initially here: https://github.com/msbutler/cockroach/blob/butler-pcr-mixed-version/pkg/jobs/adopt.go#L451

may get cancelled right after the state machine moves a job from running to pause requested here: https://github.com/msbutler/cockroach/blob/butler-pcr-mixed-version/pkg/jobs/registry.go#L1658

As soon as the job's status has moved to PauseRequested, even before the PauseRequested() func has returned to the client, the pause/cancel loop will cancel the resumer ctx here: https://github.com/msbutler/cockroach/blob/butler-pcr-mixed-version/pkg/jobs/adopt.go#L542

We observed this in some unit test flakes, e.g. https://github.com/cockroachdb/cockroach/issues/128745#issuecomment-2296786847

PauseRequested() then returns an error to the client even though its txn has already committed, which explains the error we see in the test flakes:

 pq: aborted in DistSender: result is ambiguous: context canceled
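
The sequence above can be reproduced in miniature. The sketch below is not the jobs registry code: `fakeStore`, `updateStatus`, and `pauseWithResumeCtx` are hypothetical names, and the `onCommit` hook is an assumed stand-in for the pause/cancel loop noticing the new status. It only illustrates how a write can commit and still report a cancellation error to its caller.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// fakeStore is a hypothetical stand-in for the jobs table.
type fakeStore struct {
	status   string
	onCommit func() // runs once the new status is durable
}

// updateStatus commits the status first and only then checks the caller's
// context, mimicking a response that is lost on its way back to the client.
func (s *fakeStore) updateStatus(ctx context.Context, status string) error {
	s.status = status // the "txn commit"
	if s.onCommit != nil {
		s.onCommit()
	}
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("result is ambiguous: %w", err)
	}
	return nil
}

// pauseWithResumeCtx issues the status change on the resumer's own context.
func pauseWithResumeCtx() (string, error) {
	resumeCtx, cancel := context.WithCancel(context.Background())
	defer cancel()

	store := &fakeStore{status: "running"}
	// Assumed behavior: the pause/cancel loop cancels the resumer context
	// as soon as it observes the committed pause-requested status.
	store.onCommit = cancel

	err := store.updateStatus(resumeCtx, "pause-requested")
	return store.status, err
}

func main() {
	status, err := pauseWithResumeCtx()
	// The status change is durable, yet the caller sees an error.
	fmt.Println(status, "|", err, "|", errors.Is(err, context.Canceled))
}
```

In this toy version the status is always `pause-requested` by the time the error is returned, which matches the observation that the txn commits before the client sees the failure.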

A long-term solution may involve passing a different context to state machine operations.
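
One way to read "a different context" is a context that keeps the values of the resume context but ignores its cancellation. The sketch below uses `context.WithoutCancel` from the Go standard library (Go 1.21+) for that. This is an assumption about the shape of a fix, not something the issue prescribes, and `commitThenReport` and `pauseWithDetachedCtx` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
)

// commitThenReport is a hypothetical state machine write: it "commits",
// runs the afterCommit hook, and then checks the context it was given.
func commitThenReport(ctx context.Context, afterCommit func()) error {
	afterCommit()
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("result is ambiguous: %w", err)
	}
	return nil
}

// pauseWithDetachedCtx performs the write on a context derived from the
// resume context that does not inherit its cancellation.
func pauseWithDetachedCtx() error {
	resumeCtx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// WithoutCancel keeps the parent's values but is never cancelled
	// when the parent is.
	opCtx := context.WithoutCancel(resumeCtx)

	// The hook cancels resumeCtx right after the commit, as the
	// pause/cancel loop would; opCtx is unaffected.
	return commitThenReport(opCtx, cancel)
}

func main() {
	fmt.Println(pauseWithDetachedCtx())
}
```

A detached context never expires on its own, so a real change would probably pair it with `context.WithTimeout` or tie it to a longer-lived parent so the operation cannot hang indefinitely.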

Jira issue: CRDB-41604

blathers-crl[bot] commented 2 months ago

Hi @msbutler, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] commented 2 months ago

cc @cockroachdb/disaster-recovery