Open msbutler opened 2 months ago
The `resumeContext` passed to `stepThroughStateMachine` initially here: https://github.com/msbutler/cockroach/blob/butler-pcr-mixed-version/pkg/jobs/adopt.go#L451
may get cancelled right after the state machine moves a job from running to pause requested here: https://github.com/msbutler/cockroach/blob/butler-pcr-mixed-version/pkg/jobs/registry.go#L1658
As soon as the job's status has moved to `PauseRequested`, even before the `PauseRequested()` func has returned to the client, the pause/cancel loop will cancel the resumer ctx here: https://github.com/msbutler/cockroach/blob/butler-pcr-mixed-version/pkg/jobs/adopt.go#L542
We observed this in some unit test flakes, e.g. https://github.com/cockroachdb/cockroach/issues/128745#issuecomment-2296786847
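Because the ctx is cancelled after the txn commits but before `PauseRequested()` returns to the client, `PauseRequested()` returns an error even though the status change was applied. This explains the error we see in the test flakes: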
pq: aborted in DistSender: result is ambiguous: context canceled
A long-term solution may involve passing a different context to state machine operations.
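
One possible shape for this, sketched under assumptions rather than taken from the jobs package: derive the state machine's context with `context.WithoutCancel` (Go 1.21+) so it keeps the parent's values but ignores its cancellation, and bound it with a timeout so it cannot run forever. `persistStatus`, `stateMachineCtx`, and `detachedWriteSurvivesCancel` are hypothetical helpers used only for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// persistStatus is a hypothetical stand-in for a state machine write; it
// fails if its context is already done.
func persistStatus(ctx context.Context, status string) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("writing %q: %w", status, err)
	}
	return nil
}

// stateMachineCtx returns a context that inherits values from parent but not
// its cancellation, bounded by its own timeout.
func stateMachineCtx(parent context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.WithoutCancel(parent), timeout)
}

// detachedWriteSurvivesCancel cancels the resumer context and then attempts
// the same write with both contexts.
func detachedWriteSurvivesCancel() (errResume, errDetached error) {
	resumeCtx, cancelResume := context.WithCancel(context.Background())
	smCtx, cancelSM := stateMachineCtx(resumeCtx, 5*time.Second)
	defer cancelSM()

	cancelResume() // the pause/cancel loop cancels the resumer ctx

	return persistStatus(resumeCtx, "paused"), persistStatus(smCtx, "paused")
}

func main() {
	errResume, errDetached := detachedWriteSurvivesCancel()
	fmt.Println("with resumer ctx:", errResume)
	fmt.Println("with detached ctx:", errDetached)
}
```

The trade-off is that a detached context no longer stops when the resumer is torn down, so the timeout (or some other explicit bound) has to take over that role.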
Jira issue: CRDB-41604
cc @cockroachdb/disaster-recovery