cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

jobs: remove re-resume delay #135825

Closed dt closed 22 hours ago

dt commented 1 day ago

Previously, the jobs system counted how many times a job had been resumed, as well as when it was most recently resumed, and would back off from resuming a job that had already been resumed many times. This behavior has routinely caused problems for several jobs, in particular those that run forever: such jobs can accumulate a large resume count simply by moving around a cluster as nodes restart, which is perfectly fine. If a given job wishes to hold off on executing for some reason, that decision should be up to the job. The jobs system should invoke the job's resumer so that it can make that decision on its own, rather than leaving a job that claims to be 'running', with a node holding its adoption claim, but that is not invoked on that node.

Release note: none. Epic: none.

asg0451 commented 1 day ago

Looks fine to me from the changefeed side

blathers-crl[bot] commented 1 day ago

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

stevendanna commented 1 day ago

do you remember why we added backoff in the first place? like, are there jobs that we should teach to back off?

Looking at some of the git history, at least one motivation was limiting the impact of crashing-bugs in jobs: https://github.com/cockroachdb/cockroach/issues/44594

dt commented 23 hours ago

TFTR!

bors r+

craig[bot] commented 22 hours ago

Build succeeded: