cockroachdb / cockroach

jobs: break up pts management poller #118512

Closed by msbutler 6 months ago

msbutler commented 7 months ago

The job system contains an internal poller that scans for and cancels active jobs holding a protected timestamp (pts) record older than a specified limit. We discovered in a support ticket that the poller uses a single txn to scan much of the job table, pts table, and job info table, and to cancel (i.e. update the job table row of) any job whose pts is too old. This long-running txn is prone to conflicting with other txns that touch the same key space. In the support ticket, we observed this txn being perpetually pushed, which deadlocked the job system. To fix this, we ought to break the txn into much smaller txns, perhaps one per job.
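To illustrate the proposed shape, here is a rough sketch, not the actual jobs code. It assumes a toy in-memory store in which each `txn` call stands in for one short transaction; `memStore`, `job`, `cancelStaleJobs`, and `runDemo` are all hypothetical names made up for this example. The idea is one read-only txn to list candidate job IDs, then one small txn per job that re-checks the job before cancelling it.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"sync"
	"time"
)

// job is a hypothetical, simplified stand-in for a row in the job table.
type job struct {
	id        int64
	ptsAge    time.Duration // age of the job's protected timestamp record
	cancelled bool
}

// memStore is a toy in-memory "database" (an assumption for this sketch).
// Each call to txn models one short transaction; the mutex stands in for
// transactional isolation.
type memStore struct {
	mu   sync.Mutex
	jobs map[int64]*job
}

func (s *memStore) txn(ctx context.Context, fn func(jobs map[int64]*job) error) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	return fn(s.jobs)
}

// cancelStaleJobs lists candidates in one read-only txn, then cancels each
// candidate in its own txn, so no single txn touches keys of many jobs.
func cancelStaleJobs(ctx context.Context, s *memStore, limit time.Duration) ([]int64, error) {
	var candidates []int64
	if err := s.txn(ctx, func(jobs map[int64]*job) error {
		for id, j := range jobs {
			if !j.cancelled && j.ptsAge > limit {
				candidates = append(candidates, id)
			}
		}
		return nil
	}); err != nil {
		return nil, err
	}
	sort.Slice(candidates, func(a, b int) bool { return candidates[a] < candidates[b] })

	var cancelled []int64
	var firstErr error
	for _, id := range candidates {
		didCancel := false
		err := s.txn(ctx, func(jobs map[int64]*job) error {
			didCancel = false // reset in case a real txn closure is retried
			j, ok := jobs[id]
			// Re-check: the job may have changed since the scan txn.
			if !ok || j.cancelled || j.ptsAge <= limit {
				return nil
			}
			j.cancelled = true
			didCancel = true
			return nil
		})
		if err != nil {
			// A failure on one job does not block the remaining jobs.
			if firstErr == nil {
				firstErr = err
			}
			continue
		}
		if didCancel {
			cancelled = append(cancelled, id)
		}
	}
	return cancelled, firstErr
}

// runDemo builds three jobs and cancels those with a pts older than 24h.
func runDemo() ([]int64, error) {
	s := &memStore{jobs: map[int64]*job{
		1: {id: 1, ptsAge: 48 * time.Hour},
		2: {id: 2, ptsAge: 1 * time.Hour},
		3: {id: 3, ptsAge: 72 * time.Hour},
	}}
	return cancelStaleJobs(context.Background(), s, 24*time.Hour)
}

func main() {
	cancelled, err := runDemo()
	fmt.Println(cancelled, err)
}
```

With the three demo jobs, only jobs 1 and 3 exceed the 24h limit, so those are the ones cancelled. The re-check inside each per-job txn matters because the scan and the cancel no longer share a snapshot; a job may have finished or released its pts in between.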

Jira issue: CRDB-35764

blathers-crl[bot] commented 7 months ago

Hi @msbutler, please add branch-* labels to identify which branch(es) this release-blocker affects.

stevendanna commented 7 months ago

I wonder how bad this transaction would have been if not in the presence of the other bugs we found in that support ticket. However, it still seems prudent to restructure this so that we aren't involving so many keys related to different jobs in the same transaction.

msbutler commented 6 months ago

closing, as https://github.com/cockroachdb/cockroach/pull/118979 has merged