jobs: add expiration for paused jobs

cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.

https://www.cockroachlabs.com

Other

30.16k stars 3.82k forks source link

jobs: add expiration for paused jobs #84598

Open amruss opened 2 years ago

amruss commented 2 years ago

Paused jobs can be easily forgotten about. Long-paused jobs can cause cluster issues, because of the accumulation of garbage that paused jobs can create. Additionally, supporting long-paused jobs through cluster upgrades is hard.

We should:

Expire paused jobs after 3 days of being paused, add expiration time as a column in jobs table
If a job is unpaused then paused, the expiration counter should reset

Rational behind 3 days: you have enough working-day time to have done something about this

Follow up work:

Make the default pause limit configurable (https://github.com/cockroachdb/cockroach/issues/84600)
Display expiration time in DB Console (https://github.com/cockroachdb/cockroach/issues/84599)
Add the ability to extend this time https://github.com/cockroachdb/cockroach/issues/85978

Jira issue: CRDB-17756

Epic CRDB-20422

amruss commented 2 years ago

Discussion: 10 days instead of 3

lunevalex commented 2 years ago

@amruss dont we recommend people do full backups once a day and incrementals throughout the day? In that case if I am going to do a full backup in less in a day, why would we want to have a busted incremental linger for 3 or even 10 days?

irfansharif commented 2 years ago

I find the suggestion of 3d already somewhat surprising given these PTS records are left around on the error path -- an accrual of up to 3d of MVCC garbage for the entire keyspan (full backups for ex.) can be debilitating storage wise, effects on read latencies, etc. We can improve MVCC garbage observability and and admission control for this work, and that's all good, but I still question anything longer than single digit hours here. Certainly there will be users wanting to accomodate an incident response time of >= 3d, which configurability could let them do, I'm just questioning whether it should be made the default. +cc @williamkulju, @mwang1026.

irfansharif commented 2 years ago

If we had some way of making PTS records dramatically cheaper WRT them holding up garbage collection, say doing something like https://github.com/cockroachdb/cockroach/issues/86697, then perhaps a multi-day expiration time would be palatable; the data being kept around is O(live-bytes). But today it's O(live-bytes * time), a compounding we should consider in the defaults we ship with. +cc @nvanbenschoten.

dt commented 2 years ago

Expiring a paused job is not intended to be the defense against mvcc garbage accumulation due to PTS records; note that garbage can accumulate for a variety of reasons other than a paused job (i.e. a non-paused job that is running slowly, is stuck, etc, or not even a job at all but rather a backup schedule that hasn't completed yet, etc), so the answer catching and fixing garbage accumulation is better kv obsv tools, like garbage bytes % alerts, displaying garbage amounts on the dbs and tables console page, etc.

Expiring paused jobs after 10 or 3 days is just there to catch something that is forgotten so it doesn't linger indefinitely, i.e. across major versions.

amruss commented 2 years ago

Moving this back to triage to discuss w/ cdc team

joshimhoff commented 1 year ago

SRE just discussed this. Can we execute on this ticket ~now and include it in a patch update, with a cluster setting disabled by default if needed that we can flip on in CC immediately? The pause job if backup fails behavior implies the need for SRE to do a lot of regular operational work (read: toil) to keep CC backups from holding up PTO and spending lots of disk unnecessarily. Re: my suggestion to fix this ~now, in CC at least, I think landing pause instead of fail in 22.1 without also landing expiration for paused jobs should be considered a regression.

dt commented 1 year ago

I'm not exactly sure how a patch release version of this would look, since we don't track when a job was paused right now, so existing, paused jobs don't reflect how long they've been paused. We'll need to start writing a new field at pause time to use for this, and any jobs that predate it will likely just be left alone?

joshimhoff commented 1 year ago

There is nothing in the system that at least indirectly keeps track of when a job went from running to paused? Isn't there a hearbeated at column, or something like that?

miretskiy commented 1 year ago

I'm not too sure I will wind up adding job expiration; but I will add protected timestamp expiration.

miretskiy commented 1 year ago

I'm going to close this issue as complete per above PR.

irfansharif commented 1 year ago

@miretskiy, @dt: am I correct in reading #97148 only applies to protected timestamps left by changefeeds? But we also have backups that can be paused, and the motivating incidents above happened to due to PTS records left behind when backups were paused. I'm going to re-open.

miretskiy commented 1 year ago

Sure; but we will not be working on other job implementations... Perhaps we can move this to DR...