amruss opened 2 years ago
Discussion: 10 days instead of 3
@amruss don't we recommend people do full backups once a day and incrementals throughout the day? In that case, if I am going to do a full backup in less than a day, why would we want a busted incremental to linger for 3 or even 10 days?
I find the suggestion of 3d already somewhat surprising given these PTS records are left around on the error path -- an accrual of up to 3d of MVCC garbage over the entire keyspan (full backups, for example) can be debilitating storage-wise, hurt read latencies, etc. We can improve MVCC garbage observability and admission control for this work, and that's all good, but I still question anything longer than single-digit hours here. Certainly there will be users wanting to accommodate an incident response time of >= 3d, which configurability could let them do; I'm just questioning whether it should be made the default. +cc @williamkulju, @mwang1026.
If we had some way of making PTS records dramatically cheaper WRT them holding up garbage collection, say doing something like https://github.com/cockroachdb/cockroach/issues/86697, then perhaps a multi-day expiration time would be palatable, since the data kept around would be O(live-bytes). But today it's O(live-bytes * time), a compounding we should consider in the defaults we ship with. +cc @nvanbenschoten.
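For a rough sense of that compounding, here's a back-of-the-envelope sketch (made-up churn numbers, not from any real cluster) of how garbage grows with the PTS hold duration:

```go
// Back-of-the-envelope illustration (not CockroachDB code): every byte
// rewritten or deleted under a protected timestamp stays on disk until the
// PTS record is released, so garbage grows linearly with the hold duration.
package main

import "fmt"

func main() {
	const rewriteRateMiBPerHour = 500.0 // assumed churn across the protected keyspan

	holds := []struct {
		name  string
		hours float64
	}{
		{"6h", 6}, {"3d", 72}, {"10d", 240},
	}
	for _, hold := range holds {
		garbageGiB := rewriteRateMiBPerHour * hold.hours / 1024
		fmt.Printf("hold %-4s -> ~%.0f GiB of un-collectable MVCC garbage\n", hold.name, garbageGiB)
	}
}
```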
Expiring a paused job is not intended to be the defense against mvcc garbage accumulation due to PTS records; note that garbage can accumulate for a variety of reasons other than a paused job (e.g. a non-paused job that is running slowly or is stuck, or not even a job at all but rather a backup schedule that hasn't completed yet), so the answer to catching and fixing garbage accumulation is better kv obsv tools, like garbage bytes % alerts, displaying garbage amounts on the dbs and tables console page, etc.
Expiring paused jobs after 10 or 3 days is just there to catch something that was forgotten so it doesn't linger indefinitely, e.g. across major versions.
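Roughly, the backstop I have in mind amounts to something like this (a minimal sketch with hypothetical types and a hypothetical `pausedAt` field, not the actual jobs registry code):

```go
// Sketch of a periodic scan that flags jobs sitting in PAUSED longer than a
// configurable maximum. Hypothetical types; a real version would run inside
// the jobs registry and move the job to a terminal state.
package main

import (
	"fmt"
	"time"
)

type job struct {
	id       int64
	status   string
	pausedAt time.Time
}

// expireLongPausedJobs returns the IDs of paused jobs older than maxPaused.
func expireLongPausedJobs(jobs []job, maxPaused time.Duration, now time.Time) []int64 {
	var expired []int64
	for _, j := range jobs {
		if j.status == "paused" && now.Sub(j.pausedAt) > maxPaused {
			expired = append(expired, j.id)
		}
	}
	return expired
}

func main() {
	now := time.Now()
	jobs := []job{
		{id: 1, status: "paused", pausedAt: now.Add(-96 * time.Hour)},
		{id: 2, status: "paused", pausedAt: now.Add(-2 * time.Hour)},
		{id: 3, status: "running"},
	}
	fmt.Println(expireLongPausedJobs(jobs, 72*time.Hour, now)) // [1]
}
```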
Moving this back to triage to discuss w/ cdc team
SRE just discussed this. Can we execute on this ticket ~now and include it in a patch update, with a cluster setting, disabled by default if needed, that we can flip on in CC immediately? The pause-job-if-backup-fails behavior means SRE has to do a lot of regular operational work (read: toil) to keep CC backups from holding up PTO and consuming lots of disk unnecessarily. Re: my suggestion to fix this ~now, in CC at least: I think landing pause-instead-of-fail in 22.1 without also landing expiration for paused jobs should be considered a regression.
I'm not exactly sure how a patch-release version of this would look, since we don't currently track when a job was paused, so existing paused jobs don't reflect how long they've been paused. We'll need to start writing a new field at pause time to use for this, and any jobs that predate it would likely just be left alone?
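Something like this is what I mean by leaving pre-existing jobs alone (a sketch with a hypothetical `pausedAt` field, not the real jobs table schema):

```go
// Sketch of the compatibility concern: only jobs paused after the new field
// exists carry a pause time, so the expiration check skips records where the
// field is unset (the Go zero value stands in for "never written").
package main

import (
	"fmt"
	"time"
)

type pausedJob struct {
	id       int64
	pausedAt time.Time // zero value for jobs paused before the field was added
}

func shouldExpire(j pausedJob, maxPaused time.Duration, now time.Time) bool {
	if j.pausedAt.IsZero() {
		return false // predates the field: leave it alone
	}
	return now.Sub(j.pausedAt) > maxPaused
}

func main() {
	now := time.Now()
	fmt.Println(shouldExpire(pausedJob{id: 1}, 72*time.Hour, now))                                      // false: no pause time recorded
	fmt.Println(shouldExpire(pausedJob{id: 2, pausedAt: now.Add(-100 * time.Hour)}, 72*time.Hour, now)) // true
}
```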
There is nothing in the system that at least indirectly keeps track of when a job went from running to paused? Isn't there a heartbeated-at column, or something like that?
I'm not sure I will wind up adding job expiration, but I will add protected timestamp expiration.
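To illustrate the distinction (a minimal sketch with hypothetical types, not the actual pts package API): expire the PTS record once it's older than some maximum age, without touching the paused job itself:

```go
// Sketch: release only protected timestamp records that have been holding up
// GC for longer than maxAge; the owning (possibly paused) job is left alone.
// How that job recovers when it resumes is a separate policy question.
package main

import (
	"fmt"
	"time"
)

type ptsRecord struct {
	id         string
	recordedAt time.Time // when the record started holding up GC
}

// staleRecords returns the IDs of records older than maxAge.
func staleRecords(records []ptsRecord, maxAge time.Duration, now time.Time) []string {
	var out []string
	for _, r := range records {
		if now.Sub(r.recordedAt) > maxAge {
			out = append(out, r.id)
		}
	}
	return out
}

func main() {
	now := time.Now()
	recs := []ptsRecord{
		{id: "changefeed-123", recordedAt: now.Add(-5 * 24 * time.Hour)},
		{id: "backup-456", recordedAt: now.Add(-1 * time.Hour)},
	}
	fmt.Println(staleRecords(recs, 72*time.Hour, now)) // [changefeed-123]
}
```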
I'm going to close this issue as complete per above PR.
@miretskiy, @dt: am I correct in reading that #97148 only applies to protected timestamps left by changefeeds? But we also have backups that can be paused, and the motivating incidents above happened due to PTS records left behind when backups were paused. I'm going to re-open.
Sure; but we will not be working on other job implementations... Perhaps we can move this to DR...
Paused jobs can be easily forgotten about. Long-paused jobs can cause cluster issues because of the garbage they accumulate. Additionally, supporting long-paused jobs across cluster upgrades is hard.
We should:
Rationale behind 3 days: it leaves enough working-day time to have done something about the paused job
Follow-up work:
Jira issue: CRDB-17756
Epic: CRDB-20422