backup: schedule PTS can advance too far in rare race with full backup

dt commented 1 month ago

PTS chaining for backup schedules works by having full backups make and drop PTS records while the incrementals that run between fulls just move the existing PTS up as they run.

For example, the full backup on Monday completes and creates PTS1. It records PTS1 in the incremental schedule so that every new incremental backup taken from then on, i.e. those that are appending to monday's full, will move the PTS1 up to the new time that is backed up by that inc backup.

When the Tuesday full backup completes, it'll delete PTS1 and create PTS2, recording PTS2 in the inc schedule so that new inc backups -- those that now extend the Tuesday chain -- will move PTS2 up as they go.

When each inc backup starts, it stores the ID of the PTS from the schedule that started it; when it finishes, that is the PTS that it will try to move up. However, if while that inc backup was running, the next full completed, the inc backup might find that by the time it completes the PTS it was supposed to update no longer exists. This is fine! It just means that there's a new chain with a new PTS now, that future inc backups will depend on and update, but that this was the last inc backup to use the old one and it can just exit quietly without moving anything up if it isn't there now.

Sounds great, right? Like, we obviously thought about the race condition of the full finishing while the inc backup was running, and handled it correctly, right?

... right?

Not quite.

When an inc backup starts up, it does a couple of things in Resume.

One is to determine where it will actually backup to/from, i.e. resolve previous backups and pick a chain to append to/determine the start time: https://github.com/cockroachdb/cockroach/blob/f3afb3c0dc8c1879286cae8979a7b8f90a24b173/pkg/ccl/backupccl/backup_job.go#L619

After that it does a little setup - drop a lockfile, check some stuff, etc -- and then it reads the schedule that created it to determine what PTS it will move forward when it completes: https://github.com/cockroachdb/cockroach/blob/f3afb3c0dc8c1879286cae8979a7b8f90a24b173/pkg/ccl/backupccl/backup_job.go#L728-L732

Unfortunately, some amount of time passes between these two points, where we determine which chain we will extend and then which PTS we will move.

So while we showed above that we handl the race condition where a full backup completes while an incremental is still backing up, that is only true if it happens while it backing up, but not while it is starting up, i.e. somewhere between these two lines of code. Specifically, if the full backup finishes after the inc backup resolves what it will extend, then the inc backup will have chosen to extend yesterday's full, but if that full then finishes after that but still before the inc reads its schedule record, a few lines later, to determine what PTS it will update when it in turn completes, then the inc backup may incorrectly, after picking yesterday's chain to extend, see the PTS ID of the new chain and record that as the PTS that it will move. There is very little that is done between these two points in the setup phase -- mostly just writing a nearly empty metadata file to the backup bucket to lock out any concurrent backups -- but it is a short, non-zero amount of time where a full completing would cause the inc to move the wrong PTS.

If multiple incremental backups have run since the full began (e.g. a full that takes, say, an hour and inc every 10min), an inc backup on the old chain that moves the new PTS could move that PTS to a time well ahead of the the time the full backed up. If the cluster uses a very short TTL, this then could mean that an inc backup to the new chain, from that new full, is no longer possible as the time that full completed to is no longer protected when the old chain's last inc backup incorrectly moved the PTS past it.

To fix this, we should be recording the PTS that the inc job will advance in its job record earlier -- ideally during job creation, in the same txn that read the schedule that is creating the job, but if not then, at least before we resolve the destination.

Again though, the window where a full would need to complete is very narrow here, so this race should be

blathers-crl[bot] commented 1 month ago

cc @cockroachdb/disaster-recovery

irvnet commented 1 month ago

Hi Team, I have a customer asking what release this might be available in.. any insight? / cc: @Leeeeeeeroy-Jenkins @Lukens4242

benbardin commented 1 month ago

We don't know yet, thank you for asking. I think we'll have a clearer idea of when we can schedule this work next week.

cockroachdb / cockroach

backup: schedule PTS can advance too far in rare race with full backup #128013