cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com

restore: handle online restore failures after the linking phase #118283

Open msbutler opened 5 months ago

msbutler commented 5 months ago

If an online restore fails after the linking phase, the restore job coordinator will need to roll back the keyspace to a pre-restore state, even if the user has already begun a foreground workload. As described in this design doc, the restore coordinator will need to send some sort of ClearRange-like request that instructs Pebble to compact away all data written by the restore job. We also need to ensure that all reads in the reverting keyspace fail quickly.

Jira issue: CRDB-35662

blathers-crl[bot] commented 5 months ago

Hi @msbutler, please add branch-* labels to identify which branch(es) this GA-blocker affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] commented 5 months ago

Hi @msbutler, please add branch-* labels to identify which branch(es) this release-blocker affects.


stevendanna commented 5 months ago

If we end up shipping with an "online" portion of online restore, then it seems to me that we might have to differentiate the case where we've published online descriptors and have possibly accepted user writes in that keyspan. After this point, I don't think we want to issue any command that might delete user data? I think at that point perhaps the only thing we'll be able to do is log.Fatal and provide some recovery tools.

If we haven't published descriptors, it would be nice if we could make the normal GC job capable of handling this. That is, it would be nice to lay down our normal deletes and for those to be enough even in the face of an unavailable file covered by the delete. Perhaps there is some reason that this isn't possible.

That said, once we've linked a bad SST, at the moment it seems like we are off to the races and just hoping we can clean up before something fails: who is to say we are the first thing that will touch that span? Perhaps some KV queue gets to it first.
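The distinction drawn above can be sketched as a decision function. This is purely illustrative (the names `chooseRollback`, `actionGCDeletes`, and `actionFatal` are made up, not CockroachDB code): before descriptors are published the keyspan contains only restore-written data, so ordinary GC-style range deletes are safe; after publication, user writes may be interleaved, so destructive cleanup is off the table.

```go
package main

import "fmt"

// rollbackAction is a hypothetical sketch of the two cleanup strategies
// discussed above; it does not reflect actual CockroachDB internals.
type rollbackAction int

const (
	// actionGCDeletes lays down normal range deletes over the restored
	// span, as the regular GC job would.
	actionGCDeletes rollbackAction = iota
	// actionFatal halts (log.Fatal) and defers to out-of-band recovery
	// tools, because deleting the span could destroy user data.
	actionFatal
)

func chooseRollback(descriptorsPublished bool) rollbackAction {
	if descriptorsPublished {
		// User writes may already exist in the keyspan; any command
		// that deletes the span risks deleting user data.
		return actionFatal
	}
	// Only restore-written data is present, so normal deletes suffice,
	// provided they work even over an unavailable linked file.
	return actionGCDeletes
}

func main() {
	fmt.Println(chooseRollback(false) == actionGCDeletes)
	fmt.Println(chooseRollback(true) == actionFatal)
}
```

The open question from the thread remains whether the `actionGCDeletes` path can be made robust when a delete covers an unavailable (badly linked) SST, and whether anything else, such as a KV queue, might touch the span first.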

blathers-crl[bot] commented 4 months ago

cc @cockroachdb/disaster-recovery

dt commented 3 months ago

We said we were happy having this be a debug tool we add later; unmarking as blocker.