backube / volsync

Asynchronous data replication for Kubernetes volumes
https://volsync.readthedocs.io
GNU Affero General Public License v3.0

Option to NOT run backup as soon as the replicationsource is applied to the cluster #627

Open onedr0p opened 1 year ago

onedr0p commented 1 year ago

Describe the feature you'd like to have.

Hi 👋🏼

Please add a configuration option to not have the backups run when the replicationsource is applied to the cluster.

What is the value to the end user? (why is it a priority?)

When bootstrapping a new cluster using GitOps the replicationsource is applied and will start backing up data right away. This isn't really ideal since I want to recover from a previous backup.

How will we know we have a good solution? (acceptance criteria)

An option is added to prevent backups being taken as soon as a replicationsource is applied to the cluster.

Additional context

budimanjojo commented 1 year ago

Maybe implementing Point In Time Recovery would solve this problem too, if that can be implemented in Volsync. But it requires the backup method to support incremental backups (I believe restic does?) so we can restore the data using a timestamp instead of specifying which backup to restore.

But I'm also fine with having an option not to back up on creation, like this issue describes, if that is too hard to implement.

tesshuflower commented 1 year ago

@onedr0p If you're looking for something you can do right now - you could take a look at spec.paused = true in the replicationsource spec.

Something like:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: my-repl-source
spec:
  paused: true
  sourcePVC: mypvc
...

What this will do is still create a job for the replicationsource backup; however, the job will run with parallelism = 0, so no actual pod will run until you remove paused from the spec or set it to false.

However, be aware that if you're using a copyMethod of Snapshot or Clone, the snapshot/clone will still be taken immediately - so when you unpause later on, this first backup will still be of the (empty?) PVC as it was when you initially created the replicationsource.

onedr0p commented 1 year ago

@tesshuflower that isn't really ideal because I would have to edit a ton of ReplicationSources in a GitOps repo, setting paused: true when restoring my clusters and then updating them all again to paused: false once the clusters are bootstrapped and the apps are restored. Also, like you said, since I am using copyMethod: Snapshot, once unpaused the restic repo will end up with a backup of stale data that I don't need.

I am curious about your thoughts on whether there could be a global option for the controller to handle this?

helm values

    # Whether to run a backup immediately when a ReplicationSource is created
    backupImmediately: false
    manageCRDs: true
    metrics:
      disableAuth: true

With that option set I could always do a restic restore using previous: 1 because it's predictable. Currently I use previous: 2 when restoring a whole cluster from a previous state, but I have to wait for the ReplicationSources to finish backing up the data I don't need, or I might not get the latest backup.
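
For reference, previous is a field in the restic section of a ReplicationDestination; a direct-mode restore looks roughly like this (a minimal sketch, all names are placeholders):

    apiVersion: volsync.backube/v1alpha1
    kind: ReplicationDestination
    metadata:
      name: frigate-restore                # placeholder name
    spec:
      trigger:
        manual: restore-once               # run a single restore when applied
      restic:
        repository: frigate-restic-secret  # assumed Secret holding the restic repo config
        destinationPVC: frigate-data       # restore directly into the existing PVC
        copyMethod: Direct
        # number of the newest snapshots to skip when picking the one to restore
        previous: 2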

JohnStrunk commented 1 year ago

It could be implemented as a global option that causes CRs w/o a status to get populated w/ a lastSyncTime=now(). That would delay the syncing to the next scheduled time. It could also be added to the scheduling portion of the CR to accomplish the same thing.
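
Roughly, a freshly created CR would then start out with a status along these lines (illustrative values only, assuming a nightly schedule):

    status:
      lastSyncTime: "2024-01-10T15:04:05Z"   # set to now() at creation instead of left empty
      nextSyncTime: "2024-01-11T03:00:00Z"   # so the first actual sync waits for the schedule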

I'm not sure that all users on a cluster would necessarily agree about what the create-time behavior should be (a drawback of a controller-level option). There could also be problems where the schedule fires around the same time as the object is created, leading to a delay that isn't long enough.

I'm wondering if there's perhaps another way to accomplish the sequencing you want. Could you describe your setup a bit more?

onedr0p commented 1 year ago

@JohnStrunk

I run a kubernetes cluster at home with GitOps/FluxCD and sometimes find myself nuking the cluster and re-provisioning it. This means I use volsync to back up and restore the data of my stateful applications when that happens.

For example I use Frigate for my NVR and it has a sqlite database I need backed up. In this folder I have volsync.yaml which is the replicationsource written declaratively and managed by Flux.

Now let's say I nuke my cluster. When I go back to re-provision it, Flux will deploy Frigate and also apply that replicationsource, kicking off a backup of data that I don't care about.

Now on the restore side, I have a pretty wild taskfile (think makefile but in YAML) that does a bunch of steps to restore the data: suspend the Flux resources, scale down the deployments or statefulsets, delete the stale data in the PVC (https://github.com/backube/volsync/issues/555), and apply the replicationdestination to restore the data. When that is all done, the taskfile resumes the Flux resources and scales everything back up, and everyone is happy, including my wife 😄

onedr0p commented 1 year ago

@JohnStrunk It would be kind of neat to handle this the way cloudnative-pg does, with an option like .spec.immediate on the ReplicationSource:

https://cloudnative-pg.io/documentation/1.19/backup_recovery/#scheduled-backups
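
To illustrate, on the VolSync side it could look something like this (the immediate field below is purely hypothetical - it does not exist today and is only meant to sketch the idea):

    apiVersion: volsync.backube/v1alpha1
    kind: ReplicationSource
    metadata:
      name: frigate                  # placeholder name
    spec:
      sourcePVC: frigate-data        # placeholder PVC
      trigger:
        schedule: "0 3 * * *"
        # hypothetical field: when false, skip the sync that would otherwise
        # run at creation time and wait for the first scheduled sync instead
        immediate: false
      restic:
        repository: frigate-restic-secret
        copyMethod: Snapshot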

I can try to PR this if you are happy with that solution, any pointers on implementation details you would like to see would be welcomed. Thanks!

tesshuflower commented 1 year ago

@onedr0p I was discussing this with @JohnStrunk and he had an idea that might give you a more robust solution, assuming we understand your use-case correctly.

If we simply skip the 1st sync on creation we still have the issue that depending on when you deploy all your CRs, it might still try to do a backup (before your app has started) if the cron schedule happens to be around that time.

We had thought you were perhaps creating a replicationdestination in direct mode to your PVC to restore your data at the same time as creating a replicationsource with that same PVC as the sourcePVC. If this is the case, there are some potential options with the new volumepopulator feature that might help and avoid the timing issue I mentioned above.

However re-reading the above, are you hoping simply to skip the 1st sync while your app deploys for the first time?

onedr0p commented 1 year ago

> If we simply skip the 1st sync on creation we still have the issue that depending on when you deploy all your CRs, it might still try to do a backup (before your app has started) if the cron schedule happens to be around that time.

Generally for me that would likely never happen. I backup at night/early morning when I would never be redeploying my cluster.

> However re-reading the above, are you hoping simply to skip the 1st sync while your app deploys for the first time?

Yes that would be ideal, I've seen other backup tools offer this option to skip the first backup and then only run on schedule so it would be nice to have an option for this here too.

tesshuflower commented 1 year ago

> If we simply skip the 1st sync on creation we still have the issue that depending on when you deploy all your CRs, it might still try to do a backup (before your app has started) if the cron schedule happens to be around that time.

> Generally for me that would likely never happen. I backup at night/early morning when I would never be redeploying my cluster.

> However re-reading the above, are you hoping simply to skip the 1st sync while your app deploys for the first time?

> Yes that would be ideal, I've seen other backup tools offer this option to skip the first backup and then only run on schedule so it would be nice to have an option for this here too.

@onedr0p I guess we were assuming you needed to delay the 1st sync because you were putting down all the yamls at once, including a replicationdestination that would restore a PVC at the same time as a replicationsource that would start backups of that same PVC.

Is the above not really necessary in your setup? In other words, you are looking to deploy a new empty PVC and app, and just want the backups to start, but not until the 1st scheduled sync (when presumably your app is up and has written initial data).

onedr0p commented 1 year ago

> I guess we were assuming you needed to delay the 1st sync because you were putting down all the yamls at once, including a replicationdestination that would restore a PVC at the same time as a replicationsource that would start backups of that same PVC.

I described my full cluster restore process in this comment (with the downsides and why this feature would be nice to have), so I am not sure whether you need me to put this into better words, but I'll try.

My current process for full cluster backup/restore

  1. Force a backup using this task (this ensures the latest data has been backed up)
  2. Verify all apps were backed up and then destroy the cluster.
  3. Re-deploy my cluster and apps (the only volsync resources applied here are the ReplicationSources), and then restore the apps using this task. I do not apply the ReplicationDestinations with GitOps/Flux since doing that would make the process much harder.

The downsides to this process are that during step (3), as soon as a ReplicationSource is applied, it kicks off a backup of the initial app data that I don't need (this creates cruft in the restic repo), and that there is no way to wipe the PVC before restore, which I have this task for and opened a feature request for here.

Maybe we're having a hard time communicating this over text? I am down for a quick voice chat to get anything cleared up if needed.

JohnStrunk commented 1 year ago

Your use case seems like something we should support, and I'd like to be able to do it in a robust way w/o extensive scripting on the user's part. Would the following help?

If you use the volume populator that was recently added, during cluster restore you could:

This should cause:

The downside is that you'll get an initial backup that is identical to the one you just restored. However, I don't think the above will require external sequencing (or concern over the timing relative to the cronspec).

onedr0p commented 1 year ago

I'm definitely interested in giving the volume populator method a try, although I'm curious how that will work with statefulsets and volume claim templates. Do you think it will also cover that use case?

tesshuflower commented 1 year ago

I'll try to explain it a bit - I'm supposed to be writing documentation for the volume populator, so that should be coming soon - for now, here's a high-level overview:

The way the volume populator works is that you create a PVC with a replicationdestination as its dataSourceRef. The volsync volume populator then populates this PVC with the contents of the backup. You would need to use copyMethod: Snapshot in your replicationdestination.
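
Roughly, the PVC side looks like this (names here are just examples):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: frigate-data                # example app PVC name
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
      dataSourceRef:                    # hands population off to the VolSync volume populator
        apiGroup: volsync.backube
        kind: ReplicationDestination
        name: frigate-restore           # the replicationdestination doing the restore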

Like John mentioned, this does mean the 1st backup will probably still contain the same contents as what was just restored.

When it comes to statefulsets, you may just need to be careful to name your PVC whatever the statefulset expects, but statefulsets re-use existing PVCs on startup, so I don't think it should be an issue.

onedr0p commented 1 year ago

@tesshuflower that makes sense, I suppose this issue can wait until that feature is released and documented. I'll circle back then and give the new method a shot instead of my hacky scripts. Thanks for taking the time to explain all that; I look forward to trying it out.

tesshuflower commented 1 year ago

@onedr0p I have a PR up with my 1st pass at documenting the volume populator if you want to take a look: https://volsync--833.org.readthedocs.build/en/833/

If you have comments/suggestions about the content itself, feel free to add comments directly in the PR: https://github.com/backube/volsync/pull/833

onedr0p commented 1 year ago

Overall it looks good, but I am not sure how this helps when you completely nuke your cluster and then want to re-provision it, since the VolumeSnapshots won't exist on the new cluster. The data will only exist in the restic repo.

tesshuflower commented 1 year ago

@onedr0p on a re-provisioned cluster it should be something like this:

Now the PVC will remain in a Pending state until the ReplicationDestination has finished and has created a latestImage (i.e., pulled the data down from your restic repo and written a VolumeSnapshot). Once the replicationdestination is done, the PVC will be populated with the contents of that VolumeSnapshot and will become ready to use, at which point the pod for the app that's trying to mount the PVC can start.
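
The replicationdestination the PVC waits on would be something like this (example names only; the restic repository Secret has to be recreated on the new cluster first):

    apiVersion: volsync.backube/v1alpha1
    kind: ReplicationDestination
    metadata:
      name: frigate-restore
    spec:
      trigger:
        manual: restore-once               # one-off restore
      restic:
        repository: frigate-restic-secret  # assumed Secret pointing at the restic repo
        copyMethod: Snapshot               # required so a latestImage (VolumeSnapshot) gets created
        capacity: 10Gi
        accessModes: ["ReadWriteOnce"]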

onedr0p commented 1 year ago

That sounds awesome, I'll be sure to give it a shot once the new version of volsync is released!

onedr0p commented 11 months ago

Volume Populator is indeed awesome and it really simplifies bootstrapping a cluster after a disaster. Thanks ❤️

While this is a low priority, I think having the feature described here would still be nice, because once the Volume Populator runs and the app comes online, a backup is made immediately, so there's still a needless backup being made.

tesshuflower commented 11 months ago

Understood - I think the main hesitation at the moment about implementing something like this is that we essentially don't want to break the existing scheduling with changes - and there's still no guarantee you won't make this backup earlier than you'd like, depending on your schedule.