elemental-lf / benji

Benji Backup: A block-based deduplicating backup software for Ceph RBD images, iSCSI targets, image files and block devices
https://benji-backup.me

Coordination with Velero #149

alexander-bauer commented 1 year ago

Hi @elemental-lf, to cut to the chase, I'm looking for help while I write some tooling. I'm hoping to put together something robust and useful enough that other folks can take advantage of it as well, and I want to make sure I'm on the right track before I invest too much time.

Background and Motivation

I have an existing (homelab) Kubernetes cluster that is ultimately backed by Ceph, managed by Rook. The storage is attached to the same execution nodes. Right now, I use Velero to schedule and manage my backups. Velero is a tool largely aimed at making Kubernetes resources restorable to the same or other clusters, and is not especially interested in backups per se: it can be made to create VolumeSnapshots (and VolumeSnapshotContents) using CSI, but doesn't muck with exporting those snapshots.

I think that approach makes a lot of sense for managed Kubernetes instances, or instances backed by a robust storage medium with its own replication capabilities. Indeed, it would be great if I had a couple dozen hosts to handle replication at the Ceph level.

As far as off-medium (and off-site) backups go, it's a lot more cost-effective to run a standalone Minio pod backed by an external drive, and call that "archival object storage." (Especially with something like rclone to Backblaze B2 for off-site replication.)

Benji strikes me as well-organized and well-regarded, and as operating at the perfect layer to fill in the gap for me.

Where the Existing Tooling Clashes

As far as I'm able to tell, a typical scheduled Velero backup grabs copies of most Kubernetes resources (as returned by the API) and serializes them to S3. PVCs are special: Velero optionally injects a command into the attached pod (such as fsfreeze), asks CSI to create a VolumeSnapshot, waits for it to complete (and for the VolumeSnapshotContent to be available), injects another pod command, and then serializes the VolumeSnapshot and VolumeSnapshotContent objects to the archive.
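To make the flow above concrete, here is a minimal sketch of the two objects Velero creates and waits on, modelled as plain dicts. The shapes follow the `snapshot.storage.k8s.io/v1` API; the names (`demo-pvc`, the snapshot class) are made up for illustration, and the actual create/poll calls against the API server are deliberately left out:

```python
# Sketch of the CSI snapshot objects in Velero's PVC flow. Dict shapes
# follow the snapshot.storage.k8s.io/v1 API; names are illustrative.

def make_volume_snapshot(name, namespace, pvc_name, snapshot_class):
    """Build the VolumeSnapshot manifest that CSI is asked to fulfil."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }

def snapshot_is_ready(snapshot):
    """True once the driver has bound a VolumeSnapshotContent and marked the
    snapshot usable, i.e. the point at which the pod can be unfrozen and both
    objects serialized to the archive."""
    status = snapshot.get("status", {})
    return bool(status.get("readyToUse")) and "boundVolumeSnapshotContentName" in status
```

In a real script the polling loop would fetch the object via the Kubernetes API (e.g. the kubernetes Python client's `CustomObjectsApi`) and call `snapshot_is_ready` on each iteration.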

Of course, the VolumeSnapshotContent contains a reference to the saved data on the underlying storage layer (in this case, RBD), but that data is not replicated to archive.

Benji is very well positioned to take that backup model and shore up the final piece: replicating the underlying RBD snapshot to archival storage.

The clash with the existing tooling is that the existing scripts seem tuned for Benji as the primary backup provider, responsible for freezing the filesystems, taking the snapshots, and managing their lifecycles.

Proposition (i.e. please help me do this)

I think that there's no fundamental disagreement here, just a need for a purpose-built script. Ideally, it'd be one general and robust enough to be included in Benji's standard distribution, with associated documentation to help out anyone who may be following down the same path that I am.

So: I want to write a script along the lines of the existing backup_pvc.py, which crawls through existing VolumeSnapshot objects and ensures that the corresponding VolumeSnapshotContents (or at least the ones on RBD) are replicated to archival storage.
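A hypothetical sketch of that crawl, assuming the default Ceph CSI RBD driver name (`rbd.csi.ceph.com`). It operates on already-fetched API objects as plain dicts, so the listing step (e.g. via the kubernetes Python client) is left out and the matching logic stays self-contained:

```python
# Pair each bound VolumeSnapshot with its VolumeSnapshotContent and keep
# only the ones provisioned by the Ceph RBD CSI driver. Dict shapes
# mirror the snapshot.storage.k8s.io/v1 API.

RBD_DRIVER = "rbd.csi.ceph.com"  # assumed default Ceph CSI RBD driver name

def rbd_snapshot_handles(snapshots, contents):
    """Yield (namespace/name, snapshotHandle) for ready, RBD-backed snapshots."""
    by_name = {c["metadata"]["name"]: c for c in contents}
    for snap in snapshots:
        status = snap.get("status", {})
        content = by_name.get(status.get("boundVolumeSnapshotContentName", ""))
        if not content or not status.get("readyToUse"):
            continue  # not yet bound or not ready; nothing to replicate
        if content["spec"].get("driver") != RBD_DRIVER:
            continue  # skip CephFS or other drivers
        handle = content.get("status", {}).get("snapshotHandle")
        if handle:
            meta = snap["metadata"]
            yield f"{meta['namespace']}/{meta['name']}", handle
```

Each yielded `snapshotHandle` encodes which Ceph snapshot backs the content; a full script would map that to an RBD image/snapshot and hand it to Benji for replication.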

I think this should be as simple as:

Where I expect to find issues:

Thanks for reading! Sasha

elemental-lf commented 1 year ago

Thanks for the write-up, Sasha. I actually thought that Velero was also taking care of the actual volume content, but live and learn.

vriabyk commented 1 year ago

Hi people, we are interested in this feature too.

Currently we are facing an issue when using Benji for backing up k8s PVCs (Ceph CSI). Benji creates an RBD snapshot in Ceph and then uses it to create further incremental backups. But when someone deletes a k8s PVC, Ceph CSI successfully deletes the PVC/PV objects from k8s while the RBD image gets stuck in the RBD trash, because Ceph can't delete an image which has snapshots. When the number of such images in the trash grows to 1025 (not sure if that is a configurable limit or not), Ceph CSI stops provisioning new PVCs at all.
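For context, one manual workaround is to restore the stuck image out of the trash, purge its snapshots, and then delete it for real. Below is a sketch that only builds the `rbd` CLI invocations (it assumes the current `rbd trash restore --image` syntax; the pool, image id, and restored name are illustrative, and actually running the commands, e.g. with `subprocess.run`, is left to the caller):

```python
# Build the rbd CLI commands needed to clean one image stuck in the RBD
# trash because it still has (Benji-created) snapshots. Illustrative only:
# commands are returned, not executed.

def trash_cleanup_commands(pool, image_id, restored_name):
    """Return the rbd commands: restore from trash, purge snapshots, delete."""
    return [
        # pull the image back out of the trash under a temporary name
        ["rbd", "trash", "restore", "--image", restored_name, f"{pool}/{image_id}"],
        # remove the snapshots that blocked deletion in the first place
        ["rbd", "snap", "purge", f"{pool}/{restored_name}"],
        # now a normal delete succeeds instead of parking the image in trash
        ["rbd", "rm", f"{pool}/{restored_name}"],
    ]
```

This is exactly the kind of lifecycle handling that would go away if the snapshots were managed through VolumeSnapshotContent, as proposed below.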

I created a feature request for the Ceph CSI developers, asking them to implement at least an option to remove RBD snapshots before deleting the k8s objects. But they don't want to do so.

Therefore we are looking for a way to work around it. If Benji works with VolumeSnapshotContent instead of accessing Ceph directly, then Ceph CSI should clean up everything properly and we won't face stuck images in the trash.

@elemental-lf please let me know if this is clear and if you need any more details or assistance.

elemental-lf commented 1 year ago

@vriabyk my plan is to pull the whole workflow of making a snapshot, getting the differences, and then doing the actual backup into Argo Workflows. That way it would be easier to extend the workflow to use VolumeSnapshotContent. The base for this just landed in master. Currently it's an almost 1:1 translation of the old CronJob-based setup, but you could start from there.