Optimising archive updates

GrahamCobb commented 8 years ago

This really isn't an issue, it is a question. Feel free to close it if you don't want this cluttering up your issues section.

I have been thinking about whether it is possible to automatically optimise updating archive copies. By this I mean sending only the minimum set of changes needed to bring the archive up to date with the current disk contents.

In the case I am looking at, I do regular btrbk snapshots and backups (to another disk, but online on the same server). I also make occasional archive copies for offsite backup. I tend to do those using rsync, so they can easily be updated whenever I feel like it by copying only the changed files.

I was wondering whether a script could use a btrfs snapshot to create the archive disk and then later, when the archive is next updated, just send the differences between the previous snapshot and a current snapshot, without me (or the script) having to keep a reliable record or plan ahead (i.e. using only information in the archive snapshot itself).

I suppose the source disk would have to have kept a copy of the previous snapshot (or else it would end up having to work out the most recent parent that does still exist on both the archive disk and the source disk and use that!).

Is this even possible?

digint commented 8 years ago

> This really isn't an issue, it is a question. Feel free to close it if you don't want this cluttering up your issues section.

That's fine, and allows other watchers to profit from it. I simply flag this issue as "question".

Looks like you're asking for btrbk archive, which has existed since v0.23.0. As an example, let me explain how I archive my precious backups. I have an (online) backup disk, which gets btrbk backups daily, with the following settings:

target_preserve_min    no
target_preserve        60d 52w  *m

Now, I always fear total data loss, so I also have an (offline) archive disk in a safe in the basement. I want it to have a copy of all monthly backups (plus the last 30 daily backups, because this does not really hurt), so I also configure:

archive_preserve_min   latest
archive_preserve       30d      *m

Once in a while, I connect the archive disk, and run:

# btrbk archive /mnt/online_backup_disk /mnt/archive_disk

This incrementally copies all the required backup subvolumes to my archive disk. All I need to make sure is that at least one archived backup is still present on my (online) backup disk on the next run. This is not a problem here, as I keep monthly backups forever (and will only carefully delete them manually if the disk runs out of space).

Note that btrbk archive is meant to make archives from a backup disk, and NOT from a source disk. The point here is that you probably don't want to keep old snapshots lying around on your source.

Note that this feature is still flagged experimental, basically because I have not tested it for all possible setups. I would be happy to get feedback if you get it to work. Also make sure to have a Linux kernel >=4.4 if you want to use this.
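To give a feel for what a policy such as `archive_preserve_min latest` with `archive_preserve 30d *m` selects, here is a deliberately simplified Python sketch. It is not btrbk's implementation: the function `select_preserved` is hypothetical, it assumes at most one backup per day, and it treats the first backup seen in each calendar month as the "monthly" one. btrbk's real retention rules are more detailed (see btrbk.conf(5)).

```python
from datetime import date, timedelta
from typing import Iterable, List


def select_preserved(backup_dates: Iterable[date], today: date,
                     daily_days: int = 30) -> List[date]:
    """Simplified model of 'preserve_min latest' + 'preserve 30d *m'.

    Assumption (not btrbk's exact rule): the "monthly" backup is the
    first backup found in each calendar month.
    """
    dates = sorted(backup_dates)
    cutoff = today - timedelta(days=daily_days)
    keep = set()
    first_in_month = {}
    for d in dates:
        if d > cutoff:
            keep.add(d)  # inside the daily window: keep every day
        # remember the earliest backup of each (year, month)
        first_in_month.setdefault((d.year, d.month), d)
    keep.update(first_in_month.values())  # "*m": keep monthlies forever
    if dates:
        keep.add(dates[-1])  # "preserve_min latest": always keep the newest
    return sorted(keep)


# Daily backups from 2024-01-01 to 2024-03-31 (91 days in a leap year).
all_backups = [date(2024, 1, 1) + timedelta(days=i) for i in range(91)]
kept = select_preserved(all_backups, today=date(2024, 3, 31))
```

In this example the result contains the 30 most recent dailies plus the first backup of January, February and March; everything else would be a candidate for deletion on the archive.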

GrahamCobb commented 8 years ago

Thanks for the answer. I have a few more questions about how btrbk archive works...

1. Does btrbk archive minimise the data transfer by sending incrementals for all the snapshots? So if, for example, I have Jan, Feb and Mar backups on the archive disk and then update the archive in June, it would send the Mar-Apr diffs, the Apr-May diffs and the May-June diffs? I think that is what you said in your answer -- just checking I understood correctly.
2. Does btrbk archive work out for itself what is the most recent common snapshot on both the backup and archive disks, without having to keep track in some database? If so, how does it do it? Is there some sort of UID in the snapshots on the backup and archive disks which will match?
3. If I want to copy the archive disk but still allow btrbk archive to be able to update the new copy, I presume I have to use btrfs send/btrfs receive to do the copy, to preserve whatever information is being used in the previous answer. The scenario I have in mind is possibly having the archive sitting on a cloud server, but I might want to initially get the data loaded into the cloud by sending a physical disk for them to copy (and then using btrbk archive to update it remotely).
4. Lastly, can you think of any way to have the archive disk encrypted (but not have the source or the local backup encrypted)?

digint commented 8 years ago

1. Yes, exactly. To be more precise: btrbk always creates incrementals from the "latest common" subvolumes. To satisfy the preservation policy (archive_preserve, archive_preserve_min), it compares what is already present on the target with what is needed there, and sends incrementals from the "latest common" subvolume on both sides, which it finds by comparing uuid and received_uuid.
2. btrbk archive works by itself, without even needing a configuration. All it takes from the config (if present) is the preserve policy (archive_preserve, archive_preserve_min). Without a config, it assumes the default archive_preserve_min all, which copies all subvolumes. So no, there is no database, only the uuid <-> received_uuid relationships from the filesystem. Play around with --dry-run -l debug to see more detail.
3. From the user's perspective, an archive is a 1:1 copy. A subvolume transferred with send/receive always keeps the same received_uuid (as of kernel 4.4). btrbk is stateless by design, with no database or extra files. It will recognize the subvolumes by their received_uuid, no matter whether you transfer a disk or send/receive it.
4. Create the btrfs filesystem on the archive disk on top of dm-crypt. This works very well: I have used dm-crypt/LUKS with btrfs here without problems for almost two years. Arch Linux has good documentation for this: https://wiki.archlinux.org/index.php/Dm-crypt
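
The uuid/received_uuid matching described in answers 1 and 2 can be illustrated with a small Python model. This is only a sketch under stated assumptions, not btrbk's actual code: the `Subvol` class and `plan_transfers` function are hypothetical, the source list is assumed to be ordered oldest to newest, a subvolume is identified by its received_uuid if it has one and by its uuid otherwise, and the preserve policy is ignored (every missing subvolume is sent).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Subvol:
    name: str
    uuid: str
    received_uuid: Optional[str] = None

    @property
    def identity(self) -> str:
        # Assumption: a received subvolume keeps its received_uuid when
        # it is sent on again; a locally created one is known by its uuid.
        return self.received_uuid or self.uuid


def plan_transfers(source: List[Subvol],
                   target: List[Subvol]) -> List[Tuple[Optional[str], str]]:
    """Return (parent_name, subvol_name) pairs; parent None = full send."""
    archived = {t.received_uuid for t in target if t.received_uuid}
    plan = []
    parent = None
    for sv in source:  # oldest first
        if sv.identity in archived:
            parent = sv  # already on the target: usable as a parent
            continue
        plan.append((parent.name if parent else None, sv.name))
        parent = sv  # once sent, it can be the parent of the next one
    return plan


# Backup disk holds Jan..Jun; archive disk holds Jan..Mar.
backup = [Subvol(f"home.2024{m:02d}", f"backup-{m}", f"snap-{m}")
          for m in range(1, 7)]
archive = [Subvol(f"home.2024{m:02d}", f"archive-{m}", f"snap-{m}")
           for m in range(1, 4)]

for parent_name, name in plan_transfers(backup, archive):
    print(parent_name, "->", name)
# home.202403 -> home.202404
# home.202404 -> home.202405
# home.202405 -> home.202406
```

No state is stored anywhere in this model: the plan is derived purely from the identifiers found on both sides, which mirrors the "no database" point above. If no subvolume on the source matches anything on the target, the first transfer has no parent and would have to be a full send.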

digint / btrbk

Optimising archive updates #90