borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

files cache: index by physical extents to support reflinks & snapshots #2743

Open alphazo opened 7 years ago

alphazo commented 7 years ago

XFS has implemented some exciting new (experimental) features such as reflink, which allows instant CoW snapshots similar to those on btrfs. I don't think there is a plan to support send/receive commands as on ZFS, so the dedup function is pretty much limited to the local filesystem. I'm planning to use the following scheme for my external USB drive, which holds my photos and which I usually back up to a second drive or network storage using borg. This applies to both btrfs and the new reflink-enabled XFS.

So the borg-snap1 snapshot will contain all the snapshots I took while away from home, plus the working directory. But since borg doesn't know about the reflink feature, it will rescan every file in every snapshot on my photo drive, thinking they are new files. It will ultimately find the corresponding known dedup blocks, so it will effectively not copy each btrfs/xfs pseudo-snapshot over again. I tried it: my photo directory plus many snapshots of the same pictures took up pretty much the size of the photo directory alone in the borg snapshot, which is great. I was wondering if there could be a new feature in borg to detect such reflink-enabled filesystems (btrfs/xfs), so that it would immediately know that a file found in a btrfs/xfs directory is a duplicate of an existing known one and therefore reuse the same dedup blocks.

enkore commented 7 years ago

According to [1] there is no way to find physical extents (the backing element of reflinks) without either risking data corruption (when btrfs compression is used) or writing code that parses btrfs data structures. Apart from that it could be incorporated into the files cache (key := id-hash(NUL || "physical-extents" || extent-descriptor...)).
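To make the key idea concrete, here is a minimal sketch of how such a files cache key might be derived. It is not borg's actual code: the function name `extent_cache_key`, the `(physical_offset, length)` extent descriptor, and the use of HMAC-SHA256 as a stand-in for the id-hash are all assumptions made for illustration.

```python
import hashlib
import hmac
import struct
from typing import Iterable, Tuple

# Hypothetical extent descriptor: (physical_offset, length), both in bytes.
Extent = Tuple[int, int]


def extent_cache_key(id_key: bytes, extents: Iterable[Extent]) -> bytes:
    """Build a cache key of the form id-hash(NUL || "physical-extents" || extents...).

    HMAC-SHA256 stands in for the repository's id-hash here (an assumption).
    """
    msg = b"\0" + b"physical-extents"
    for physical, length in extents:
        # Fixed-width little-endian encoding keeps the descriptor unambiguous.
        msg += struct.pack("<QQ", physical, length)
    return hmac.new(id_key, msg, hashlib.sha256).digest()
```

Two files whose data lives in the same physical extents would map to the same key, so the second one could be looked up in the cache without being read. A real implementation would presumably also have to distinguish devices, since physical offsets are only meaningful within one filesystem.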

[1] https://www.spinics.net/lists/linux-btrfs/msg60845.html

alphazo commented 7 years ago

Would the new reverse mapping (rmapbt) support on xfs be of any help for identifying CoW files?

https://lwn.net/Articles/695290/ https://lwn.net/Articles/659677/

enkore commented 7 years ago

I don't see a problem on XFS, apart from this being a rather fickle business overall.

It's the btrfs issue I linked to above that seems problematic to me (there needs to be a reliable way to detect compressed files/extents to work around it). coreutils hints at problems with ext4, though the comments are old. Maybe, possibly fixed.

I'm going to be straight here and say that this won't be implementable casually. I'd estimate that implementing this will take 1-2 developer weeks, i.e. quite an effort.

alphazo commented 7 years ago

Understood. Thanks for taking the time to answer. Those new XFS features (reflink + rmapbt) are still marked as Experimental anyway. I'm going to give them a try in a controlled environment and see how they perform. By the time XFS reflink goes primetime, more people might express a need for such a feature in borg. I find it a good balance between btrfs, with its flaky RAID support, and ZFS, which is not available directly in the kernel. Cheap snapshots + borg for real dedup backup is probably going to be my next workhorse.

alphazo commented 7 years ago

Some more pointers from the xfs folks:

borgbackup will probably need to call the GETFSMAP ioctl, which won't land until 4.12. On xfs, rmapbt is needed to supply data block ownership info, which is what borgbackup (and bees, and...) say they want to be smarter about dedup.

https://www.spinics.net/lists/linux-xfs/msg08128.html

jcharaoui commented 6 years ago

With the release of Linux 4.16, the XFS reflink feature is no longer tagged EXPERIMENTAL.

alphazo commented 6 years ago

@jcharaoui Thanks for pointing this out. I have used those features for nearly a year on my photo hard drive and haven't seen any problems. I believe the rmapbt feature is also no longer experimental (I'm running Linux 4.16.9), since I no longer see those red warnings when mounting my external drive, which has both reflink and rmapbt enabled.

srd424 commented 4 years ago

Interested in this while watching the first borg backup of a btrfs-based container pool take forever :) Note that duperemove claims to check shared extents when working out whether to hash files ..
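For anyone wanting to experiment, below is a rough sketch of listing a file's extents through the Linux FIEMAP ioctl and flagging which ones are shared or encoded. FIEMAP is not named in this thread (the xfs folks point to GETFSMAP, which is a different ioctl), so treating it as the relevant interface is an assumption. The names `ExtentInfo`, `parse_fiemap` and `file_extents` are made up for illustration, and the ioctl number assumes the common encoding used on x86 and ARM.

```python
import fcntl
import struct
from typing import List, NamedTuple

FS_IOC_FIEMAP = 0xC020660B       # _IOWR('f', 11, struct fiemap) on x86/ARM
FIEMAP_FLAG_SYNC = 0x1           # flush dirty data before mapping
FIEMAP_EXTENT_LAST = 0x1         # last extent of the file
FIEMAP_EXTENT_ENCODED = 0x8      # data is encoded (e.g. compressed)
FIEMAP_EXTENT_SHARED = 0x2000    # extent is shared with another file

HEADER = struct.Struct("=QQIIII")     # struct fiemap header, 32 bytes
EXTENT = struct.Struct("=QQQQQIIII")  # struct fiemap_extent, 56 bytes


class ExtentInfo(NamedTuple):
    logical: int
    physical: int
    length: int
    shared: bool
    encoded: bool


def parse_fiemap(buf: bytes) -> List[ExtentInfo]:
    """Decode a struct fiemap buffer as filled in by the kernel."""
    mapped = HEADER.unpack_from(buf, 0)[3]  # fm_mapped_extents
    result = []
    for i in range(mapped):
        fields = EXTENT.unpack_from(buf, HEADER.size + i * EXTENT.size)
        logical, physical, length, flags = fields[0], fields[1], fields[2], fields[5]
        result.append(ExtentInfo(
            logical, physical, length,
            bool(flags & FIEMAP_EXTENT_SHARED),
            bool(flags & FIEMAP_EXTENT_ENCODED),
        ))
    return result


def file_extents(path: str, max_extents: int = 256) -> List[ExtentInfo]:
    """Ask the kernel for up to max_extents extents of the file at path."""
    buf = bytearray(HEADER.size + max_extents * EXTENT.size)
    # fm_start=0, fm_length=max, sync flag, room for max_extents entries.
    HEADER.pack_into(buf, 0, 0, 0xFFFFFFFFFFFFFFFF, FIEMAP_FLAG_SYNC,
                     0, max_extents, 0)
    with open(path, "rb") as f:
        fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf)  # fills buf in place
    return parse_fiemap(bytes(buf))
```

Calling `file_extents("/path/to/file")` on a filesystem that supports FIEMAP should return one entry per extent; files with more than `max_extents` extents are silently truncated in this sketch. The `encoded` flag is relevant to the btrfs compression concern raised earlier, since physical offsets of encoded extents may not be directly comparable.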

charles-dyfis-net commented 4 years ago

https://gist.github.com/charles-dyfis-net/bfb0e30862f04957d020afe0ff8b093b may be of interest to those here -- invoking xfs_io to reflink together identical chunks of content Borg has identified.

Not maintained, not recently tested, not documented at time of development and use, very much YMMV.

srd424 commented 4 years ago

For my use case I'm now investigating https://github.com/systemd/casync, which has btrfs reflink support (don't know if it works on xfs.) I hit a few bugs, but worked out fixes for a couple (systemd/casync#235, systemd/casync#237 - now merged) and found work-arounds for the other two (systemd/casync#239, systemd/casync#240.)

charles-dyfis-net commented 4 years ago

> For my use case I'm now investigating https://github.com/systemd/casync, which has btrfs reflink support (don't know if it works on xfs.) I hit a few bugs, but worked out fixes for a couple (systemd/casync#235, systemd/casync#237 - now merged) and found work-arounds for the other two (systemd/casync#239, systemd/casync#240.)

At the risk of plugging a project I'm a contributor to, I strongly suggest also looking into https://github.com/folbricht/desync. casync may have gotten better over time, but back when desync was started, casync's error handling was atrocious; and desync very much does presently support reflinks when content already exists in another, local .caibx.

srd424 commented 4 years ago

I'd looked at desync and thought it didn't support reflinking, but it seems I might be mistaken .. will revisit! casync .. does have some issues.

charles-dyfis-net commented 4 years ago

> I'd looked at desync and thought it didn't support reflinking, but it seems I might be mistaken .. will revisit! casync .. does have some issues.

Depending on when you looked it may not have; but it most definitely does today. See https://github.com/folbricht/desync/blob/4a8700c059471d5f005dd7c9a957072bb1fa5c8a/fileseed.go#L123-L126

srd424 commented 4 years ago

Yup, just found that - I only looked the other day but I think I'd mis-parsed the beginning of the README. Going off topic here, but quickly - desync doesn't seem to support ACLs in the catar stream, but does now implement xattrs - does that mean it can restore ACLs from catar streams that it generates itself?

charles-dyfis-net commented 4 years ago

> Yup, just found that - I only looked the other day but I think I'd mis-parsed the beginning of the README. Going off topic here, but quickly - desync doesn't seem to support ACLs in the catar stream, but does now implement xattrs - does that mean it can restore ACLs from catar streams that it generates itself?

Couldn't say. I'm a regular user and occasional contributor, but I don't use that particular functionality. Frank does have a gitter chat room, though -- I'd suggest asking in there.