Add new --reflink-dest option to be used instead of --link-dest

jrw commented 3 years ago

When using --link-dest a "new" hardlink is created when it is determined that a "prior" file is the "same" as the source file being transferred to the destination location. I would like a new option "--reflink-dest" which works the same as --link-dest, except that a reflink is created from the "prior" file instead of hardlink, and the metadata from the source file is copied to the new file where possible.

Reflinks are so much better than hardlinks for many use cases. For example, when the files are hardlinked, a change to the prior file also changes the new file. Also, any attributes which are allowed to be different between the source file and the prior file (for them to be considered the "same") are necessarily lost when creating hardlinks, since both the new file and the prior file must have the exact same metadata. Because reflinks are independent files which simply share COW file data, not metadata, a subsequent change to the prior file would not change the reflinked file, nor would the reflinked file have to share metadata with the prior file, other than that required by rsync's algorithm to consider them the "same", which triggers the reflink process in the first place.
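The hardlink drawback described above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name and file names are made up for the example, and it covers the hardlink side only, since a reflink needs a filesystem that supports cloning.

```python
import os

def hardlink_shares_everything(workdir):
    """Return (content seen via the link, same inode?, mode seen via the link)."""
    prior = os.path.join(workdir, "prior.txt")
    new = os.path.join(workdir, "new.txt")
    with open(prior, "w") as f:
        f.write("original")
    os.link(prior, new)          # hardlink: two names for one inode

    # Overwrite the data in place through the "prior" name.
    with open(prior, "w") as f:
        f.write("changed")
    os.chmod(prior, 0o600)       # metadata lives on the shared inode as well

    with open(new) as f:
        content = f.read()
    same_inode = os.stat(prior).st_ino == os.stat(new).st_ino
    mode = os.stat(new).st_mode & 0o777
    return content, same_inode, mode
```

Called on an empty scratch directory, the function reports that the "new" name sees both the changed content and the changed permissions, because there is only one inode behind the two names.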

I would expect --reflink-dest to be implemented in this way: When the situation is detected where --link-dest would create a hardlink between the prior and new files, instead (when using --reflink-dest) create a new empty file, reflink (FICLONE) the new file to the prior file, and update the new file with the source file's metadata. In particular, creating a reflink changes the mtime of a file, so the mtime will need to be updated at a minimum.
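The three steps proposed above (create an empty file, FICLONE it from the prior file, apply the source's metadata) could be sketched in Python roughly as follows. This is not rsync code: the function name is hypothetical, the FICLONE value assumes the common Linux ioctl encoding, and the fallback to a plain copy is an added assumption so the sketch also runs on filesystems without reflink support.

```python
import fcntl
import os
import shutil

# FICLONE from <linux/fs.h>, i.e. _IOW(0x94, 9, int). The numeric value
# assumes the common Linux ioctl encoding (e.g. x86-64, arm64).
FICLONE = 0x40049409

def clone_with_source_metadata(prior, new, source):
    """Create `new` sharing data with `prior`, then apply `source` mode/times.

    Returns True if the data was reflinked, False if it was copied instead.
    """
    reflinked = True
    with open(prior, "rb") as src, open(new, "wb") as dst:
        try:
            # The destination fd receives the ioctl; the argument is the source fd.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
        except OSError:
            # Filesystem (or platform) cannot clone: fall back to a plain copy.
            reflinked = False
            shutil.copyfileobj(src, dst)
    st = os.stat(source)
    os.chmod(new, st.st_mode & 0o7777)
    # `new` was just created, so its mtime is "now"; restore the source's times.
    os.utime(new, ns=(st.st_atime_ns, st.st_mtime_ns))
    return reflinked
```

Ownership, ACLs and extended attributes are left out here; a real implementation would apply whatever attributes the user asked rsync to preserve.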

WayneD commented 3 years ago

See work in progress:

https://github.com/WayneD/rsync-patches/blob/master/clone-dest.diff

Sounds like mtimes are not currently handled correctly.

jrw commented 3 years ago

I can imagine a couple scenarios:

cp can already handle making reflinks. But rsync is like a more flexible version of cp. So, one scenario is to use rsync like a more flexible cp.
rsync --link-dest is like cp --link, but with more flexibility and the ability to update the hardlinked files from another copy. This is very useful for making "incremental" backups but with an image of the entire backup tree available for browsing. However, it has some drawbacks. For example, modifying a hardlinked file in the backup tree would modify all the hardlinked copies.
The hypothetical rsync --reflink-dest suggested above, would be similar to rsync --link-dest, but without any of the drawbacks of hardlinks. However, ignoring the rsync "same file" algorithm used by rsync --link-dest (as --clone-dest does) would not have the same effect. That's why I suggest that the --reflink-dest option should be exactly equivalent to --link-dest, except that reflinks should be used whenever hardlinks would be used by --link-dest.

In reality, what I really want is online deduplication, but that's still missing from btrfs. Even --reflink-dest described above would not reflink a file if the mtime changed but the file contents did not (because the rsync link-dest algorithm takes mtime into account).

BTW, the current manpage description of --link-dest is somewhat confusing to me. It says:

The files must be identical in all preserved attributes (e.g. permissions, possibly ownership) in order for the files to be linked together.

But I believe that the "quick check" file size and mtime are always taken into consideration in determining if a file can be hardlinked, even if mtime is not preserved. Perhaps the --link-dest description should mention that the quick check algorithm is used and additionally preserved attributes must be identical. (I do understand that the quick check algorithm can be modified by options like --size-only.)
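To make the suggested wording concrete, here is a simplified Python model of the decision as described in this comment: quick check first (size and mtime), then the preserved attributes. It is a sketch of the described behavior, not a transcription of rsync's source; the function name, the whole-second mtime comparison, and the reduction of "preserved attributes" to permissions are all assumptions made for the example.

```python
import os

def looks_same(src, prior, preserve_perms=True, size_only=False):
    """Model of 'quick check plus preserved attributes' for two paths."""
    a, b = os.stat(src), os.stat(prior)
    # Quick check, part 1: sizes must match.
    if a.st_size != b.st_size:
        return False
    # Quick check, part 2: mtimes must match unless --size-only style
    # behavior is requested (compared at whole seconds here, an assumption).
    if not size_only and int(a.st_mtime) != int(b.st_mtime):
        return False
    # Preserved attributes: only permissions are modeled in this sketch.
    if preserve_perms and (a.st_mode & 0o7777) != (b.st_mode & 0o7777):
        return False
    return True
```

In this model, two files with equal size and permissions but different mtimes are only treated as the same when `size_only=True`, which mirrors the point that mtime matters even when it is not a preserved attribute.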

nanosparks commented 3 years ago

Can you also include APFS clonefile(2) support in --clone-dest? Also, it would be great if rsync could preserve reflinks/clones from the source in the destination. If the source volume allows reflinks or clones but the destination volume doesn't, it would be great to have the option to replace reflinks/clones in the source with hardlinks in the destination.

sinyb commented 3 years ago

I often have to rsync very large files (100's of GB, even a few TB per file, qcow2 images) over the Internet.

I found out by trial and error that the best option is not to use --inplace (because new data can get inserted in the middle of the file, after which the checksums become invalid and everything else gets copied over).

Without --inplace, rsync creates a new temporary file, copies unchanged blocks from the existing file and inserts any new data. At the moment this is about 10 to 100 times faster than copying over the Internet, but still takes quite some time because we all know that copying from a disk to the same (rotating) disk is much slower than the disk's maximum read or write performance.

Now, what I would propose is to reflink unchanged blocks (on supported filesystems, which now includes my favorite XFS), and copy only changed blocks. I am aware that this might require that rsync know many more details about filesystem internals (like fs type, block size, etc), but this would speed up the whole process considerably, and also allow to use SSD even for backups, because there would not be as many writes required for a full copy of the image (I already have SSD at the source side as the VM storage). Fall-back to actual copy if filesystem is not compatible.

As I see it, this would have so many advantages with no drawbacks that I'd like it to be default behavior, but an option to turn it on would also be fine.

Thank you!

mgutt commented 2 years ago

Please consider XFS as well.

I really like to create reflink file copies as follows:

rsync -a --reflink-dest=/mnt/xfsvolume2/backup/monday /mnt/xfsvolume1/vm/ubuntu/ /mnt/xfsvolume2/backup/tuesday

WayneD commented 2 years ago

Try the clone-dest patch linked above. It uses the FICLONE ioctl on Linux, which should presumably work with whatever FS supports cloning.

xiota commented 2 years ago

My understanding is the clone-dest patch isn't merged because "Nobody has tested and reported back." Is there anything in particular that needs to be tested? Is there some number of people you would want using the patch before merging?

I've tried the patch with XFS and Btrfs, and it seems to work as expected. I'm reasonably comfortable updating my aliases and backup scripts to use it. On ext4, which doesn't support reflinks, it errors out, which is reasonable. But it would be nice to be able to fallback to hardlinks or normal copy.

sinyb commented 2 years ago

Please take a look at my comment here: https://github.com/WayneD/rsync/issues/119#issuecomment-1252653055. In short, this works fine for whole files, just as --link-dest does, but the real strength of reflinks is having files that are mostly same, but with some changed blocks (VM images).

xiota commented 2 years ago

the real strength of reflinks is having files that are mostly same, but with some changed blocks (VM images).

That seems more like a copy-on-write feature than reflink-proper. Once the reflink is created, COW filesystems will do their thing. Reflinks are useful for making snapshots, so that the active file can change without changing the snapshot, as would happen with a hardlink.

To increase usage / testing exposure, I've created AUR packages rsync-reflink and rsync-reflink-git.

sinyb commented 2 years ago

the real strength of reflinks is having files that are mostly same, but with some changed blocks (VM images).

That seems more like a copy-on-write feature than reflink-proper. Once the reflink is created, COW filesystems will do their thing.

Reflinks are useful for making snapshots, so that the active file can change without changing the snapshot, as would happen with a hardlink.

Yes, and I am using them just in that way: create snapshot of VM in a few seconds, eliminating downtime or reducing it to seconds or minutes, not hours, then rsync-ing to remote server. Currently, I am using "rsync --backup --no-whole-file" to minimize transfer size, but the whole image still has to be copied on destination, and that is slowing things down a lot compared to just reflinking same blocks.

Here is a report from one representative copy:

Number of files: 1 (reg: 1)
 . . .
Total file size: 171,825,299,456 bytes
Total transferred file size: 171,825,299,456 bytes
Literal data: 15,073,280 bytes
Matched data: 171,810,226,176 bytes
  . . .
File list generation time: 1,577.409 seconds     (????)
File list transfer time: 0.000 seconds
Total bytes sent: 20,320,621
Total bytes received: 11,799,062

sent **20,320,621** bytes  received 11,799,062 bytes  4,356.69 bytes/sec
total size is 171,825,299,456  speedup is 5,349.53

Start time: Fri Sep 30 23:08:31 End time: Sat Oct 1 01:11:23

It took ~2 hours to copy a 170GB image and apply 20MB of changes (which got transferred in ~4 seconds total at 70Mbps uplink). Destination disks are 2 WD Gold HDDs in MD-RAID mirror. Local copy works at about 90-100MB/s. The whole process could be done in less than 30 minutes using reflinks (file still needs to be read). I could also use SSDs to speed up the process even more, because there would not be unnecessary wear on them.

By the way, I have tried first "copy --reflink" of the destination, then using "rsync --inplace" on that copy, but that was not network efficient, because as soon as there was a change in file, rsync would transfer the rest of file, which took ages to finish...

My final goal is to have multiple copies of the image in the remote location, with just the differences using up additional space, just as they do in the main location after snapshot.

mgutt commented 2 years ago

Something which came into my mind: Let's say we have a folder with 1000 files (real files which consist of multiple fragments) and we create our first backup (on the same partition to keep the fragments):

cp -r --reflink=always /source /full_backup

Now we leave the files unchanged and create two additional copies to compare the speed of creating reflinks and hardlinks:

rsync --archive --reflink-dest=/full_backup /source /reflink_backup

rsync --archive --link-dest=/full_backup /source /hardlink_backup

Idea: Maybe it is faster, to create hardlinks of unchanged fragmented files. If this is the case, we should think about merging the reflink feature into the existing --link-dest option and only create reflinks, if the file has changed, while unchanged files get a simple hardlink. Reflinks should then be created by default, without an additional option, if the filesystem supports them, and could be disabled by a --no-reflink option.

PS if you are unsure that your /source contains files with fragments, then use this command:

find /source -type f -exec filefrag -k {} + | grep -v "1 extent found"

xiota commented 2 years ago

Maybe it is faster, to create hardlinks...

You forgot to account for caching. Whichever command you run first will be faster.

... only create reflinks, if the file has changed, while unchanged files get a simple hardlink...

Reflinks and hardlinks have different uses. Reflink contents can diverge, while hardlink contents remain in sync.

You're also mixing up reflinks with copy on write. Not all file systems that support reflinks support copy on write.

mgutt commented 1 year ago

You forgot to account for caching. Whichever command you run first will be faster.

Then run sync; echo 1 > /proc/sys/vm/drop_caches before running the commands.

Reflinks and hardlinks have different uses. Reflink contents can diverge, while hardlink contents remain in sync.

I know. That's why I think it could be faster using hardlinks instead of reflinks if the files did not change.

You're also mixing up reflinks with copy on write

Reflinks are copy-on-write?! The first step is to link on the file extents, but after changes are made to one or more extents, they are copied to a new position (this is how COW works). Of course XFS is not a COW filesystem by default, but it uses COW to allow the usage of --reflink.

xiota commented 1 year ago

Copy on write is the mechanism by which reflinks are broken when the contents diverge. The reflink itself is not copy on write. Creating a reflink, where contents have not yet diverged, does not involve the copy on write mechanism. The overhead difference between creating a reflink vs a hardlink amounts to some metadata differences. The minuscule performance increase is not worth the semantic confusion that would result from replacing reflinks with hardlinks, when the user explicitly wants reflinks.

vontrapp commented 1 year ago

Now, what I would propose is to reflink unchanged blocks (on supported filesystems, which now includes my favorite XFS), and copy only changed blocks. I am aware that this might require that rsync know many more details about filesystem internals (like fs type, block size, etc), but this would speed up the whole process considerably, and also allow to use SSD even for backups, because there would not be as many writes required for a full copy of the image (I already have SSD at the source side as the VM storage). Fall-back to actual copy if filesystem is not compatible.

As I see it, this would have so many advantages with no drawbacks that I'd like it to be default behavior, but an option to turn it on would also be fine.

I 100% agree with this and think this is the right way to implement a reflink feature in rsync. If a --reflink-dest is still desirable in addition then that can be good too, but it brings some limitations that are not optimal, such as only reflinking if all the attributes match and not reflinking at all if a file is modified. Yet a great strength of reflinks shows exactly when a file is modified but still largely the same.

So, rsync does the delta algorithm, finds matching blocks in a pre-existing file and copies the blocks to the new temporary file it is building to match the source. With reflinks enabled (and this really could be default-with-fallback) sequences of blocks are reflinked instead of copied and efficient incremental backups become automatic and essentially free with rsync when using a supported filesystem.

vontrapp commented 1 year ago

I've begun looking into this and there are some nuances to deal with, firstly that block alignment matters.

rsync will take a sender file that has 1 byte added at the beginning, and will copy the entire reference file with a 1 byte offset to the new receiver file. This is of course great, but this or a similar scenario would thwart all reflink savings for that file. This is fine but may be 'unexpected' behavior to a user that gets used to reflink savings and sees those savings disappear in some cases. This also means that code to handle cloning from one file to another will need to check what block alignment is required and that cloned blocks are aligned, if aligned do the clone for the fully aligned blocks, and do map_file copies for sections that are not aligned. Note that this is length aligned (full blocks) and offset aligned in both files. On the angle of user surprise, rsync could output the reflink saving statistics as well and in a similar way to network saving statistics. This would assure the user that a reflink feature is active and show that rsync is aware and possibly expecting that certain reflink savings could not be had.

It appears that, at least in xfs and btrfs, the same extents cannot be reflinked twice in the same file. So this will also need to be checked or, if an error condition can be detected, retried with normal copy.

vontrapp commented 1 year ago

Doing some testing, I've proved the opposite for at least xfs: it does allow the same extents to be referenced multiple times in the same file. The python function refuses to accomplish this using only the same file (write x bytes, then link x bytes of offset x in same file). I don't know if that's a limitation of the underlying ioctl at this point, but if it's actually allowed that makes inplace transfers better.

I'm going to play with some C sample programs to test the edge cases, then I'll look at making a pr on rsync. This is my current plan and please anyone with familiarity of the rsync codebase chime in if there's better ways or I'm missing anything.

There's a copy_file function in fileio.c, this one seems straightforward enough to modify. Just copy the file one way, then fall back to another way, possibly calling the same copy_file_range that will be used for the other case - which has benefits even outside of reflink supporting filesystems.

The other case is in the receiver, when receiving a matched range from the sender. First it checks if it's an 'inplace' (or like) destination and if the sender offset and the destination offset are equal, and if so calls 'skip_matched'. If not 'inplace' then the receiver goes on to directly do a file_map copy to the destination.

First, the skip_matched could potentially benefit from reflinking the matched blocks to the same file if already seen earlier in the file. I would think this is worth pursuing but would probably require some extra bookkeeping, but maybe the sender already gives sufficient information for this? Perhaps in the case of a second range of matched the sender gives the first offset for the source and the second offset for the destination? If so, no change to this portion.

Then, conditional #ifdef HAVE_FILE_COPY_RANGE and call do_copy_file_range which does the following:

Check if the offsets are block co-aligned (the difference between the offsets is a multiple of the block alignment size)
- If not, call copy_file_range on whole block. Check the written bytes returned and write any remainder bytes, potentially calling copy_file_range again
- If the offsets are block co-aligned, first do a write to get fully block aligned, probably also using copy_file_range
- Then copy_file_range the span of fully aligned blocks
- Then copy_file_range the remainder, if any

The reason for breaking it up like that is that in my testing (so far only with python bindings) the behavior is: if the first block in the range is aligned, copy all the sequential aligned blocks by reflink. If there's a remainder portion that is not block aligned (e.g. only a portion of the existing source block is requested into the destination) then it silently drops it (not completely silent, as the return value of bytes written will be less by the amount not written). If the first block in the range is NOT aligned, then the entire range is written without any reflinking. If the source and destination are the same file descriptor, it throws an error (I don't think this is correct and it may only be the python doing this).
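The head / aligned span / remainder split described above comes down to a little offset arithmetic, sketched here in Python. The function name and the 4096-byte default block size are assumptions for the example; real code would have to query the filesystem's actual clone alignment.

```python
def split_for_clone(src_off, dst_off, length, block=4096):
    """Split a matched range into (head, aligned, tail) byte counts.

    `head` and `tail` have to be copied normally; only `aligned` is a
    candidate for reflinking. If the two offsets are not co-aligned,
    the whole range is returned as `head` and nothing can be reflinked.
    """
    if length <= 0:
        return (0, 0, 0)
    # Co-aligned means both offsets sit at the same position within a block.
    if (src_off - dst_off) % block != 0:
        return (length, 0, 0)
    # Bytes needed to reach the next block boundary (0 if already on one).
    head = min((-dst_off) % block, length)
    # Whole blocks that fit after the head.
    aligned = ((length - head) // block) * block
    tail = length - head - aligned
    return (head, aligned, tail)
```

For example, a 10000-byte match at source offset 4196 and destination offset 100 splits into a 3996-byte head, one 4096-byte block that could be reflinked, and a 1908-byte tail, while a match shifted by a single byte (source offset 1, destination offset 0) yields no reflinkable span at all.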

vontrapp commented 1 year ago

Confirmed behavior of copy_file_range with a test C program. If the start of the range is not fully aligned to a block in both src and dst, even if other blocks within the range are aligned, then the whole range is not reflinked. So we definitely want to take care of any pre-writes that achieve aligned blocks at the start of a range.

Also same as python, if the last of the range is not a full block or are not equal partial blocks, then the last (partial) block is not written at all, the returned bytes written reflects this. Edit: this does not hold true in the above case where a copy, not a reflink, is taking place. In that scenario all bytes in the range are copied.
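Because copy_file_range can return fewer bytes than requested, any caller needs a loop that checks the return value and finishes the rest some other way. A minimal Python sketch of such a loop follows; the function name, the chunk size, and the choice to fall back to pread/pwrite are assumptions for illustration, not what an rsync patch would necessarily do.

```python
import os

def copy_range(src_fd, dst_fd, src_off, dst_off, length, chunk=1 << 20):
    """Copy `length` bytes between file descriptors; return bytes copied."""
    done = 0
    use_cfr = hasattr(os, "copy_file_range")   # Linux, Python 3.8+
    while done < length:
        n = 0
        if use_cfr:
            try:
                n = os.copy_file_range(src_fd, dst_fd, length - done,
                                       src_off + done, dst_off + done)
            except OSError:
                use_cfr = False                # not supported here, stop trying
        if n == 0:
            # Nothing was copied by copy_file_range (short return, error,
            # or unavailable): copy the next piece with an ordinary read/write.
            data = os.pread(src_fd, min(chunk, length - done), src_off + done)
            if not data:
                break                          # source ended before `length`
            n = os.pwrite(dst_fd, data, dst_off + done)
        done += n
    return done
```

Whether a given call ends up sharing extents is decided by the kernel and filesystem; the loop only guarantees that the requested bytes arrive in the destination, up to the end of the source file.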

Also reflinking ranges within the same file is allowed, only if the ranges do not overlap. Apparently my python test did have overlapping range, I tried it again in python and it also allows the same fd for in and out, provided the ranges do not overlap.

Now another question, would it be worth trying other methods in addition to copy_file_range? One reason is to support platforms that do not yet have the copy_file_range but do have ficlonerange, as copy_file_range did come later from what I recall. Any other methods?

P.S. All tests done on xfs filesystem, kernel 5.4.0-144-generic

GottZ commented 1 year ago

Hi all!

Does reflink cause source fragmentation on change? How does XFS deal with that? I assume, it keeps track of block references and just does a re-allocation of changed blocks in source with fragmentation and in really bad cases, does a COW.

If it were rsync's default in future, it could probably slow down spinning drives a lot, right? (In my humble opinion SSDs and JBODs are likely unaffected unless they use 512-byte blocks with 4096 physical in XFS due to silly user configuration)

Out of caution I'd only suggest making it a default if block size and physical block size match, or block size is a multiple of the physical block size. Additionally, only if it's not a spinning drive.

So.. kind of hardcore... gosh I love it.

newbie-02 commented 3 months ago

citing WayneD

Sounds like mtimes are not currently handled correctly.

I would like that improved. What do we want, and what is needed? For backup: a clone with identical metadata, otherwise orientation in backups becomes difficult. What do we want in order to save space? Data deduplication. Rsync would be nice, as it could even save some writes to e.g. flash drives by reading and comparing instead, extending their lifetime and speeding things up. But: '--clone-dest=xxx' sets the mtime to the current time.
'cp' can do it with -p, touch can do it with -r, so why can't we provide a clone with a cloned mtime?
Workaround - not fully tested - btrfs with 'bees', but with much effort in 'post processing', and not flash friendly!

newbie-02 commented 3 months ago

After some trial and error, as I'd really like this to work, I was close to: first do a --clone-dest backup, and then find . -exec touch -r origin/'{}' '{}' \; to apply useful metadata ... but I consider that a crutch.
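For anyone who wants to try that workaround without spawning one touch process per file, the same idea can be sketched in Python. The function name is hypothetical, and unlike the find command it only handles regular files, not directories.

```python
import os

def copy_mtimes(origin, backup):
    """Apply each origin file's atime/mtime to the same relative path in backup.

    Returns the number of files whose timestamps were updated.
    """
    updated = 0
    for root, _dirs, files in os.walk(backup):
        for name in files:
            dst = os.path.join(root, name)
            src = os.path.join(origin, os.path.relpath(dst, backup))
            if os.path.isfile(src):            # skip files missing in origin
                st = os.stat(src)
                os.utime(dst, ns=(st.st_atime_ns, st.st_mtime_ns))
                updated += 1
    return updated
```

Setting timestamps only touches metadata, so it should not break any data sharing created by the clone step.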

But then I observed that rsync -ahv --checksum --clone-dest=../backup_1/ origin/ backup_2 works, reflinking and preserving metadata. It takes some time for reading and checksumming, but a quite big backup can run in the background without blocking the system.

As I'm new in these matters I'd like it rechecked by others.

My actual strategy: uncompressed btrfs filesystem in LUKS encrypted partition, rsync with --clone-dest, -ahv and --checksum makes clones avoiding unnecessary writes, but only links 'same position' files, thus after that a run of beesd can look if there are options to improve.

Any hints for improvement and simplification welcome.

digitalsignalperson commented 2 months ago

I'm trying to understand the options for dealing with e.g. 20GB VM image file having ~20MB of changes between local reflink snapshots. This was mentioned by @sinyb as well.

the real strength of reflinks is having files that are mostly same, but with some changed blocks (VM images).

By the way, I have tried first "copy --reflink" of the destination, then using "rsync --inplace" on that copy, but that was not network efficient, because as soon as there was a change in file, rsync would transfer the rest of file, which took ages to finish...

I don't understand the "as soon as there was a change in file, rsync would transfer the rest of file". Is this referring to "The efficiency of rsync's delta-transfer algorithm may be reduced if some data in the destination file is overwritten before it can be copied to a position later in the file." from the manpage on --inplace? I thought it would work like this, where nobody else is reading/writing to the destination during the transfer:

host A: take reflink snapshot in /mnt/snap1
host B: initialize new snapshot from previous, something like: cp -ar --reflink /mnt/snap0 /mnt/snap1
host A: rsync -a --inplace /mnt/snap1/ root@host_b:/mnt/snap1/

Would that only transfer the 20MB changed?

Also I haven't seen any mention of duperemove. It supports XFS and btrfs. It allows non-identical files to share duplicated blocks they both contain.

duperemove is a simple tool for finding duplicated regions in files and submitting them for deduplication. When given a list of files it will hash their contents and compare those hashes to each other, finding and categorizing regions that match each other. When given the -d option, duperemove will submit those regions for deduplication using the Linux kernel FIDEDUPERANGE ioctl.

It's nice that it could be used after an rsync transfer, but it is costly in time and in the space temporarily consumed before the dedup. I wonder if any of their approach can be of use here for rsync.
