RsyncProject / rsync

An open source utility that provides fast incremental file transfer. It also has useful features for backup and restore operations among many other use cases.
https://rsync.samba.org

--copy-cmd option to allow use of custom copy commands #63

Open Haravikk opened 4 years ago

Haravikk commented 4 years ago

This proposal is for the addition of a new option in the following form (name debatable):

--copy-cmd=COMMAND

When provided, instead of copying a file using the default method (delta transfer), rsync will instead use the provided custom command. The command is given as a string, similar to the --rsh option. This command only applies to file copies; other actions (metadata updates, linking etc.) occur as normal.

How it actually behaves differs depending upon the nature of the source and destination:

Regardless of mode of operation, if the copy command returns a non-zero status, rsync will treat the transfer as failed and produce an error or warning. It will also produce a warning in local -> local and remote -> local modes if no file is produced at the expected destination (i.e. the command did not include the {dest} placeholder, didn't use it properly, or failed with a status of zero, which can happen if the command is several commands piped together).
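The failure handling described above can be sketched in Python. This is only an illustration under stated assumptions: `run_copy_cmd` is a hypothetical helper, not part of rsync, and it runs the command without a shell, so a piped command would have to be wrapped in something like `sh -c '...'`.

```python
import shlex
import subprocess
from pathlib import Path
from typing import Tuple


def run_copy_cmd(template: str, src: str, dest: str) -> Tuple[bool, str]:
    """Run a hypothetical --copy-cmd template for a single file.

    The {src} and {dest} placeholders come from the proposal; the helper
    itself and its return shape are assumptions made for illustration.
    """
    # Split the template first, then substitute, so that paths containing
    # spaces remain a single argument.
    argv = [part.replace("{src}", src).replace("{dest}", dest)
            for part in shlex.split(template)]
    result = subprocess.run(argv)
    if result.returncode != 0:
        # Non-zero status: the transfer is treated as failed.
        return False, f"copy command exited with status {result.returncode}"
    if not Path(dest).exists():
        # Status zero but nothing at the destination: warn.
        return False, "copy command succeeded but produced no destination file"
    return True, "ok"
```

A caller would invoke this once per file selected for transfer and report the returned message as an error or warning.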

This option would be incompatible with --append (also --inplace?). Use of this option will also disable the use of file size for comparisons by default, as a custom command may produce a differently sized output file. An option will be needed to tell rsync to explicitly retain this behaviour, e.g. --copy-size, for when the sizes should match (when --copy-cmd is used for transparent compression, cloning etc.). Comparisons by modification time, however, should work as normal no matter what the copy command produces, as rsync should still be setting the time(s) on the file afterwards.

Examples

There are a few useful examples of how you could take advantage of this command:

Local File Cloning

Some filesystems support clone/shadow-copy/reflink based zero-cost copying of files. This functions similarly to hard linking, except that each clone can be written to independently without affecting the other(s), i.e. at the time of cloning they share the same data blocks on disk, but when written to they diverge, usually thanks to copy-on-write. To take advantage of this you might use --copy-cmd like so:
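As a hedged sketch of the kind of invocation meant here, the Python below builds a cloning copy command and the rsync argument list around it. The template follows the {src}/{dest} form of the ditto example further down; `clone_copy_cmd` and `rsync_argv` are hypothetical helpers, and --copy-cmd itself is only the option proposed in this issue.

```python
def clone_copy_cmd(platform: str) -> str:
    """Return a hypothetical --copy-cmd template that clones instead of copying."""
    if platform == "darwin":
        # macOS cp: -c copies using clonefile(2) where the filesystem supports it.
        return "cp -c {src} {dest}"
    # GNU cp: --reflink=always fails rather than falling back to a full copy.
    return "cp --reflink=always {src} {dest}"


def rsync_argv(template: str, source: str, destination: str) -> list:
    """Build the argument list for the proposed (not yet existing) option."""
    return ["rsync", "-a", f"--copy-cmd={template}", source, destination]


print(rsync_argv(clone_copy_cmd("linux"), "/path/to/source", "/path/to/destination"))
```

Choosing `--reflink=always` over `--reflink=auto` is a design choice in this sketch: it makes a non-cloning filesystem show up as a failed copy command instead of a silent full copy.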

This is useful when you want to copy something for editing and want the copy to be as quick and lightweight as possible, but the plain command (cp -c or cp --reflink) doesn't offer the same flexibility that rsync does. It is also useful when you want to snapshot just a single directory, even though the full volume might support snapshots (as systems that support these commands usually do).

Invisible Compression

Some filesystems support on-demand per-file compression; for example on macOS, HFS+ and APFS both support invisible file compression. While there are patches that allow rsync to preserve this where a file is already compressed, there may be cases where you'd like to add/remove compression while copying, e.g. to ensure backups use up as little space as possible. You could do this using --copy-cmd like so:

rsync -a --copy-cmd='ditto --hfsCompression {src} {dest}' /path/to/source /path/to/destination

In this case rsync will ensure that all copied files have compression enabled where possible in the destination, even if the files were not compressed in the source.

Explicit Compression

On other filesystems, you may still wish to compress file contents when rsync'ing for backup, copying to a mobile drive etc. You could do this using --copy-cmd and gzip (or xz or similar) to compress destination files, like so:

To decompress:
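As a rough illustration of both directions, here is a Python sketch of what a gzip-based copy command would do for each file. The function names are hypothetical; in practice the proposal would pass an external command such as gzip through --copy-cmd instead of calling Python.

```python
import gzip
import shutil


def compress_copy(src: str, dest: str) -> None:
    """Write a gzip-compressed copy of src to dest."""
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def decompress_copy(src: str, dest: str) -> None:
    """Write a decompressed copy of the gzip file src to dest."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

Note that the compressed output will almost never match the source size, which is why the proposal disables size comparison by default.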

Implementation Considerations

For examples such as explicit compression, it may be useful to provide a supporting option, --copy-cmd-out-ext or similar, so that files compressed using --copy-cmd can have a customised extension. For example, with --copy-cmd-out-ext=.gz rsync remains aware of the change in name, i.e. for a file with path foo/bar/baz, rsync would treat it on the destination side as foo/bar/baz.gz, but look for both versions (in case the file was previously transferred without this extension). This would also benefit from a --copy-cmd-in-ext option for reversing the direction of a copy; it would instead tell rsync to remove the extension if found on an incoming file (foo/bar/baz.gz becomes foo/bar/baz).
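The name mapping described above could look roughly like this in Python. Both helpers are hypothetical and only model the path handling, assuming the extension is simply appended or stripped.

```python
def dest_candidates(rel_path: str, out_ext: str) -> list:
    """Destination names to look for, preferred (extended) name first."""
    return [rel_path + out_ext, rel_path]


def strip_in_ext(rel_path: str, in_ext: str) -> str:
    """Remove in_ext from an incoming path if present, else return it unchanged."""
    if in_ext and rel_path.endswith(in_ext):
        return rel_path[: -len(in_ext)]
    return rel_path


print(dest_candidates("foo/bar/baz", ".gz"))  # ['foo/bar/baz.gz', 'foo/bar/baz']
print(strip_in_ext("foo/bar/baz.gz", ".gz"))  # foo/bar/baz
```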

As a --copy-cmd may not be able to place the same guarantee on the correctness of attributes, these should be set after the copy command has been executed (does rsync already set attributes after transfer?).

edo1 commented 4 years ago

On other filesystems, you may still wish to compress file contents when rsync'ing for backup, copying to a mobile drive etc. You could do this using --copy-cmd and gzip (or xz or similar) to compress destination files, like so:

Hmm… Say we have a big file (e.g. 4 GiB) on the server.

  1. Client runs rsync server:bigfile .;
    sent 43 bytes  received 4,296,015,958 bytes  8,635,208.04 bytes/sec
    total size is 4,294,967,296  speedup is 1.00
  2. Some changes occur on the server;
    dd if=/dev/urandom of=bigfile bs=1k count=1 seek=1000000 conv=notrunc
  3. Client runs rsync server:bigfile . again. Only changes (and checksums) are transferred.
    sent 524,363 bytes  received 327,794 bytes  28,886.68 bytes/sec
    total size is 4,294,967,296  speedup is 5,040.11
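For reference, the "speedup" figures in the two runs above are the total file size divided by the bytes actually sent plus received. A small Python check using the numbers from the output (the function name is just for illustration):

```python
def speedup(total_size: int, sent: int, received: int) -> float:
    """rsync's reported speedup: total size over bytes transferred."""
    return total_size / (sent + received)


first = speedup(4_294_967_296, 43, 4_296_015_958)
second = speedup(4_294_967_296, 524_363, 327_794)
print(f"{first:.2f}")    # 1.00
print(f"{second:,.2f}")  # 5,040.11
```

The second run moves under 1 MB for a 4 GiB file, which is the delta-transfer benefit a custom copy command would give up.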

IMO such rsync behavior is incompatible with your proposal.

Haravikk commented 4 years ago

I'm not sure what you're trying to demonstrate here?

One of my opening lines is:

When provided, instead of copying a file using the default method (delta transfer), rsync will instead use the provided custom command.

The trade-off for using the custom command is that rsync can't do checksum-based transfers, but you're still gaining the full benefits of rsync's various features for finding changed files (by timestamp, and optionally size if you know it should be the same), plus filtering of transfer lists etc.

You would (and should) only use a custom copy command in cases where the benefits of doing so are well known to be better than relying on rsync's checksummed transfer algorithm, i.e. you know that changed files will need to be copied (or cloned) in their entirety.

But there are plenty of cases where the copy command itself will lack a lot of rsync's flexibility; cp, as I've used in these examples, doesn't have any of the finding/filtering/comparison options that rsync does, nor do compression tools.

There could be an argument for an intermediate transfer option. For compression, e.g., the command would be used to generate a file locally for transfer, so the per-block comparison can still occur against a previously compressed version of the file. This would need to be an additional option, though, as it won't be suitable for every command.

WayneD commented 4 years ago

Just so you know, this is fairly unlikely to be implemented. It would likely be limited to local copies only, so if I ever get around to doing some big changes to rsync's local copying workflow then I will be considering this idea as an additional feature.

Haravikk commented 4 years ago

I appreciate any consideration of this. I should stress that not all options are required; I have a tendency to overthink stuff like this, and the ability to do this with remote transfers is always something that can be done later.

In terms of local transfers the only changes that are required are:

  1. --copy-cmd to take the command. Once a file is identified for transfer, the command is executed in place of the normal behaviour. This disables comparison of files by size by default.
  2. --copy-cmd-size to re-enable comparison by size when the copy command should preserve it.
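The comparison rule implied by these two options can be sketched as follows. This is a simplified Python model, not rsync's actual quick-check code: `needs_copy` is a hypothetical name, and `compare_size=True` stands in for passing --copy-cmd-size.

```python
import os


def needs_copy(src: str, dest: str, compare_size: bool = False) -> bool:
    """Decide whether src should be handed to the copy command.

    Modification time is always compared; size only when compare_size is set.
    """
    if not os.path.exists(dest):
        return True
    s, d = os.stat(src), os.stat(dest)
    # Compare whole seconds to sidestep filesystem timestamp granularity.
    if int(s.st_mtime) != int(d.st_mtime):
        return True
    return compare_size and s.st_size != d.st_size
```

With the default, a compressed destination whose timestamp matches the source is left alone even though its size differs.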

Warnings for incompatible options (--append and --inplace) might be appropriate, but they're not required as the use of the copy command would bypass them anyway.