RsyncProject / rsync

An open source utility that provides fast incremental file transfer. It also has useful features for backup and restore operations among many other use cases.
https://rsync.samba.org

--compare-dest copies all directories, even ones that contain no changes #530

Open eharris opened 11 months ago

eharris commented 11 months ago

I'm trying to use --compare-dest to create an incremental backup that only contains changed files and directories, suitable for rsync'ing back on top of an older full backup to bring it up to date.

The problem I'm experiencing is that rsync with the --compare-dest option appears to copy ALL directories, even ones that have no changed contents.

For example, on a source directory of about 1 million items (129k dirs and 880k files):

Using rsync with --compare-dest results in a destination with 129k dirs and 135k files (including 106k dirs that have no changes).
Using rsync with --compare-dest -m reduces the copied dirs by about 9k, but still has 97k dirs that contain no changes.
The actual number of directories that contain changes is only 23k.

I have verified this using --itemize-changes (and filling in the extra dirs that are not reported as changed even though they do contain changes on a deeper branch).
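For anyone wanting a rough count of such directories in their own destination, here is a minimal sketch. It is only an approximation: it lists directories that have no non-directory entries at any depth, so it cannot tell a directory whose own metadata changed from one that is truly unchanged. The function name list_fileless_dirs is hypothetical, and -mindepth assumes GNU or BSD find.

```shell
#!/bin/sh
# Hypothetical helper: print every directory under $1 that holds no
# files (or other non-directory entries) at any depth below it.
# Assumption: path names do not contain newlines.
list_fileless_dirs() {
    find "$1" -mindepth 1 -type d | while IFS= read -r dir; do
        # A directory is "fileless" when nothing but directories lives under it.
        if [ -z "$(find "$dir" ! -type d -print | head -n 1)" ]; then
            printf '%s\n' "$dir"
        fi
    done
}
```

Running list_fileless_dirs dest/ | wc -l gives the count. The sketch runs one find per directory, so on a tree with 100k+ directories it will be slow.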

To me, this behavior of --compare-dest seems wrong. Why is it preserving all the directories, including ones that contain no changes? And why does -m (which is undesirable anyway, since it may "lose" directories that actually have changed, such as one with a different mtime) still preserve so many unchanged directories that should be empty (and would be if other unchanged sub-directories had not also been improperly copied)?

(Side note: the cleanup that -m should do but doesn't can be performed by a subsequent find dir/ -depth -type d -print0 | xargs -0 -- rmdir --ignore-fail-on-non-empty. However, removing an empty directory updates its parent's mtime, so the surviving parent directories end up with the wrong mtime.)
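One possible way around the mtime side effect is to save each parent's timestamp before the rmdir and put it back afterwards. The following is a minimal sketch under stated assumptions, not a tested replacement for -m: prune_empty_dirs is a hypothetical helper name, path names are assumed to contain no newlines, and -mindepth assumes GNU or BSD find.

```shell
#!/bin/sh
# Hypothetical helper: remove empty directories under $1, bottom-up,
# restoring each parent's timestamps after a successful rmdir.
prune_empty_dirs() {
    root=$1
    stamp=$(mktemp) || return 1   # scratch file that carries the saved timestamps
    # -depth visits children before parents, so chains of empty dirs collapse.
    find "$root" -mindepth 1 -depth -type d | while IFS= read -r dir; do
        parent=$(dirname "$dir")
        touch -r "$parent" "$stamp"        # remember the parent's timestamps
        if rmdir "$dir" 2>/dev/null; then  # succeeds only when dir is empty
            touch -r "$stamp" "$parent"    # undo the mtime bump caused by rmdir
        fi
    done
    rm -f "$stamp"
}
```

Usage would be prune_empty_dirs dest/ after the rsync run. Like -m, this removes every empty directory, so it still drops directories whose only change was to their own metadata.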

In an attempt to work around this, I have written a python script to try to get rid of all these unnecessary and unchanged directories by processing the output of --itemize-changes, and then giving that list of files/dirs to rsync as an explicit --include-from filter list.

The problem with this approach is that rsync gets massively slower and becomes cpu-bound when given a full filter list of items to include. In my testing with the same sources above, using rsync --compare-dest onto an already populated destination (fully cached and quiescent system) results in a run that takes less than 90 seconds. With the same conditions but using a --include-from filter list that includes 158k rules/items, the same run takes over 30 minutes, over 20 times slower, even though the destination contains over 100k fewer items (all directories).
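To illustrate the kind of filter list involved (this is not the actual script, which was written in Python), here is a minimal sketch that turns a list of changed relative paths into anchored include patterns. Every ancestor directory is listed once, because rsync will not descend into a directory that the filter rules exclude. The function name paths_to_include_list is hypothetical; the input is assumed to be one path per line with directories ending in a slash, and names containing wildcard characters (*, ?, [) would need escaping that this sketch does not do.

```shell
#!/bin/sh
# Hypothetical helper: read changed paths (relative to the transfer root)
# on stdin and print anchored include patterns, ancestors first.
paths_to_include_list() {
    awk '
    # Print each pattern only the first time it is seen.
    function emit(pattern) {
        if (!(pattern in seen)) { seen[pattern] = 1; print pattern }
    }
    {
        n = split($0, part, "/")
        prefix = ""
        # Emit every ancestor directory so rsync can descend to the path.
        for (i = 1; i < n; i++) {
            prefix = prefix "/" part[i]
            emit(prefix "/")
        }
        # A trailing slash leaves an empty last component: a directory,
        # which the loop above has already emitted.
        if (part[n] != "") emit(prefix "/" part[n])
    }'
}
```

The resulting file would then be passed as --include-from=FILE followed by --exclude='*'. This only shows the shape of the list; it does nothing about the slowdown described above.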

I think that --compare-dest needs to be fixed to NOT copy directories that do not contain any changes at the current or any deeper level.

This is using rsync version 3.2.7 on Debian 11.

tmknight commented 11 months ago

Is it possible you are quoting the directory path passed to --compare-dest=? Just today I resolved an issue similar to the OP's by removing the quotes around the explicit path (which was a variable, but it also didn't work as a literal string), which IMO is a bug.

It might be helpful if you shared the rsync command that produces the outcome you're reporting.

eharris commented 11 months ago

Here's a super simplified case to reproduce:

> mkdir test
> mkdir test/nochange
> cp -a test comp
> rsync -av --delete --compare-dest=`pwd`/comp/ test/ dest/
sending incremental file list
created directory dest

sent 91 bytes  received 40 bytes  262.00 bytes/sec
total size is 0  speedup is 0.00
> ls dest/
nochange

As you can see, the nochange directory was created/copied, even though it is identical (due to the cp -a) in the comp/ directory. Interestingly, rsync does not report creating it even though -v is active. (The pwd substitution was necessary to make the --compare-dest path absolute.)

tmknight commented 11 months ago

Try it with --omit-dir-times

eharris commented 11 months ago

@tmknight no change, as would be expected from the example I gave previously.

tmknight commented 11 months ago

I made some assumptions about your test. Try this:

mkdir -p src/test dst cmp
echo this is a test > src/test0.txt
echo this is a test > src/test/test.txt
cp -a src/test cmp/
rsync -a --compare-dest=`pwd`/cmp/ src/ dst/

It is known that the directories are created whilst rsync traverses.

eharris commented 11 months ago

@tmknight I don't understand what the point of your test case is, as it doesn't test the case that is the problem I'm trying to get addressed, which is that directories that have not changed (and have no descendants that have changed) are being copied/created in the destination even though they already exist with the exact same metadata in the --compare-dest target.

Your assertion that "it is known" to behave this way of course makes sense when --compare-dest is not in effect, since it should be making the destination identical to the source. The point of this ticket is to address the problem that it does NOT make sense to copy/create empty directories in the destination when no leaves below them have any changes (files OR directories) that are not already present in the --compare-dest target(s). Yes, this may make the traversal a bit more complicated, since ancestor-directory creation will need to be delayed until a descendant difference is found, but that seems like it should be a solvable problem.

It also makes no sense that the creation of those empty and unnecessary directories is not reflected in the output when -v is in effect.