RsyncProject / rsync

An open source utility that provides fast incremental file transfer. It also has useful features for backup and restore operations among many other use cases.
https://rsync.samba.org

detect renamed files and avoid file transfer #590

Open tridge opened 3 months ago

tridge commented 3 months ago

This is a copy of the old Bugzilla issue: https://bugzilla.samba.org/show_bug.cgi?id=2294. This would certainly be a big win in many cases. It is complicated by the incremental way the hashes are calculated (we don't hash the full file list before starting transfers).

tridge commented 3 months ago

Note that --fuzzy partially handles this issue; the key problem is that it only looks in the same directory. Extending it to look across the whole destination tree, perhaps sorting by file size for faster matching, would make it more useful.
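One possible reading of the size-based idea, as a rough sketch only: a renamed file must have the same size as the missing one, so size can act as a cheap pre-filter before any hashing. The helper name `same_size_candidates` is hypothetical, and this is not how --fuzzy is implemented in rsync.

```shell
# same_size_candidates FILE DESTROOT
# List regular files anywhere under DESTROOT whose size in bytes equals
# that of FILE. Hypothetical helper, for illustration only.
same_size_candidates() {
    # $(( ... )) strips the padding some wc implementations add
    size=$(( $(wc -c < "$1") ))
    # "-size Nc" matches files of exactly N bytes
    find "$2" -type f -size "${size}c"
}
```

Only the files listed this way would then need a closer comparison (modification time or checksum).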

Tunoac commented 3 weeks ago

I would like to propose a --fuzzy2 option in rsync, which also considers the entire tree.

I have a prototype written in awk, which I currently run before the actual rsync run.

It's not perfect and could probably be done better, but it works and saves me a lot when renaming files or folders. It only moves files, not folders. The awk delta calculation for 10K files in the source against 10K files in the target takes about 0.3 s.

The required folder tree must be created in the target first, e.g. with rsync :-)

    rsync -a --include='*/' --exclude='*' "${sourcepath}" "${targetpath}"

AWK: the source and target file info must each be put into an array:

array format:

    Files_last_modification_time _ filesize filepath
    e.g.
    1718541359.8524070000t-147  /home/claus/.bashrc
    1717861293.8939940000t-57   /home/claus/.bash_profile

The first column is the key ID, here modification time plus size. The second column is the file path. When using --checksum, the time+size key must be replaced by a hash.
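A list in roughly this format can be produced with find. The following is a minimal sketch, assuming GNU find (for -printf); the helper name `list_files`, the `-` separator inside the key and the tab between key and path are my own choices, not taken from the prototype.

```shell
# list_files ROOT
# Print one "key<TAB>path" line per regular file under ROOT, where
# key = <mtime>-<size> and path is relative to ROOT.
# Hypothetical helper; needs GNU find for -printf.
list_files() {
    # %T@ = modification time in seconds since the epoch (with fraction)
    # %s  = size in bytes
    # %P  = path with the starting point removed
    (cd "$1" && find . -type f -printf '%T@-%s\t%P\n')
}
```

Two different files that share the same modification time and size produce the same key, so such a key cannot tell them apart.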

populate the array:

aa  source array
bb  target array

awk main loop (x is the key ID, aa[x] is the source file path, bb[x] is the target file path):

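    # targetpath is assumed to be an awk variable, e.g. set with -v
    for (x in aa) {
        if (x in bb) {
            # same key on both sides but a different path: renamed or moved
            if (aa[x] != bb[x]) {
                print "mv --no-clobber \"" targetpath bb[x] "\" \"" targetpath aa[x] "\""
            }
            delete bb[x]
        }
    }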

After reviewing and executing the proposed mv commands, I run the real rsync, which cleans up remaining things.
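For anyone who wants to try the idea, the steps can be combined roughly as follows. This is a sketch under assumptions, not the original prototype: the function name `propose_moves`, the S/T tag column and the skipping of ambiguous keys are my additions, GNU find is required, and paths containing tabs, newlines, backslashes or single quotes are not handled.

```shell
# propose_moves SRC DST
# Print (but do not run) mv commands for files in DST whose mtime+size key
# matches a file in SRC that is stored under a different relative path.
# Hypothetical helper; needs GNU find for -printf.
propose_moves() {
    {
        # Tag each line with S (source) or T (target) so a single awk
        # process can fill both arrays.
        (cd "$1" && find . -type f -printf 'S\t%T@-%s\t%P\n')
        (cd "$2" && find . -type f -printf 'T\t%T@-%s\t%P\n')
    } | awk -F'\t' -v dst="$2" '
        $1 == "S" { aa[$2] = $3; na[$2]++ }   # source: key -> path
        $1 == "T" { bb[$2] = $3; nb[$2]++ }   # target: key -> path
        END {
            for (x in aa)
                # skip keys that are not unique on both sides
                if ((x in bb) && na[x] == 1 && nb[x] == 1 && aa[x] != bb[x])
                    printf "mv --no-clobber '\''%s/%s'\'' '\''%s/%s'\''\n",
                           dst, bb[x], dst, aa[x]
        }'
}
```

The output is meant to be reviewed before it is executed, for example with `propose_moves "$sourcepath" "$targetpath" > moves.sh`. Keys only match when the target preserves modification times at the same precision as the source.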