adrianlopezroche / fdupes

FDUPES is a program for identifying or deleting duplicate files residing within specified directories.

option to call FIDEDUPERANGE on identical files #89

Open kilobyte opened 6 years ago

kilobyte commented 6 years ago

Using symlinks and hardlinks for deduplication tends to be a bad idea. Linking is open to races, and it changes the semantics of both copies: if you edit one of the files, the other changes as well.

Thus, Linux has a new ioctl, FIDEDUPERANGE. It's currently implemented on btrfs, xfs and ocfs2; it's expected on zfs too. It takes a hint from userspace that a range (fdA, offsetA, length) is the same as a range (fdB, offsetB, length), compares them byte-by-byte, and, if they match, reflinks them to the same on-disk extents. This avoids any races. Reflinks also have the desired semantics: the files stay fully independent from the user's point of view and merely take less disk space.
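For reference, here is a minimal sketch of what such a call might look like from C. The helper name `dedupe_range` and its return convention are hypothetical and not fdupes code; the struct and constants come from `<linux/fs.h>`. It assumes the source is open for reading, the destination is open for writing, and both live on a filesystem that supports the ioctl.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FIDEDUPERANGE, struct file_dedupe_range */

/*
 * Hypothetical helper: ask the kernel to share extents between
 * src_fd[src_off, src_off + len) and dst_fd[dst_off, dst_off + len).
 *
 * Returns  0 if the kernel found the ranges identical and deduplicated them
 *            (*deduped receives the byte count it actually processed),
 *          1 if the kernel found the ranges to differ,
 *         -1 on error, with errno set.
 */
int dedupe_range(int src_fd, uint64_t src_off,
                 int dst_fd, uint64_t dst_off,
                 uint64_t len, uint64_t *deduped)
{
    assert(deduped != NULL);

    /* The request is a header followed by one info record per destination. */
    struct file_dedupe_range *req =
        calloc(1, sizeof(*req) + sizeof(struct file_dedupe_range_info));
    if (req == NULL)
        return -1;

    req->src_offset = src_off;
    req->src_length = len;
    req->dest_count = 1;
    req->info[0].dest_fd = dst_fd;
    req->info[0].dest_offset = dst_off;

    int rc = -1;
    if (ioctl(src_fd, FIDEDUPERANGE, req) == 0) {
        int32_t status = req->info[0].status;
        if (status == FILE_DEDUPE_RANGE_SAME) {
            *deduped = req->info[0].bytes_deduped;
            rc = 0;
        } else if (status == FILE_DEDUPE_RANGE_DIFFERS) {
            rc = 1;
        } else {
            errno = -status;   /* per-destination failure: negated errno */
        }
    }

    int saved = errno;         /* keep errno intact across free() */
    free(req);
    errno = saved;
    return rc;
}
```

The kernel may process fewer bytes than requested, so a whole-file caller would loop, advancing both offsets by the reported count until the full length is covered.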

https://github.com/markfasheh/duperemove has a --fdupes mode where it takes fdupes' output to do whole-file deduplication, but it'd be more convenient to have fdupes call the ioctl directly. It'd also remove the need to compare the files byte-by-byte twice.

ddawson commented 4 years ago

This would be very useful. It could make fdupes a reasonable replacement for the now-defunct bedup. Currently, the best options I know of are to use either duperemove by itself, which dedupes individual extents (and may therefore increase fragmentation, which is not necessarily good for HDD performance), or combine it with fdupes using duperemove's --fdupes option.

The latter option will yield acceptable results in the end. However, the procedure performs an extra comparison of file contents that is unnecessary in this case: it is sufficient to assume that files with the same size and hash are identical, since the kernel must verify this anyway. Unless I am mistaken, the only potential problem would be a slight time loss in the rare event of a false positive, which would still be better than comparing every matching-hash pair twice. For this reason, I was considering requesting an option to skip the direct comparison step (provided the --delete option is not being used, of course), and I have made my own patch against 1.6.1. But this feature would work just as well, for me at least. In either case, I think skipping direct comparison would be helpful.
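To illustrate the idea, here is a rough sketch of a size-plus-hash candidate check that deliberately stops short of a byte-by-byte comparison. The function names are hypothetical, and FNV-1a is only a stand-in hash chosen to keep the example self-contained; it is not the hash fdupes uses, and this is not the patch mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

/* FNV-1a over a whole file (stand-in hash, for illustration only).
 * Returns 0 on success, -1 if the file cannot be read. */
static int hash_file(const char *path, uint64_t *out)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    uint64_t h = 14695981039346656037ULL;    /* FNV offset basis */
    unsigned char buf[65536];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        for (size_t i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 1099511628211ULL;            /* FNV prime */
        }
    }

    int err = ferror(f);
    fclose(f);
    if (err)
        return -1;
    *out = h;
    return 0;
}

/* Returns 1 if a and b have equal size and equal hash (dedupe candidates),
 * 0 if they do not, -1 on I/O error. No byte-by-byte comparison is done;
 * that step is left to whatever verifies the pair afterwards. */
int is_dedupe_candidate(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    struct stat sa, sb;
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    if (sa.st_size != sb.st_size)
        return 0;              /* cheap rejection before reading any data */

    uint64_t ha, hb;
    if (hash_file(a, &ha) != 0 || hash_file(b, &hb) != 0)
        return -1;
    return ha == hb;
}
```

A hash collision here only costs a wasted dedupe request, since the kernel rejects ranges that differ. That is also why such a shortcut would be unsafe for anything destructive like --delete.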