adrianlopezroche / fdupes

FDUPES is a program for identifying or deleting duplicate files residing within specified directories.

Detect duplicate subtrees #77

Open Harvie opened 7 years ago

Harvie commented 7 years ago

I know this will probably be kind of hard to implement, but it would be cool if we could detect duplicate trees/directories. Currently, if you have two completely identical directories with completely identical files, fdupes never tells you so, and you have to delete them file by file. You are also left with an empty directory.

But what if we made a directory hash based on the hashes of all files in the directory (ignoring filenames)? Then we could even apply this recursively to subdirectories to detect whole duplicate trees. We might then need to remove the individual files from the dupes output so that it gets shorter (which is the point of this).

I am trying to handle a MASSIVE amount of duplicates (the fdupes output file is 18 MB). I guess this could be greatly reduced if we managed to find duplicate subtrees, because a lot of this is the traditional "backups of backups of backups of backups". If I could handle whole subtrees as one item rather than file by file, it would greatly reduce the effort needed to dedup such storage.

Harvie commented 7 years ago

Surely the hashes of the files have to be sorted. And I don't want to delete a whole directory if there's one different file, because I don't want to lose that file. In such a case you will have to work it out file by file.
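To make the idea concrete, here is a minimal Python sketch of the scheme described above. It is not fdupes code: the function names (`file_digest`, `tree_digests`, `duplicate_subtrees`) and the choice of SHA-256 are assumptions made for illustration. A directory's digest is the hash of the sorted digests of its files and subdirectories, so filenames and listing order are ignored, and a single differing file changes the digest of every directory above it. Subtrees already covered by a duplicate parent are left out of the report to keep it short.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map every directory under root (inclusive) to a content digest."""
    digests = {}
    # Bottom-up walk: subdirectories are visited before their parents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        children = []
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only regular files count; symlinks are skipped.
            if os.path.isfile(path) and not os.path.islink(path):
                children.append("f:" + file_digest(path))
        for name in dirnames:
            path = os.path.join(dirpath, name)
            # Symlinked directories are not walked, so they have no digest.
            if path in digests:
                children.append("d:" + digests[path])
        h = hashlib.sha256()
        # Sorting makes the result independent of names and listing order.
        for child in sorted(children):
            h.update(child.encode("ascii") + b"\n")
        digests[dirpath] = h.hexdigest()
    return digests


def duplicate_subtrees(root):
    """Return sorted groups of directories whose contents hash identically."""
    digests = tree_digests(root)
    groups = defaultdict(list)
    for path, digest in digests.items():
        groups[digest].append(path)

    result = []
    for paths in groups.values():
        if len(paths) < 2:
            continue
        parents = [os.path.dirname(p) for p in paths]
        parent_digests = {digests.get(p) for p in parents}
        # Skip a group when each member sits in a different parent and those
        # parents are themselves duplicates of one another: the parents'
        # group already covers it.
        implied = (
            len(set(parents)) == len(paths)
            and len(parent_digests) == 1
            and None not in parent_digests
        )
        if not implied:
            result.append(sorted(paths))
    return sorted(result)
```

Calling `duplicate_subtrees("/some/backup/root")` (a hypothetical path) returns a list of groups, each a list of directories with matching digests. Note that all empty directories hash to the same value and will be grouped together, and that matching hashes should still be confirmed with a real comparison before anything is deleted.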

"using file finding and text processing tools and temp files"

WOW! With such an approach fdupes would never have existed, because it can be completely replaced with "file finding and text processing tools and temp files". But it's just easier to have a state-of-the-art tool that doesn't require any ad hoc programming to get stuff done.

hellyberry commented 7 years ago

Such subtree detection would be a tremendous time saver and also a security feature for efficiently handling backups of backups(-of-b....), a frequent use case.

pabloab commented 6 years ago

Would this be something similar to rmlint -T dd, as mentioned here?