birkenfeld / fddf

Fast data dupe finder
Apache License 2.0

Ability to detect duplicate folders #4

Open Boscop opened 7 years ago

Boscop commented 7 years ago

If all files in a folder have dupes in another folder, the output gets very verbose, and it's hard to tell from looking at it that the whole folder is duplicated. It would be very helpful if fddf could summarize such cases as folder dupes (or as one folder being a subset of another), because my primary use case is figuring out which files I can/should delete. Being able to decide at the folder level would reduce the time it takes to sort through all the dupes.

Btw, here's a result I got; it took 12 minutes and consumed 70 MB of RAM on Windows 8.1 64-bit. Most files in that folder are small (<100 KB, and the larger ones aren't much larger):

Overall results:
    16963 groups of duplicate files
    32744 files are duplicates
    1.2 GiB of space taken by duplicates

birkenfeld commented 7 years ago

Good idea! I'll consider this for the next version.

As for the results, the timing will depend almost exclusively on I/O speed if the files aren't already in the OS cache. A second run should be faster, although that depends on OS caching behavior, which I don't know very well for Windows.

Boscop commented 3 years ago

I'm still very interested in this feature :) One way it could work: each folder gets a hash computed from the hashes of its contents (files and subfolders). You could then detect duplicate folders by storing the folder hashes in a HashMap<Hash, PathBuf> (pseudocode) and iterating over all folders: if a folder's hash is already in the map under a different path, it's a duplicate folder. (Or use HashMap<Hash, Vec<PathBuf>> to collect all duplicate folders for each hash.)

This would only find exact dupes, which would be enough for my use case (deduplicating backed-up folders from years of unorganized manual backups). For detecting near-dupes, it would be better to compare each folder with other folders of the same name. (Another approach would be to use algorithms for graph similarity / subgraph matching.)
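To make the idea concrete, here is a minimal sketch of the exact-duplicate variant using only the standard library. It is not fddf's code: the function names (`hash_file`, `hash_dir`, `find_duplicate_dirs`) are made up for illustration, and `DefaultHasher` is only a stand-in for whatever content hash the tool would really use. It assumes child hashes are sorted before being combined, so file names and listing order don't affect a folder's hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::{Path, PathBuf};

/// Hash a file's contents (hypothetical helper).
/// Assumption: DefaultHasher stands in for a real content hash, and the
/// whole file is read into memory, which is fine for a sketch only.
fn hash_file(path: &Path) -> io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    fs::read(path)?.hash(&mut hasher);
    Ok(hasher.finish())
}

/// Recursively hash `dir` from the hashes of its files and subfolders,
/// recording every visited folder in `seen` under its hash.
fn hash_dir(dir: &Path, seen: &mut HashMap<u64, Vec<PathBuf>>) -> io::Result<u64> {
    let mut child_hashes = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        let file_type = entry.file_type()?;
        if file_type.is_dir() {
            child_hashes.push(hash_dir(&path, seen)?);
        } else if file_type.is_file() {
            child_hashes.push(hash_file(&path)?);
        }
        // Symlinks and other special entries are ignored in this sketch.
    }
    // Sort so the result doesn't depend on directory listing order.
    child_hashes.sort_unstable();

    let mut hasher = DefaultHasher::new();
    child_hashes.hash(&mut hasher);
    let dir_hash = hasher.finish();

    // HashMap<Hash, Vec<PathBuf>>: collect all folders sharing a hash.
    seen.entry(dir_hash).or_default().push(dir.to_path_buf());
    Ok(dir_hash)
}

/// Return groups of folders under `root` that have identical contents.
fn find_duplicate_dirs(root: &Path) -> io::Result<Vec<Vec<PathBuf>>> {
    let mut seen = HashMap::new();
    hash_dir(root, &mut seen)?;
    let mut groups: Vec<Vec<PathBuf>> = seen
        .into_values()
        .filter(|paths| paths.len() > 1)
        .collect();
    // Sort for deterministic output.
    for group in &mut groups {
        group.sort();
    }
    groups.sort();
    Ok(groups)
}
```

Calling `find_duplicate_dirs` on a backup root would return one `Vec<PathBuf>` per set of identical folders. Two caveats with this sketch: all empty folders hash the same and so get grouped together, and when two folders match, their subfolders are reported as separate groups as well, so a real implementation would probably report only the topmost match.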