adrianlopezroche / fdupes

FDUPES is a program for identifying or deleting duplicate files residing within specified directories.

Possible speed improvement: Don't compute hash of whole big file at first #79

Open Harvie opened 7 years ago

Harvie commented 7 years ago

I am using fdupes on a 6 TB archive of audio files (usually ranging from 100 MB to 3 GB) and it takes around 30 hours to complete. I have an idea of how processing of such big files could be sped up.

Compute the MD5 sum of only the first 1 MB of the file. If the file is bigger than 1 MB, store a flag indicating that the hash is truncated, and store the file size along with the hash. Later, when more files share the same truncated hash and the same size, compute full-file hashes to check whether the files are really duplicates over their full length.

This would avoid computing hashes of big files when it is not really necessary (i.e. when no other file has the same size and the same first 1 MB).
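For illustration, here is a minimal sketch of the first stage of that two-stage check, using OpenSSL's MD5 routines and a hypothetical partial_md5() helper; fdupes ships its own MD5 code, and none of these names are part of it:

```c
/* Sketch of the proposed partial hash, assuming OpenSSL's MD5 for
 * illustration. All names here are hypothetical, not fdupes internals. */
#include <openssl/md5.h>
#include <stdio.h>

#define PARTIAL_LIMIT (1024 * 1024)  /* hash only the first 1 MB */

/* Hash at most `limit` bytes from the start of the file.
 * Returns the number of bytes actually hashed, or -1 on error. */
static long partial_md5(const char *path, size_t limit,
                        unsigned char digest[MD5_DIGEST_LENGTH])
{
    unsigned char buf[65536];
    MD5_CTX ctx;
    size_t hashed = 0;
    FILE *f = fopen(path, "rb");

    if (f == NULL)
        return -1;

    MD5_Init(&ctx);
    while (hashed < limit) {
        size_t want = limit - hashed < sizeof buf ? limit - hashed : sizeof buf;
        size_t got = fread(buf, 1, want, f);
        if (got == 0)
            break;                       /* EOF or read error */
        MD5_Update(&ctx, buf, got);
        hashed += got;
    }
    MD5_Final(digest, &ctx);
    fclose(f);
    return (long)hashed;
}

/* Comparison order: file size first, then partial_md5(), and only when
 * both match fall back to hashing the whole file. */
```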

Further improvements to this approach:

1.) Truncate to 1 MB only when the file is bigger than 3 MB. It doesn't make much sense to truncate a 1.2 MB file to 1 MB and then compute the whole hash again anyway.

2.) Instead of taking just the first 1 MB, concatenate 0.3 MB from the beginning, 0.3 MB from the middle and 0.3 MB from the end, and hash that (see the sketch after this list). This should help even when there are lots of big, different files that share a header or footer. It should be configurable, because I am not sure that seeking within the file won't degrade performance.

3.) Fiddle with the values. The big-file limit (3 MB) and the truncate length (1 MB) should be configurable so we can experiment and find the most effective defaults. Maybe 1 MB is too much; maybe truncating to 1024 B for files bigger than 1 MB would be enough.
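A rough, purely hypothetical sketch of points 2 and 3: the chunk size and big-file threshold are plain constants so they are easy to experiment with, and the helper names are made up for this comment:

```c
/* Hypothetical sketch of sampled hashing: hash three fixed-size chunks
 * (head, middle, tail) of a large file. CHUNK_SIZE and BIGFILE_MIN are
 * the tunables point 3 talks about. */
#include <openssl/md5.h>
#include <stdio.h>
#include <sys/stat.h>

#define CHUNK_SIZE  (300 * 1024)        /* 0.3 MB per sample            */
#define BIGFILE_MIN (3L * 1024 * 1024)  /* only sample files above 3 MB */

/* Feed `len` bytes starting at `offset` into the running MD5 context. */
static void hash_region(FILE *f, long offset, size_t len, MD5_CTX *ctx)
{
    unsigned char buf[65536];
    size_t done = 0;

    if (fseek(f, offset, SEEK_SET) != 0)
        return;
    while (done < len) {
        size_t want = len - done < sizeof buf ? len - done : sizeof buf;
        size_t got = fread(buf, 1, want, f);
        if (got == 0)
            break;
        MD5_Update(ctx, buf, got);
        done += got;
    }
}

/* Returns 0 on success; the digest covers head + middle + tail samples. */
static int sampled_md5(const char *path,
                       unsigned char digest[MD5_DIGEST_LENGTH])
{
    struct stat st;
    MD5_CTX ctx;
    FILE *f;
    long size;

    if (stat(path, &st) != 0 || st.st_size < BIGFILE_MIN)
        return -1;                       /* small files: hash them whole */
    if ((f = fopen(path, "rb")) == NULL)
        return -1;
    size = (long)st.st_size;

    MD5_Init(&ctx);
    hash_region(f, 0, CHUNK_SIZE, &ctx);                          /* head   */
    hash_region(f, size / 2 - CHUNK_SIZE / 2, CHUNK_SIZE, &ctx);  /* middle */
    hash_region(f, size - CHUNK_SIZE, CHUNK_SIZE, &ctx);          /* tail   */
    MD5_Final(digest, &ctx);

    fclose(f);
    return 0;
}
```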

wmertens commented 7 years ago

In fact, wouldn't just storing the length already help? Only when you encounter two files of the same size would you calculate the header/footer/middle/full checksums.
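As a standalone illustration of that ordering (independent of how fdupes actually organizes its candidates internally), one could sort the candidate list by size and only hash runs of equal size; the file_entry type and function names below are hypothetical:

```c
/* Sketch of size-first grouping: sort candidate files by size and only
 * bother hashing runs that share a size. Not fdupes internals. */
#include <stdlib.h>
#include <sys/stat.h>

struct file_entry {
    const char *path;
    off_t size;
};

static int by_size(const void *a, const void *b)
{
    off_t sa = ((const struct file_entry *)a)->size;
    off_t sb = ((const struct file_entry *)b)->size;
    return (sa > sb) - (sa < sb);
}

/* Only entries inside a run of equal sizes ever reach the hashing stage. */
static void find_size_collisions(struct file_entry *files, size_t count)
{
    size_t i = 0;

    qsort(files, count, sizeof files[0], by_size);
    while (i < count) {
        size_t j = i + 1;
        while (j < count && files[j].size == files[i].size)
            j++;
        if (j - i > 1) {
            /* 2+ files share this size: compute partial, then full hashes */
        }
        i = j;
    }
}
```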

Harvie commented 7 years ago

But when fdupes is computing hashes, the hashing rate varies quite a lot. My only explanation is that it performs poorly on bigger files. Is there any other possible reason why some files are hashed quickly (a few hashes per second) while others take several seconds each? The obvious first candidate is caching, but the archive is so large compared to RAM that I don't think that is the case. Another possibility is that hashing time simply differs a lot between files, and size is the only file attribute that makes sense to me there. Or maybe fdupes has trouble storing large numbers of hashes; I don't know how they are stored, but perhaps it slows down once there are many hashes in its database, or it stalls from time to time while more space is allocated.

run-dey commented 6 years ago

> Later, when more files share the same truncated hash and the same size, compute full-file hashes to check whether the files are really duplicates over their full length.

There are cases where this is not necessary, because the user knows beforehand that the truncated hash is enough to establish the file's identity. I suggest an option to skip this phase and treat the truncated hash (of configurable length and location) as the full hash.

Sent from a disposable GitHub account.

adrianlopezroche commented 6 years ago

Never going to happen, run-dey, but thank you for your suggestion.

run-dey commented 6 years ago

I imagine you consider that feature out of scope because its presence would imply that fdupes is not being 100% thorough in determining identity? I wouldn't see it that way; it simply means that part of the responsibility for ensuring the files' identity falls on the user, who vouches for the truncated hash's adequacy.

Or do you not want to set a precedent for custom identity-criteria features? That is reasonable. In that case, maybe consider an option to pass a custom handler (e.g. the path to a shell script) responsible for returning a fingerprint for each encountered file, while retaining the filesystem traversal infrastructure, deletion interface, etc. In the absence of such a handler, the full hash would be the fingerprint.

(Edit: in fact, you could even require that the handler return a hash of its output, computed with the same hash function fdupes currently uses, so that you don't have to support variable-length fingerprints in the core if that is troublesome.)
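To make the idea concrete, such a handler could be invoked once per file and its single line of output taken as the fingerprint. A hypothetical sketch (the helper name and the one-line-of-output convention are assumptions for illustration, not anything fdupes provides):

```c
/* Hypothetical sketch of an external fingerprint handler: run a
 * user-supplied command once per file and use its first output line as
 * the file's fingerprint. */
#include <stdio.h>
#include <string.h>

/* Returns 0 on success and copies the handler's first output line
 * (trailing newline stripped) into `fingerprint`. */
static int external_fingerprint(const char *handler, const char *path,
                                char *fingerprint, size_t len)
{
    char cmd[4096];
    FILE *p;

    /* Note: a real implementation must quote/escape `path` safely. */
    snprintf(cmd, sizeof cmd, "%s '%s'", handler, path);
    p = popen(cmd, "r");
    if (p == NULL)
        return -1;
    if (fgets(fingerprint, (int)len, p) == NULL) {
        pclose(p);
        return -1;
    }
    fingerprint[strcspn(fingerprint, "\n")] = '\0';
    return pclose(p) == 0 ? 0 : -1;
}
```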

adrianlopezroche commented 6 years ago

It adds complexity for the sake of a feature most people should not be using. I believe in protecting users from themselves.

run-dey commented 6 years ago

> protecting users from themselves

This concept hardly applies to command-line/scripting programs. Command-line parameters are opt-in; there is no GUI in which the user is tempted to click buttons at random for exploration's sake, with unknown results. A parameter stays hidden until it is read about in the help page and thus understood. It could even have no short form, e.g. no -h, only --footprint-handler.

> a feature most people should not be using

Why not? Identity need not necessarily be understood in total terms. There are plenty of use cases in which identity is understood only as a function of a file's contents, f(x) rather than the whole of x. For instance, one may wish to find all duplicates of a text document based on its extracted title. Not to mention that such customization would allow support for arbitrary hash algorithms, which more than one person has apparently opened an issue about.

Harvie commented 6 years ago

@run-dey maybe open a separate ticket about supporting external hashing scripts?

springfielddatarecovery commented 5 years ago

This would be a great feature. I have seen other de-duplication programs that first check the first x bytes of a file; if there is a match, they check the last x bytes, then the middle x bytes, and only if all of those match do they read the entire file. The reason to check the middle is that some file formats always have the same first and last bytes (headers, etc.) but won't have the same middle.

quixand commented 4 years ago

Something like https://github.com/MatthewJohn/shamean?

jbruchon commented 4 years ago

The program already does partial hash exclusion. Please stop requesting a feature that already exists in the code. If you want to do dangerous things like considering partial hashes as full file comparisons, fork the code, add that "feature," and destroy your data yourself.