adrianlopezroche / fdupes

FDUPES is a program for identifying or deleting duplicate files residing within specified directories.
2.43k stars 187 forks source link

Proposal for new core digest algorithm: BLAKE2 instead of MD5 #86

Open hellyberry opened 7 years ago

hellyberry commented 7 years ago

Maybe the BLAKE2-algorithm solves both needs in software, fast and secure. It is 256 bit in size, cryptographically sound, so low chance of 'false' collisions and relatively high speed. Maybe for examining MD5 potential collisions, one can create a BLAKE2 digest of the files in question, this should be quicker than byte-by-byte comparison and is cryptographically secure. See http://blake2.net

hellyberry commented 7 years ago

And the good thing, there's ready-to-use parallelized code for SMP systems (on github), so running 4/8 threads is possible. This should speed up quite a bit.

QUOTE: BLAKE2 is a cryptographic hash function faster than MD5, SHA-1, SHA-2, and SHA-3, yet is at least as secure as the latest standard SHA-3. BLAKE2 has been adopted by many projects due to its high speed, security, and simplicity.

BLAKE2 is specified in RFC 7693, and our code and test vectors are available on GitHub, licensed under CC0 (public domain-like). BLAKE2 is also described in the 2015 book The Hash Function BLAKE.

BLAKE2 comes in two flavors:

BLAKE2b (or just BLAKE2) is optimized for 64-bit platforms—including NEON-enabled ARMs—and produces digests of any size between 1 and 64 bytes

BLAKE2s is optimized for 8- to 32-bit platforms and produces digests of any size between 1 and 32 bytes

BLAKE2 includes the 4-way parallel BLAKE2bp and 8-way parallel BLAKE2sp designed for increased performance on multicore or SIMD CPUs. BLAKE2 offers these algorithms tuned to your specific requirements, such as keyed hashing (that is, MAC or PRF), hashing with a salt, updatable or incremental tree-hashing, or any combination thereof. These versions are specified in the BLAKE2 document.

BLAKE2 also includes the BLAKE2x variants, which can produce digests of arbitrary length. BLAKE2x is specified in a separate document.

BLAKE2 shines on 64-bit CPUs: on an Intel Core i5-6600 (Skylake microarchitecture, 3310MHz), BLAKE2b can process 1 gibibyte per second, or a speed rate of 3.08 cycles per byte. (from http://blake2.net/)

hellyberry commented 7 years ago

I see. So fdupes and jdupes work same in this regard, but are different in the hashing-algorithm used. What do you mean by exclusionary comparison? Is first the file-size checked, then the MD5 (in fdupes)/jodyhash (in jdupes) if a file has same (non-zero) size?

kilobyte commented 7 years ago

MD5 is so broken it allows computational complexity attacks: that is, a malicious user who can give you a bunch of files can force fdupes into comparing a large number of files pair-by-pair. These files will be crafted to have the difference only on the last block.

On the other hand, while it is theoretically possible for a non-broken hash to produce collisions, comparing the files is totally pointless. The definition of a "broken" hash is one where you can produce collisions cheaper than via brute force (ie, the birthday attack). BLAKE2 (512 bit) has birthday length of 256 bits — that is, to have a fair chance to produce one collision, you'd need a computer with temp memory that stores 1 bit on every nucleon in the observable universe; in this case you can expect around 1 collision. Half-length — that is — 256-bit version of BLAKE2 lowers the requirement to about 1 bit per nucleon in the mass of Earth. I'm quite certain there's still a good few years before such storage capabilities will be plausible.

This is insanely less likely than the chance you'll be hit by three different lightning bolts out of clear sky today.

Any currently available computer also suffers from cosmic rays, thermal glitches and so on. Any, no matter how correct, algorithm, will make mistakes because of CPU errors.

Thus, until BLAKE2 is broken, it's irrational to re-check the files. Also, the recent SHA1 break required 5000 core-years to produce a single collision, so there's plenty of warning beforehand.

Yet another reason is that the current most important use case for fdupes, FIDEDUPERANGE, already does the byte-by-byte comparison inside the kernel (to avoid otherwise unavoidable races), thus repeating that work in userspace merely wastes time.

FabioPedretti commented 3 years ago

Blake3 is even better, faster and highly parallelizzable on multi core CPUs: https://github.com/BLAKE3-team/BLAKE3