birkenfeld / fddf

Fast data dupe finder
Apache License 2.0
Speed up hashing with twox-hash #26

Open Boscop opened 3 years ago

Boscop commented 3 years ago

I'm curious why you chose blake3 over a faster non-cryptographic hash such as twox-hash. Is it to keep the number of collisions (i.e. the number of files whose contents have to be compared byte by byte) as low as possible? Have you benchmarked blake3 against a faster non-cryptographic hash to see which one scans faster in typical scenarios (e.g. with different percentages of duplicates)? Most files that differ but share a hash probably diverge early on, so their byte-by-byte comparison would terminate early. If those false positives are cheap to rule out, it might be faster overall to accept more collisions in exchange for cheaper hashing.
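
To make the early-termination argument concrete, here is a minimal sketch of a chunked comparison that stops at the first differing chunk. It is not taken from fddf: the names `read_full` and `streams_equal` and the chunk size are hypothetical, and it uses only the standard library. It also reports how many bytes of each stream were examined, so the cost of a false positive can be measured.

```rust
use std::io::{self, Read};

/// Hypothetical helper: fill `buf` as far as possible.
/// Returns fewer bytes than `buf.len()` only at end of input.
fn read_full<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => break, // end of input
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}

/// Hypothetical sketch: compare two streams chunk by chunk and return
/// `(equal, bytes_examined)`, stopping at the first chunk that differs.
fn streams_equal<A: Read, B: Read>(
    mut a: A,
    mut b: B,
    chunk: usize,
) -> io::Result<(bool, usize)> {
    assert!(chunk > 0, "chunk size must be non-zero");
    let mut buf_a = vec![0u8; chunk];
    let mut buf_b = vec![0u8; chunk];
    let mut examined = 0;
    loop {
        let n_a = read_full(&mut a, &mut buf_a)?;
        let n_b = read_full(&mut b, &mut buf_b)?;
        examined += n_a.max(n_b);
        // A length mismatch or differing bytes ends the comparison early.
        if n_a != n_b || buf_a[..n_a] != buf_b[..n_b] {
            return Ok((false, examined));
        }
        // A short (or empty) read means both streams reached their end.
        if n_a < chunk {
            return Ok((true, examined));
        }
    }
}
```

With two 10,000-byte inputs that differ at byte 10 and a 4096-byte chunk, only the first chunk of each is read before the comparison returns `false`. Whether this outweighs the extra collisions of a weaker hash would still have to be benchmarked on real data.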