I'm curious why you decided on blake3 instead of a faster non-cryptographic hash like twox-hash.
Is it to keep the number of collisions (== the number of files whose contents have to be compared) as low as possible?
Have you run any benchmarks comparing blake3 with a faster non-cryptographic hash to see which one scans faster in typical scenarios (e.g. different percentages of duplicates)?
Files that differ but happen to share a hash most likely diverge within their first few bytes, so a byte-by-byte comparison of them would bail out early. Maybe a cheaper hash would be faster overall, even with more collisions, as long as those false positives are rejected quickly?
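
To make the early-termination idea concrete, here is a rough sketch of the kind of comparison I have in mind. It is not taken from this project: `streams_equal` and `read_full` are hypothetical names, the chunk size is arbitrary, and it only uses the standard library. It reads both inputs chunk by chunk and stops at the first chunk that differs, also reporting how many bytes it had to look at per input.

```rust
use std::io::{self, Read};

/// Hypothetical helper: fill `buf` as far as possible.
/// Returns the number of bytes read, which is less than `buf.len()` only at EOF.
fn read_full<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => break, // end of input
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}

/// Hypothetical helper: compare two inputs chunk by chunk.
/// Returns `(equal, bytes_examined)`, where `bytes_examined` counts the bytes
/// read from the longer side up to the point where the answer was known.
fn streams_equal<A: Read, B: Read>(
    mut a: A,
    mut b: B,
    chunk_size: usize,
) -> io::Result<(bool, u64)> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut buf_a = vec![0u8; chunk_size];
    let mut buf_b = vec![0u8; chunk_size];
    let mut examined = 0u64;
    loop {
        let n_a = read_full(&mut a, &mut buf_a)?;
        let n_b = read_full(&mut b, &mut buf_b)?;
        examined += n_a.max(n_b) as u64;
        // A length mismatch or any differing byte ends the comparison here.
        if n_a != n_b || buf_a[..n_a] != buf_b[..n_b] {
            return Ok((false, examined));
        }
        // Both inputs ended at the same point with identical contents.
        if n_a == 0 {
            return Ok((true, examined));
        }
    }
}
```

With something like this, two files that differ in their first chunk cost one chunk of I/O each, while true duplicates still have to be read in full. Whether that beats paying for a stronger hash up front is exactly what I would like to see benchmarked.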