Closed DonaldTsang closed 4 years ago
A request for other repos that already provide a 128-bit hash to also offer 256-bit versions:
While xxh3 has enough margin to produce a 256-bit variant if need be, it is still a lot of work and a lot of code, with associated creation and maintenance costs.
And a problem is, I haven't found a use case where this capability would make sense.
There are multiple use cases which benefit from having more bits.
One of them is to use the produced hash as a bitmap, from which bit fields are extracted for various usages. Bloom filters are a good example, because they require many bit fields, and can need more than 64 bits in certain scenarios. But more than 128 bits? I've never seen that.
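As an illustration of the bit-field use case, here is a minimal Python sketch that carves several Bloom-filter indices out of one 128-bit digest. It uses md5 only as a stand-in for any fast 128-bit hash; the function name and parameters are hypothetical, not part of any library's API:

```python
import hashlib

def bloom_indices(data: bytes, k: int = 6, m_bits: int = 20) -> list[int]:
    """Carve k indices for a Bloom filter with 2**m_bits slots out of a
    single 128-bit digest (md5 stands in for any fast 128-bit hash)."""
    h = int.from_bytes(hashlib.md5(data).digest(), "little")  # 128-bit integer
    mask = (1 << m_bits) - 1
    indices = []
    for _ in range(k):
        indices.append(h & mask)  # the low m_bits become one index
        h >>= m_bits              # shift to expose the next bit field
    return indices
```

With `k = 6` and `m_bits = 20`, exactly 120 of the 128 bits are consumed; asking for more fields, or for a larger filter, is precisely the point where a wider hash output would start to matter.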
Another case is checksumming: assuming perfect collision properties (which is not always the case, see this recent study), the probability of 2 objects generating the same hash is bounded by the birthday paradox. With a 64-bit hash, the probability of 2 files generating the same hash starts to become non-negligible after several million files. Some large scenarios are able to reach that amount. For a 128-bit hash, getting into the same uncomfortable territory would require many thousands of billions of files! This is generally way above any plausible scenario, and therefore, I have yet to find a single one where it could matter.
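The birthday bound above can be checked with a few lines of Python, using the standard approximation p ≈ 1 − exp(−n² / 2^(b+1)) for n items hashed to b bits (the numbers below are rough orders of magnitude, not exact figures):

```python
from math import expm1

def birthday_collision_prob(n: float, bits: int) -> float:
    """Approximate probability that n uniformly distributed bits-wide
    hashes contain at least one collision: 1 - exp(-n^2 / 2^(bits+1))."""
    return -expm1(-(n * n) / 2 ** (bits + 1))

# 64-bit hash, 10 million files: risk is already non-negligible (~3e-6)
p64 = birthday_collision_prob(10e6, 64)
# 128-bit hash, a thousand billion files: still vanishingly small (~1e-15)
p128 = birthday_collision_prob(1e12, 128)
```

This is why moving from 64 to 128 bits matters for checksumming at scale, while moving from 128 to 256 bits buys nothing in any plausible scenario.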
The only scenario where 256-bit seems useful is cryptographic applications: here, the larger bit space makes it impossible to "break" a hash (hence to intentionally generate a collision) with a brute-force attack. But for that to be true, one needs a cryptographic-quality hash, that is, one where no known formula is able to "improve" the probability of collision compared to brute force. This is paid with a corresponding speed cost. xxHash is a non-cryptographic hash, so it doesn't claim to offer this property. Blake2 is an excellent crypto hash. And while it's branded "fast" (and yes, it is fast, for a cryptographic hash), it's still many times slower than xxHash. Different scenarios, different needs.
And that's it. I don't see any scenario that would actually require a non-cryptographic 256-bit hash.
I've read the thread where you mention that your needs are:

> if the server has 2^23 files (a small site)? what about 2^28 files (a large media company)?
These quantities are suitable for a 128-bit hash. The chances of collisions are way too low to matter in practice (smaller than 1 / 2^64).
@Cyan4973 Bloom filters (for text) and ultra-fast non-secure checksumming are, indeed, some of my intended goals. But the problem is that nobody has ever made a benchmark comparing 128-bit and 256-bit hashes, so there is a need for that.
A 256-bit variant is necessarily going to be slower than a 128-bit one, due to the need to mix bits over a wider set of accumulators.
So this speed cost can only be justified if it brings some advantage.
For checksumming, I don't think there is any justifiable scenario where 256-bit offers any uniqueness advantage over 128-bit. Even if a scenario must deal with a thousand billion unique elements, 128-bit is still good enough.
For generation of "any number of bit fields", one must find a case where more than 128 bits are needed. Even in this case, a quick workaround is to rehash the initial 128-bit value to produce another 128 bits to pick from. It's obviously not the same as a 256-bit hash, and this method would be bad for checksumming, but for the purpose of generating multiple smaller bit fields, it's good enough: any difference will be hard to measure (one could use sha256 or blake2b for a comparison on collision rates).
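The rehash workaround can be sketched in a few lines of Python. Again md5 is only a stand-in for any fast 128-bit hash, and the function name is illustrative:

```python
import hashlib

def extended_bits(data: bytes) -> int:
    """Rehash the first 128-bit digest to obtain a second, dependent 128
    bits, yielding 256 bits to slice bit fields from. Because h2 is fully
    determined by h1, a collision on the first half is automatically a
    collision on all 256 bits, so this is a poor 256-bit checksum."""
    h1 = hashlib.md5(data).digest()   # first 128 bits
    h2 = hashlib.md5(h1).digest()     # rehash -> second 128 bits
    return int.from_bytes(h1 + h2, "little")  # h1 in the low bits
```

The derived half adds no collision resistance, which is exactly why this trick serves the bit-field use case but not checksumming.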
256-bit or not, having benchmarks for the 128-bit variant alongside the 64-bit one would be nice.
FWIW, using processor-level AES-128 might be competitive -- two rounds are sufficient to produce a good hash that is not cryptographic. (I got the idea from here: https://openjdk.java.net/jeps/8201462)
For anyone wanting a 256-bit hash, using processor-intrinsic AES instructions might be faster than the large number of multiplies, shifts, and rotates that would be needed for an accumulator 256 bits wide -- not necessarily a full official AES hash, but a truncated process with fewer rounds that satisfies the mixing and randomness properties but is not cryptographic.
There will be an update of the benchmark, featuring 128-bit variants.
@Cyan4973 List of 128-bit hashes
There are many 64-bit variants in the list.
Note that, if you are interested in a quick speed comparison between any of these variants, you can also do it directly by plugging them into the benchmark program at https://github.com/Cyan4973/xxHash/tree/dev/tests/bench .
@Cyan4973 I am referring to the 128-bit variants in that list; the MRX list is broken, as there are no documents saying which 6 out of 10 have 128-bit variants.
Did you run any tests for SpookyHash (V2)?
@Cyan4973 - There are definitely more applications for hashes with a large number of bits in streaming algorithms. I am using multiple 128-bit hashes for several things :).
@Alain2019 SpookyHash is listed in https://github.com/Cyan4973/xxHash/wiki/Performance-comparison
Thanks a lot. A small correction: SpookyHash is a 128-bit hash, which can also be used as 64-bit or even 32-bit.
Good point.
I've been testing SpookyHash's Hash64() entry point, but indeed, I see now that there is also a Hash128() symbol available.
It seems they should have similar performance, but that would nonetheless deserve a check, just to be sure. Sometimes, the compiler can pull off pretty clever optimizations upon realizing that only one part of the output is effectively used.
This issue has morphed into a request to benchmark a lot of (sometimes experimental) hash algorithms for comparison purposes.
I believe there's a misunderstanding in the purpose of this repository. Its only goal is to provide the xxhash library, and present it in context, hence compared to a few relatively well known algorithms and reference implementations.
It is not to provide a full comparison exercise of every hash algorithm available. For such an objective, I would rather recommend using some dedicated website, one which focuses mostly or solely on this objective, and whose main author preferably doesn't have their own algorithm in the race, for (hopefully) improved neutrality.
@Cyan4973, in that case, would it be possible to start a Telegram channel, a Discord or IRC group, or even a mailing list for such purposes? If so, who should we invite into the community? All ideas are welcomed.
I know that there is an organization already doing it for cryptographic hashes, but we are targeting "fast" hashes. For reference: https://bench.cr.yp.to/results-hash.html
Well, that's a difficult question.
The reference you provide on speed evaluation of cryptographic hashes is excellent. Finding an equivalent source for non-cryptographic hashes would be great, but I don't know of anything close.
@rurban maintains an interesting evaluation of hashes. It concentrates more on quality, which is fair: otherwise it would be too easy to evaluate some fast "hash" of terrible quality. But it also collects a few elements about speed. So it could be a nice starting point.
Let's be clear, it's a lot of work. GitHub may simplify a few important issues, such as creating and maintaining a public web page, but it still requires a lot of effort to create something, and even more to keep it up to date.
List of 256-bit hashes
Main benchmark goal: