Cyan4973 / xxHash

Extremely fast non-cryptographic hash algorithm
http://www.xxhash.com/
Other
9.02k stars 771 forks source link

New PerlinNoiseAV test in SMHasher #738

Open avaneev opened 2 years ago

avaneev commented 2 years ago

Sorry to write this directly as I'm an "interested person" having my own hash functions, but recently the claim "xxhash passes all SMHasher tests" became imprecise: I've invented a new Perlin Noise test that finds "holes" in several hashes, not just xxhash. Maybe you can fix this as xxhash is too popular to miss that fact. I'm not expecting to gain anything from my post; my komihash hash function is relatively new, so needs a lot more time to gain traction in this very competitive "hash function" field.

Maybe you can also add komihash to your tests, it's probably not as fast as xxhash3, though, but useful as a comparison to the "state-of-the-art" hash functions. (wyhash comparison would be great as well)

https://github.com/avaneev/komihash https://github.com/rurban/smhasher

Cyan4973 commented 2 years ago

Could you point at the Perlin Noise test you refer to ? The only one I could find in rurban's smhasher is : https://github.com/rurban/smhasher/commit/64b7d10ea92701e707ecba5f6b0bacbdb3d98b8a

t-mat commented 2 years ago

I'm not sure but this issue may suggest the following new result of 'PerlinNoiseAV' test:

https://github.com/rurban/smhasher/commit/9ca488454eda4dae9ac795147696135e5cdf7dbe

In these new results, 'Perlin Noise' test is traditional perlin noise test.

https://github.com/rurban/smhasher/blob/9ca488454eda4dae9ac795147696135e5cdf7dbe/doc/xxh128.txt#L816

But 'PerlinNoise' test contains new PerlinNoiseAV (AV variant) test.

https://github.com/rurban/smhasher/blob/9ca488454eda4dae9ac795147696135e5cdf7dbe/doc/xxh128.txt#L859

jemfinch commented 1 year ago

I dug into this and found a few things:

  1. To answer your previous question, the problem is in PerlinNoiseAV: https://github.com/rurban/smhasher/blob/master/KeysetTest.h#L358
    Testing collisions (128-bit) - Expected    0.0, actual  24116 (1711423593602593105674976100352.00x) (24116)
    Testing collisions (high 64-bit) - Expected          0.0, actual  24116 (92776458911.34x) (24116) !!!!!
    Testing collisions (high 32-bit) - Expected       1116.2, actual  25260 (22.63x) (24144) !!!!!
    Testing collisions (high 25-37 bits) - Worst is 37 bits: 24162/34 (692.56x) !!!!!
    Testing collisions (low  64-bit) - Expected          0.0, actual  24116 (92776458911.34x) (24116) !!!!!
    Testing collisions (low  32-bit) - Expected       1116.2, actual  25187 (22.57x) (24071) !!!!!
    Testing collisions (low  25-37 bits) - Worst is 37 bits: 24153/34 (692.30x) !!!!!
  2. That keyset consists entirely of messages with length of 16 through 38 bytes, consisting almost entirely of zeroes, with one nonzero 32-bit value at some location in the message.
  3. The collisions with xxh128 happen entirely on 16-byte messages, so the problem is in XXH3_len_9to16_128b.
  4. Visually, it appears that the collisions happen almost entirely when both the seed and the key are values with very few (2-4) bits set. For instance, here's one of the hash collisions, printed as bitsets:
    hash=11111111111101000100000010101111001000111110101100110000001100001000010010111000001100101010110000111110000101100001111101110011
    seed=00000000000000000000000100010000 key=00000001000000010000000000010000
    seed=00000000000000000000000100010001 key=00000001000000010000000000010001
    seed=00000001000000000000000100010000 key=00000000000000010000000000010000
    seed=00000001000000000000000100010001 key=00000000000000010000000000010001
  5. A simple fix that doesn't seem to impact performance (at least on my my machine) is to improve the seed: seed = XXH3_mul128_fold64(seed, PRIME64_3); With this line added at the beginning of XXH3_len_9to16_128b, PerlinNoiseAV passes:
    Testing AV variant, 128 count with 4 spacing, 4-12:
    Testing collisions (128-bit) - Expected    0.0, actual      0 (0.00x)
    Testing collisions (high 64-bit) - Expected          0.0, actual      0 (0.00x)
    Testing collisions (high 32-bit) - Expected       1116.2, actual   1137 (1.02x) (21)
    Testing collisions (high 25-37 bits) - Worst is 37 bits: 48/34 (1.38x)
    Testing collisions (low  64-bit) - Expected          0.0, actual      0 (0.00x)
    Testing collisions (low  32-bit) - Expected       1116.2, actual   1109 (0.99x) (-7)
    Testing collisions (low  25-37 bits) - Worst is 35 bits: 140/139 (1.00x)
  6. In case any other length-specific functions struggle with seeds that don't have very many bits set, you can lift that XXH3_mul128_fold64(seed, PRIME64_3) all way up to XXH128 without (apparently) impacting performance on small messages or performance on SMHasher (apart from fixing this bug, of course). And, of course, callers can simply do this on their side without changing XXH128 at all.
avaneev commented 1 year ago

Yes, a reasonable fix (wyhash did that as well recently) is to add an additional "input seed" shuffling round.