Cyan4973 / xxHash

Extremely fast non-cryptographic hash algorithm
http://www.xxhash.com/
Other
9.14k stars 777 forks source link

XXH3 Fail Avalanche Tests from demerphq/smhasher #344

Closed gzm55 closed 4 years ago

gzm55 commented 4 years ago

How Reproduce

git clone https://github.com/demerphq/smhasher.git demerphq-smhasher
git clone https://github.com/Cyan4973/xxHash.git
git clone https://github.com/demerphq/BeagleHash.git # optional
cp xxHash/{xxhash.*,xxh3.h} demerphq-smhasher/
cd demerphq-smhasher

# check source version
git log -n 1 | grep 0762887
sha1sum xxh3.h | grep e41cf39a164c491ae86deb85c10051c5826e8f44
sha1sum xxhash.c | grep f1a286236f359bd1f42942025ea9a9067d42ebd7 
sha1sum xxhash.h | grep 14b6b68b53dd6bda4d12db6d304862b04a9447b4

# edit Hash.h main.cpp
~/demerphq-smhasher$ git diff Hashes.h main.cpp
diff --git a/Hashes.h b/Hashes.h
index 0484116..25ac5fa 100644
--- a/Hashes.h
+++ b/Hashes.h
@@ -10,6 +10,7 @@
 #include "MurmurHash3.h"

 #if defined(__x86_64__)
+#define XXH_INLINE_ALL
 #include "xxhash.h"
 #include "metrohash.h"
 #include "cmetrohash.h"
@@ -166,6 +167,10 @@ inline void xxHash64_with_state_test( const void * key, int len, const void *sta
   *(uint64_t*)out = (uint64_t) XXH64(key, (size_t) len, *((unsigned long long *)state));
 }

+inline void xxh3_with_state_test( const void * key, int len, const void *state, void * out ) {
+  *(uint64_t*)out = (uint64_t) XXH3_64bits_withSeed(key, (size_t) len, *((uint64_t *) state));
+}
+
 inline void metrohash64_1_with_state_test ( const void * key, int len, const void *state, void * out ) {
   metrohash64_1((const uint8_t *)key,(uint64_t)len,*((uint64_t*)state),(uint8_t *)out);
 }
diff --git a/main.cpp b/main.cpp
index 7d887a1..9670198 100644
--- a/main.cpp
+++ b/main.cpp
@@ -367,6 +367,9 @@ HashInfo g_hashes[] =
   { "xxHash64", "xxHash, 64-bit",
     64, 64, 64, 0x22B287C9,
     NULL, xxHash64_with_state_test },
+  { "xxh3", "XXH3, 64-bit",
+    64, 64, 64, 0x8710F469,
+    NULL, xxh3_with_state_test },
 #endif
 #if defined(__x86_64__)
   { "jodyhash32", "JodyHash for 32 bit (v5)",

./make_smhasher || : # generate VERSION.h
mkdir -p build
pushd build
cmake ..
make -j4
cp SMHasher ..
popd

# Avalanche Test
./SMHasher --test=Avalanche xxh3
./SMHasher --test=Avalanche xxHash64

Expected Result

All tests OK, xxHash64 passed.

Actual Result

xxh3 fails for len in [0-3] bytes.

easyaspi314 commented 4 years ago

How valid is this test?

Also, is this the latest version? Given that it is giving garbage results for 0-bit keys, this might be an older version where it returns 0 on zero-length.

gzm55 commented 4 years ago

it seems single round of XXH3_avalanche() for length 0-3 is not strong enough.

gzm55 commented 4 years ago

length 0~3 inputs go through only a single multiplication, it is not enough. we need another one:

static XXH64_hash_t XXH3_avalanche2(xxh_u64 h64)
{
    h64 *= PRIME64_2;
    return XXH_xorshift64(h64, 32);
}

// in XXH3_len_0to16_64b()

if (len) return XXH3_avalanche2(XXH3_len_1to3_64b(input, len, secret, seed));
return XXH3_avalanche2(XXH3_avalanche(XXH3_avalanche((PRIME64_1 + seed) ^ (XXH_readLE64(secret+56) ^ XXH_readLE64(secret+64)))));

Another way is to avalanche in rrmxmx way like XXH3_len_4to8_64b()

easyaspi314 commented 4 years ago
XXH3_avalanche2(XXH3_avalanche(XXH3_avalanche(...)))

Quelle est cette absurdité? This is more of a hack than a solution…

Thoughts:

It is possible that the 1-3 issue comes from the higher bits lacking entropy, due to only having 32 bits of the secret.

Any seed lacking the higher 32 bits will have the highest 31 bits guaranteed to be zero.

How does replacing it with this line fare?

xxh_u64 const bitflip = (XXH_readLE64(secret) ^ XXH_readLE64(secret+8)) + seed;

On 64-bit, the difference will likely be unnoticeable, and on 32-bit it will be not too bad (2 extra loads and 2 extra xors).

Alternatively, shift left the secret 32 bits perhaps? Would be faster on 32-bit (no adcs) but an extra instruction on 64-bit.

gzm55 commented 4 years ago

8 bytes secrets with single multiplication are still not enough.

Cyan4973 commented 4 years ago

My understanding of the avalanche test is that it's about the consequences of flipping one bit. On average, flipping one bit anywhere in the input is expected to change the 64-bit output by ~32 bits. In such a test, it should not matter what the secret is. It's almost a no-op. The real part is the multiplication, folding, and rotation.

I would like to look a bit more at this test, as I'm a bit surprised that the original SMHasher test does not find any problem at 3 bytes (24-bit) which is part of the test suite. There must be something different in the way it's tested.

easyaspi314 commented 4 years ago

I think this version uses 64-bit seeds.

gzm55 commented 4 years ago

for 1-3 length, using XXH3_mul128_fold64 instead of low64 multiply can pass the tests. ps, secret ^ another_secret is still a third secret, so we can reduce one xor. But this method is prefer 64bit machine.

xxh_u64 const bitflip = (XXH_readLE32(secret+4)) + seed;
xxh_u64 const keyed = (xxh_u64)combined ^ bitflip;
xxh_u64 const mixed = XXH3_mul128_fold64(keyed,PRIME64_1);
return XXH3_avalanche(mixed);

In total, 1-3 length needs at least 2 multiplication.

easyaspi314 commented 4 years ago

secret ^ another_secret is still a third secret

Ah so apparently you didn't read the comment I literally put directly above 1to3 where I explained why we were folding it like that… :unamused:

Constant propagation eliminates that fold on kSecret, while for custom secrets, eliminates + seed.

gzm55 commented 4 years ago

the rrmxmx way woks.

        xxh_u64 const bitflip = (XXH_readLE32(secret) ^ XXH_readLE32(secret+4)) + seed;
        xxh_u64 const keyed = (xxh_u64)combined ^ bitflip;
        xxh_u64 x = XXH_rotl64(keyed, 49) ^ XXH_rotl64(keyed, 24);
        x *= 0x9FB21C651E98DF25ULL;
        x ^= (x >> 35);
        x *= 0x9FB21C651E98DF25ULL;
        return XXH_xorshift64(x, 28);
Cyan4973 commented 4 years ago

Yeah, that sounds logical. I'm sure it can be made to work.

My issue here is that just changing len[1-3] in any minor way means another release cycle.

So I want to be sure that it's actually worth it. And for that, a look at the test itself seems necessary.

gzm55 commented 4 years ago

ok, if it's worth, here is another candidate for short key mixer borrowed from Pelle Evensen's splitmix64, which suits for lenth [0-8] and could be a little faster than rrmxmx.

static inline uint64_t xxh_splitmix64(uint64_t z, uint64_t len) {
       // no need to add a magic
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
        z = ((z ^ (z >> 27)) + len) * 0xBF58476D1CE4E5B9ULL; // use same multiplicator
        return z ^ (z >> 31);
}
easyaspi314 commented 4 years ago

We definitely want to avoid changes to the algorithm if possible.

Basically, we need to...

  1. Come up with the new algorithm
  2. Test it a. Yann needs to reserve a special server at Facebook b. We need to make sure everything else passes. c. Make tweaks if necessary, lather rinse repeat.
  3. Make another release candidate
  4. Wait for feedback
  5. Then we can make a release.

If this test is indeed valid, then it might be worth an adjustment.

The things I am worried about is that the seed behavior is tested as if it were an input; a similar theme in the MomentChi2 test.

The seed changing one bit at a time is a very unrealistic scenario — typically the seed will be a reasonably random number, and it doesn't usually change.

We only adapted for MomentChi2 because it was a phony test that lowered XXH3's official SMHasher score, and many people judge hashes by that number alone. (It also was because we were already making changes to fix the PerlinNoise test).

easyaspi314 commented 4 years ago

For example, here is how len == 0 reacts with bitflipping in XXH3_64bits_withSeed:

xxh3_64b_seed_bitflip

And here is with len == 3:

xxh3_64b_avalanche_seed_len3

If that were the input changing, I'd be worried, but it isn't.

tommyettinger commented 4 years ago

I believe I have run the same tests from https://github.com/demerphq/smhasher/blob/master/AvalancheTest.cpp as the original reporter. I ran them on the current Git master of xxHash, as of April 1, 2020. The 0-bit, 8-bit, 16-bit and 24-bit key Avalanche tests fail, and those are the only ones, though the 0-bit key test fails on 93.26% of the 4096 bits tested. Using Evensen's rrmxmx is definitely overkill, and replacing the 64-bit avalanche function with Evensen's Moremur64 (or probably MurmurHash3's mixer, which Moremur64 is based on) seems to be just enough to eliminate all test failures. Using Moremur64 also slows down the average hash by several cycles per hash, so if there's a way to improve the quality of small-key hashes without changing the avalanche, that would be optimal.

I agree that this isn't too concerning because the keys here are much smaller than the outputs, so a tremendous variety of outputs can't be expected.

Cyan4973 commented 4 years ago

Thanks for the clear reproduction case. I confirm the test results for len within [0-3].

I'll now look at the test itself to understand what it does. If the issue is worth solving, it will have an impact on the algorithm stabilization release date.

Cyan4973 commented 4 years ago

So, while I'm not yet entirely certain that I fully understand the Avalanche scenario within demerphq's fork of smhasher, it already seems that, in comparison with the "regular" smhasher, this fork expects much more randomization guarantees when ingesting variations from seed.

Regular smhasher is mostly concentrated on input variations, and many tests do not even attempt to modify the seed. Even they do, the seed is reduced to a 32-bit value, showing that it seems of little importance. In contrast demerphq's fork changes function signature in order to handle seed of any length, and make sure to utilize the full width of the seed (clearly visible in test reports).

When it comes to the avalanche scenario, the difference is pretty big as "regular" smhasher doesn't use seed at all for this test. So I guess this is an obvious reason why it remained undetected so far.

Now a legitimate question is : does this scenario matter ?

It's true that I would expect random properties from input variations to be far more important than from seed variations, since usage of the seed is uncommon to begin with, and even when used, it tends to remain a constant. Yet, it doesn't mean that the second scenario doesn't matter at all. I could imagine that some users may rely on such a property, even in association with very small inputs. How important that is, at this stage I have no idea...

Cyan4973 commented 4 years ago

I'm still not completely certain if I'm using the demerphq fork correctly, but following the investigation on XXH3 avalanche results, I wanted to have a look at XXH128.

It seems that the avalanche test detects issues for XXH128 on a wider range of input sizes. Assuming that I'm using the test and integration properly, and since XXH128 passes the regular smhasher avalanche test, I suspect it's also related to avalanche tests based on seed variations.

The wider range of input sizes impacted make this issue look a bit more serious.

edit : false alert, the worse outcome was a consequence of incorrect integration. Now that XXH128 seems properly evaluated by demerphq, and it results in the same avalanche issue as XXH3, limited to len [0-3].

Cyan4973 commented 4 years ago

Improved in v0.7.4