Knowing the default keys, it is expected to be trivial to reverse engineer the computation and generate multiplications by 0. This is an issue when facing an opponent who intentionally wants to generate collisions (as opposed to a random occurrence). Note that xxHash being non-cryptographic, it's not supposed to offer any protection in this space.
Nonetheless, for such a case, the long-term plan is to allow insertion of custom keys. This will make it noticeably more difficult to determine which exact input value can achieve this impact.
Finally, suggestions for algorithm improvements are highly welcome. It's the purpose of the test period to gather feedback and use it to improve the characteristics of the hash.
Yeaaah, that's too easy… and that is a seed-independent issue.
What is the seed supposed to do then? Usually the idea of a seeded hash function is that you can provide some or all of the guarantees of a cryptographic hash function, provided that you keep the seed secret. But at the very least changing seed should be a way of rerolling collisions.
As for contributing, the way I see it, the world doesn't need more hash functions; unless you can make a function that improves respectably upon the current best offerings, you might as well not bother. I don't see a way of fixing this hash without severely reducing the speed, and what is the point then?
^ Look at City64 and Murmur2/3. They were both completely broken using a similar seed-independent collision and that left them wide open for attacks.
Sure, you can change the magic numbers used for the hash, but that will lead to its own issues and you are basically rewriting the hash function.
I had to rewrite my NEON implementation for this same reason, so I think fixing this isn't too big of a deal.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "xxhash.h"  /* XXH3_64bits_withSeed */

int main()
{
    char testbuffer[9] = { 0 };
    long long int now = time(NULL);
    /* negate the first default key word (0xb8fe6c39) so that
       dataL + key[0] == 0, zeroing one side of the multiplication */
    *(unsigned int*)testbuffer = -0xb8fe6c39;
    uint64_t lasthash = XXH3_64bits_withSeed(testbuffer, 8, now);
    printf("%llu\n", lasthash);
    long long i = 0;
    for (int a = 0; a < 256; a++) {
        testbuffer[4] = a;
        for (int b = 0; b < 256; b++) {
            testbuffer[5] = b;
            for (int c = 0; c < 256; c++) {
                testbuffer[6] = c;
                for (int d = 0; d < 256; d++) {
                    testbuffer[7] = d;
                    for (int e = 0; e < 256; e++) {
                        testbuffer[8] = e;
                        /* every variation must hash identically */
                        if (lasthash != XXH3_64bits_withSeed(testbuffer, 8, now))
                            abort();
                        i++;
                    }
                }
            }
        }
    }
    printf("%lli collisions\n", i);
}
4395661683124655645
1099511627776 collisions
…
I guess an open question is: should a non-cryptographic hash offer a guarantee of zero seed-independent collisions? I thought the definition of non-cryptographic implies there is no such guarantee. As mentioned by @easyaspi314, it's simply a common (and therefore assumed) situation for non-cryptographic hashes.
The seed so far is only intended to change the outcome of the hash function. This is useful when a user wants to quickly append a short piece of information (say, an identifier) to some content, and generate a hash value for the combination of both. An alternative is to append the identifier to the input, but it's not always possible (the input is typically read-only). Another solution is to pass both parts successively to a streaming API, but it loses a ton of performance in the process.
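For illustration, a minimal sketch of that use case, assuming the one-shot seeded API discussed in this thread (the input string and identifier are made up):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include "xxhash.h"  /* or xxh3.h, depending on the checkout */

int main()
{
    const char* input = "some read-only content";  /* hypothetical content */
    uint64_t id = 42;                              /* short identifier to fold in */

    /* one-shot: no concatenation buffer, no streaming round-trip,
       the identifier rides in the seed */
    XXH64_hash_t h = XXH3_64bits_withSeed(input, strlen(input), id);

    printf("%016llx\n", (unsigned long long)h);
    return 0;
}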
Offering a guarantee of no seed-independent collisions is a much higher level of protection than "non-cryptographic" suggests.
For such a need, XXH3 plans to make the key (the list of 32-bit random numbers) customizable. While the key could come from any source, a key generator is also considered; it would use an input of any length to generate a predictable key.
If it's a question of naming, I'm fine with swapping names around: for example, the seed could become a tag, and the key could become a seed.
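Such a generator doesn't exist yet; here is a purely hypothetical sketch of the idea (the function name and XXH3_KEY_WORDS size are placeholders): stretch an arbitrary-length input into the key array by hashing it with an incrementing counter as the seed.

#include <stdint.h>
#include <stddef.h>
#include "xxhash.h"

#define XXH3_KEY_WORDS 48  /* placeholder for the actual key size */

/* hypothetical key generator: derive a predictable key from any input */
static void XXH3_generateKey(uint32_t* key, const void* input, size_t len)
{
    for (size_t i = 0; i < XXH3_KEY_WORDS; i += 2) {
        /* XXH64 is the existing 64-bit xxHash; the counter seed makes
           each pair of key words different */
        uint64_t const h = XXH64(input, len, (uint64_t)i);
        key[i]     = (uint32_t)h;
        key[i + 1] = (uint32_t)(h >> 32);
    }
}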
I guess an open question is: should a non-cryptographic hash offer a guarantee of zero seed-independent collisions?
Nah, that is a pretty basic expectation.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <limits.h>
#include "xxhash.h"

int main() {
    uint64_t testbuffer[2] = {0};
    int64_t now = time(NULL);
    /* negate the first 64 bits of the default key so ll1 == 0 */
    testbuffer[0] = -0x23a44bbeb8fe6c39;
    uint64_t lasthash = XXH3_64bits_withSeed(testbuffer, 16, now);
    uint64_t collisions = 0;
    uint64_t i = ULLONG_MAX;
    while (i-- > 0) {
        testbuffer[1] = i;  /* this word only feeds the zeroed multiply */
        uint64_t newhash = XXH3_64bits_withSeed(testbuffer, 16, now);
        if (newhash != lasthash)
            abort();
        collisions++;
    }
    printf("%llu collisions\n", (unsigned long long)collisions);
}
I can get ULLONG_MAX collisions. :sob:
~/xxh3 $ time ./a.out
18446744073709551615 collisions
real 0m0.010s
user 0m0.002s
sys 0m0.003s
Back to the drawing board, definitely.
Actually, it is so bad that Clang constexpr's the entire thing (GCC can only do it when the 128-bit multiply is replaced with the scalar version):
main: ## @main
## %bb.0:
push rax
xor edi, edi
call time
lea rdi, [rip + L_.str]
mov rsi, -1
xor eax, eax
call printf
xor eax, eax
pop rcx
ret
(I'm using my "clean reference" implementation I was about to upload here to make the algorithm clearer)
Wouldn't this be better?
static XXH64_hash_t
XXH3_len_4to8_64b(uint8_t const* const data, size_t const len, uint32_t const* const key, XXH64_hash_t const seed)
{
    uint64_t acc = PRIME64_1 * ((uint64_t) len + seed);
    uint32_t const dataL = XXH_read32(data);
    uint32_t const dataH = XXH_read32(data + len - 4);
    uint32_t const l1 = dataL + key[0];
    uint32_t const l2 = dataH + key[1];
    /* the raw input is added as well, so a zero product cannot erase it */
    acc += (uint64_t) dataL | ((uint64_t) dataH << 32);
    acc += (uint64_t) l1 * (uint64_t) l2;
    return XXH3_avalanche(acc);
}

static XXH64_hash_t
XXH3_len_9to16_64b(uint8_t const* const data, size_t const len, uint32_t const* const key, XXH64_hash_t const seed)
{
    uint64_t acc = PRIME64_1 * ((uint64_t) len + seed);
    uint64_t const dataL = XXH_read64(data);
    uint64_t const dataH = XXH_read64(data + len - 8);
    uint64_t const ll1 = dataL + XXH3_readKey64(key);
    uint64_t const ll2 = dataH + XXH3_readKey64(key + 2);
    /* same idea: input words are accumulated independently of the product */
    acc += dataL;
    acc += dataH;
    acc += XXH3_foldedMult128(ll1, ll2);
    return XXH3_avalanche(acc);
}
It's basically the same thing as XXH3_accumulate_512, and it only takes a few more instructions.
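For comparison, a sketch of one lane of an XXH3_accumulate_512-style step (scalar form, simplified and written from memory, so treat the details as approximate):

#include <stdint.h>
#include <string.h>

/* one lane of an accumulate-512-style step: keyed 32x32=>64 multiply,
   with the raw input also added so a zero product cannot erase it */
static void accumulate_lane(uint64_t* acc, const uint8_t* data,
                            const uint32_t* key, size_t lane)
{
    uint32_t dL, dH;
    memcpy(&dL, data + 8 * lane,     4);  /* two 32-bit input words */
    memcpy(&dH, data + 8 * lane + 4, 4);
    acc[lane] += (uint64_t)(dL + key[2 * lane]) * (uint64_t)(dH + key[2 * lane + 1]);
    acc[lane] += (uint64_t)dL | ((uint64_t)dH << 32);
}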
I've no problem re-using the formula for long inputs on short inputs too. This will indeed cancel the multiplication-by-zero issue.
I'm more concerned by the kind of confusion of objectives that I discern in this thread. The purpose of a non-cryptographic hash is to ingest any arbitrary input and generate a well-spread result with good randomness properties. It is not to survive unscathed an input which was specifically crafted for the sole purpose of defeating the algorithm. It should come as no surprise that a non-cryptographic hash is not cryptographic.
In the above comments, I'm actually more concerned by the reduction in output space created by a zero multiplication. Even then, this reduction is only detrimental in the 4-8 byte range for the 64-bit hash, and in the 9-16 byte range for the 128-bit hash. After that point, a reduction is necessary by definition, so starting the reduction earlier or later doesn't change the outcome.
That being said, in adopting the formula of long inputs for shorter ones, I could just as well generalize it to any input length >= 4, as a matter of consistency. Only the 1-3 range will be different, though I don't think it matters much (this one can't multiply by zero to begin with).
Right, obviously, a non-cryptographic hash isn't intended to be perfect, but it should attempt to be decent. If more than 0xFFFFFFFFFFFFFFFF collisions can be generated using a for loop, that isn't decent, that is a security vulnerability.
That is why Ruby had to switch hash functions: give its Hash that easily crafted data and boom, denial of service.
It's a good thing we caught it now, tbh.
In branch xxh3, I started modifying the algorithm for short inputs in the following way:
- the seed alters the keys, so that guessing which exact value can trigger a zero-multiplication is now both a function of the key and the seed.
- input is also added into the accumulator, on top of the multiplication results, so that even a multiplication by zero cannot nullify a contributor.

The speed seems slightly slower in the 4-16 length range, but this is within my benchmark noise level, so it's pretty small. I'll need more measurements to ensure accuracy.
I'm concentrating on the _64bits variant for the time being. The _128bits one will need a more complex refactoring, so it will come at a later stage.
xxhsum self-tests are disabled while fiddling with the formula.
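The branch's exact mixing isn't shown here; as a purely hypothetical illustration of the first point, the key words could be combined with the seed before use, so that the input value producing l1 == 0 depends on both:

/* hypothetical sketch, not the branch's actual code: fold the seed
   into each key word before it is added to the input */
static inline uint32_t XXH3_seededKey32(uint32_t const* key, size_t i, uint64_t seed)
{
    /* alternate the seed halves so adjacent words get different material */
    return key[i] + (uint32_t)((i & 1) ? (seed >> 32) : seed);
}

/* in XXH3_len_4to8_64b, l1/l2 would then become:
       uint32_t const l1 = dataL + XXH3_seededKey32(key, 0, seed);
       uint32_t const l2 = dataH + XXH3_seededKey32(key, 1, seed);
   forcing l1 == 0 now requires knowing both the key and the seed */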
- the seed alters the keys, so that guessing which exact value can trigger a zero-multiplication is now both a function of the key and the seed.
You've still got zero multiplications happening, and even without zeroes there are a lot of different number pairs that produce the same result; the way you use multiplications is fundamentally unsound.
- input is also added into the accumulator, on top of the multiplication results, so that even a multiplication by zero cannot nullify a contributor.
That likely just changes the number you have to aim for in order to produce the same effect. If you for instance do:
acc += (in1+c1)*(in2+c2)+in1+in2
You can take in1 out of the equation by setting in2 = -c2-1.
You can take in1 out of the equation by setting in2 = -c2-1
This is not what happens in this case.
We are using 32x32=>64-bit multiplications (or 64x64=>128-bit ones; the reasoning is the same). Hence the addition in2 + c2 is performed modulo 2^32.
in2 (== -c2-1) + c2 = 0xFFFFFFFFU, not -1, which would be 0xFFFFFFFFFFFFFFFFULL. Therefore it cannot cancel out + in1.
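A minimal standalone sketch of the two interpretations (not xxHash code; c1/c2 are arbitrary stand-ins for key words): in flat 64-bit arithmetic in1 cancels out, while with the widening 32x32=>64 multiply it survives.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t const c1 = 0xb8fe6c39, c2 = 0x85ebca77;  /* stand-in key words */

    for (uint64_t in1 = 1; in1 <= 3; in1++) {
        /* flat 64-bit arithmetic: in2 = -c2-1 makes (in2 + c2) == -1,
           so (in1+c1)*(in2+c2) + in1 + in2 == in2 - c1 : in1 cancels */
        uint64_t const in2 = (uint64_t)0 - c2 - 1;
        uint64_t const flat = (in1 + c1) * (in2 + c2) + in1 + in2;

        /* XXH3 style: the adds are 32-bit, only the product is 64-bit.
           (in2_32 + c2) is 0xFFFFFFFF as a 64-bit value, not -1,
           so the product still depends on in1 */
        uint32_t const in2_32 = (uint32_t)0 - (uint32_t)c2 - 1;
        uint64_t const wide = (uint64_t)(uint32_t)(in1 + c1)
                            * (uint64_t)(uint32_t)(in2_32 + (uint32_t)c2)
                            + in1 + in2_32;

        printf("in1=%llu flat=%016llx wide=%016llx\n",
               (unsigned long long)in1, (unsigned long long)flat,
               (unsigned long long)wide);
    }
    return 0;
}

Running it shows the same flat value on every iteration, while wide changes with in1.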
In branch xxh3, I completed the changes mentioned above on the _64bits variant; they are now applied at all key sizes. I still need to update the self-tests in xxhsum to reflect the new formula.
@easyaspi314: on the NEON code path, some changes are needed. These changes are pretty small in terms of number of lines, and neutral for performance.
I just checked, and your old hash function is also susceptible to seed-independent collisions. Here are, for instance, 100 almost-guaranteed collisions:
#include "xxhash.h"
#include "stdint.h"
#include "stdio.h"
int main(int argc, const char** argv) {
uint32_t testbuffer[8];
uint32_t a;
for (a = 0; a < 8; a++) {
testbuffer[a] = 0;
}
uint32_t seed = time(NULL);
for (a = 0; a < 100; a++) {
testbuffer[0] = 3066638151 * a;
testbuffer[4] = -(2654435761 << 13) * 3066638151 * a;
printf("%u\n", XXH32(testbuffer, 32, seed));
}
getchar();
}
This was already known.
There are numerous issues all over the code.
Mainly, it relies on multiplication between non-constant numbers, which is in general a bad idea: in particular, it leads to spots of ignored input data when one of those numbers is 0, but it also just throws away data way too early.
The idea of using unrelated 64-bit accumulators does not fly for a 128-bit hash: if only one of the accumulators has its input changed, random collisions will occur as often as with a 64-bit hash.
The seed is in most cases not applied properly, so it is easy to fabricate generic collisions that work regardless of the seed. It also does nothing to protect the input data: seed and input are easily recoverable from hashes of small inputs.
Here is an example of collisions because of multiplication by 0 (code is untested because the build is broken):