Cyan4973 / xxHash

Extremely fast non-cryptographic hash algorithm
http://www.xxhash.com/

Any known xxhash collision so far? #165

Closed hiqbn closed 5 years ago

hiqbn commented 5 years ago

Any known xxhash collision so far? Need some data for reference.

Cyan4973 commented 5 years ago

xxhash is a non-cryptographic hash algorithm. It must be assumed that collisions can be produced on demand. 32 and 64 bits are way too small anyway to provide any protection, and can be brute-forced.
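To put rough numbers on that (a back-of-the-envelope sketch, not part of the original reply, assuming the hash output behaves like a uniform random function): the birthday bound says a 32-bit hash is expected to collide after only a few tens of thousands of inputs.

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Birthday-bound approximation for a 32-bit hash, assuming uniform output:
     * P(collision among k inputs) ~= 1 - exp(-k*(k-1) / (2 * 2^32)). */
    const double space = 4294967296.0; /* 2^32 possible hash values */
    const unsigned long counts[] = { 1000, 10000, 77163, 200000, 1000000 };
    for (unsigned i = 0; i < sizeof counts / sizeof counts[0]; i++) {
        double k = (double)counts[i];
        double p = 1.0 - exp(-k * (k - 1.0) / (2.0 * space));
        printf("%8lu inputs -> collision probability ~ %.4f\n", counts[i], p);
    }
    return 0;
}

Around 77,000 random inputs already give roughly a 50% chance of an XXH32 collision; for a 64-bit hash the same 50% point sits near 2^32, about five billion inputs.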

hiqbn commented 5 years ago

I would like to know which values produce such a collision. Does anyone have the values? As mentioned, I need data for reference.

ramytarchichy commented 5 years ago

It's quite trivial to find some:

// Add this to your compiler flags to run it faster: -O2 -march=native

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include "xxhash.h"

int main(void)
{
    // Target XXH32 value; every input printed below hashes to it,
    // so any two of them form a collision.
    const unsigned long long t = 1727396350;
    // Assumes a 64-bit unsigned long (LP64), so each input is 8 bytes.
    for (unsigned long s = 0; s < ULONG_MAX; ++s)
    {
        unsigned long long h = XXH32(&s, sizeof s, 0); // Replace XXH32 with XXH64 (and a 64-bit target) for XXH64 collisions
        if (h == t) printf("COLLISION!!!: %lu\n", s);
    }
    return 0;
}

I found a whole bunch of unsigned longs that all return the same XXH32 value of 1727396350; some of them are listed below:

90343688
4387316452
11971290027
16268262791
19557269070
25675257898
27143248113
33261236941
34729227156
40847215984
44136222263
48433195027
54551183855
56019174070
62137162898
65426169177
69723141941

The list goes on for a while... And I found all these in just a couple of minutes.

As for XXH64, it will take significantly more time to find a collision, which is to be expected since it has a 64-bit hash size, double that of XXH32. Not twice the amount of time, mind you, but a lot more; since every extra bit doubles the amount of work required, it will (probably) take somewhere around 2^(64-32) = 2^32 = 4294967296 times longer. That might sound like a lot, but considering I was able to find my first XXH32 collision in mere milliseconds, it might only take me a few years to find a collision on my laptop... without assembly optimization, GPU acceleration, or FPGAs. A determined attacker could easily afford all of the above, plus a bunch more computers to make things much, MUCH quicker. Heck, I could probably do that, if I didn't have better things to do.

The most important part, though, is cryptanalysis: once an attack on this function is found (which should be dead simple for any cryptographer out there), you'll probably be able to generate a collision in under a second on your 5-year-old smartphone, just like what happened to MD5.

Now for the usual security warning: xxHash is NOT a cryptographic hash function. Do NOT use it anywhere security matters. It is made for building hash tables or hash sets, allowing O(1) lookups in memory, just like djb2. Hashes like these are made for one thing only: speed, and a lot of it. When looking up an item in a hash table, the program needs to hash the key as fast as possible so it can jump to the corresponding address. A collision there simply means the program has to iterate over a few items until it finds what it's looking for, which is not a big deal.

Meanwhile, even a hypothetical, academic-grade reduction in the security of a cryptographic hash function is a massive deal, because its whole purpose is to withstand any attempt at generating a collision. That is why SHA-1, despite requiring absolutely massive amounts of computing power to generate a collision, with only one collision found as of writing this (afaik), is still widely considered insecure.

Modern cryptographic hash algorithms must also return hashes at least 128 bits in length, or 256 if you want to protect yourself from quantum computers, heck maybe even 384 if this whitepaper is to be believed (although it is disputed). Note that this does NOT mean that any hash function that returns a 128 bit hash is secure.

ytrezq commented 5 years ago

@OverclockedSanic However, I suppose the amount of data to process for xxhash64 makes things quite different. I mean, in less than a year.

Hash functions are serial; quantum computing is no help except if the content of the data generated to match the hash does not matter.

ramytarchichy commented 5 years ago

@ytrezq Not sure what you meant in the first part, but quantum computing, despite not breaking symmetric primitives like hash functions, definitely does affect them. An attack with Grover's algorithm reduces the cost of a brute-force preimage search from 2^n to 2^(n/2), meaning SHA3-256 in a post-quantum setting offers 128 bits of security against preimages rather than 256; for collisions, the bound drops to roughly 2^(n/3), about 85.33 bits, if the whitepaper above is correct (in which case, you'd need to use at least SHA3-384 to be secure).

Again, this doesn't break hash functions, since all you need to do is increase the output size, unlike RSA and ECDSA, which become useless no matter the key size.

Sauce: https://en.wikipedia.org/wiki/SHA-3#Security_against_quantum_attacks
Grover's algorithm: https://en.wikipedia.org/wiki/Grover%27s_algorithm

sesse commented 4 years ago

The following program demonstrates an XXH64 collision:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <xxhash.h>

int main(void)
{
        const uint8_t str1[] = { 0xd0, 0x08, 0xad, 0x10, 0x02, 0x00, 0x00, 0x00, 0xd0, 0x08, 0xad, 0x10, 0x02, 0x00, 0x00, 0x00 };
        const uint8_t str2[] = { 0xa6, 0x03, 0x26, 0xc5, 0x01, 0x00, 0x00, 0x00, 0xa6, 0x03, 0x26, 0xc5, 0x01, 0x00, 0x00, 0x00 };
        // Both calls print the same 64-bit hash for seed 156211.
        printf("%" PRIu64 "\n", XXH64(str1, sizeof(str1), 156211));
        printf("%" PRIu64 "\n", XXH64(str2, sizeof(str2), 156211));
        return 0;
}

This was done simply by hashing lots of integers starting from zero, storing them all in a huge array, sorting said array and looking for duplicates. I couldn't get it to easily trigger without hashing the number twice, though; it seems there's some resistance to 64-bit collisions with only 64-bit inputs.
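For the curious, here is a scaled-down sketch of that same store/sort/scan approach (mine, not sesse's exact program), run against XXH32 so the table fits comfortably in memory; for XXH64 the same method needs on the order of 2^32 entries and tens of gigabytes of RAM.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <xxhash.h>

#define N (1u << 18)  /* ~262k samples, well past the 32-bit birthday bound */

struct entry { uint32_t hash; uint32_t input; };

static int cmp(const void *a, const void *b)
{
    const struct entry *x = a, *y = b;
    if (x->hash != y->hash) return x->hash < y->hash ? -1 : 1;
    return 0;
}

int main(void)
{
    struct entry *tab = malloc(N * sizeof *tab);
    if (!tab) return 1;

    /* Hash consecutive 32-bit integers and remember (hash, input) pairs. */
    for (uint32_t i = 0; i < N; i++) {
        tab[i].input = i;
        tab[i].hash  = XXH32(&i, sizeof i, 0);
    }

    /* Sort by hash and scan for adjacent duplicates. */
    qsort(tab, N, sizeof *tab, cmp);
    for (uint32_t i = 1; i < N; i++) {
        if (tab[i].hash == tab[i - 1].hash)
            printf("XXH32 collision: inputs %u and %u -> 0x%08x\n",
                   tab[i - 1].input, tab[i].input, tab[i].hash);
    }
    free(tab);
    return 0;
}

Whether a duplicate actually shows up depends on the run, but with ~2^18 samples against a 32-bit hash several colliding pairs are expected.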

easyaspi314 commented 4 years ago

Or you could just use the known XXH64 multicollision:

#define XXH_INLINE_ALL
#include <xxhash.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <inttypes.h>

static void print_buf(const uint8_t *buffer, size_t len)
{
    printf("0x");
    for (size_t i = 0; i < len; i++) {
        printf("%02x", buffer[i]);
    }
}

// https://github.com/Cyan4973/xxHash/issues/54#issuecomment-414061026
int main(void)
{
    uint8_t real[64], collide[64], seedBuf[8];

    // Fill with random integers
    FILE *urandom = fopen("/dev/urandom", "rb");
    if (!urandom)
        return 1;
    if (fread(real,    sizeof(real),    1, urandom) != 1)
        return 1;
    if (fread(seedBuf, sizeof(seedBuf), 1, urandom) != 1)
        return 1;
    fclose(urandom);

    uint64_t seed = XXH_readLE64(seedBuf);

    // Copy the collision code
    memcpy(collide, real, sizeof(collide));

    uint64_t x = XXH64(real,    sizeof(real),    seed);
    printf("XXH64(");
    print_buf(real, 64);
    printf(", 64, %#016"PRIx64") = %#016"PRIx64"\n", seed, x);
    for (size_t i = 0; i < 256; i++) {
        // ((uint64_t *)collide)[i % 4] += 0x0BA79078168D4BAF;
        uint64_t a = XXH_readLE64(&collide[8 * (i % 4)]);
        a += 0x0BA79078168D4BAF;
        XXH_writeLE64(&collide[8 * (i % 4)], a);

        // ((uint64_t *)collide)[4 + i % 4] += 0x9C90005B80000000;
        uint64_t b = XXH_readLE64(&collide[32 + 8 * (i % 4)]);
        b += 0x9C90005B80000000;
        XXH_writeLE64(&collide[32 + 8 * (i % 4)], b);

        uint64_t y = XXH64(collide, sizeof(collide), seed);
        printf("XXH64(");
        print_buf(collide, 64);
        printf(", 64, %#016"PRIx64") = %#016"PRIx64"\n", seed, y);
    }
}

However, XXH64 is bijective for 64-bit inputs, so it is impossible for there to be a collision at that length.

yoeden commented 2 years ago

@ramytarchichy You say it is made for creating hashtables or hashsets, allowing for O(1) lookup in memory, but what if I hash one of my datums and a collision occurs, causing duplication of keys? Isn't it necessary for each hash to be unique even in this scenario?

ramytarchichy commented 2 years ago

@Br4inFreze This happens all the time, and the hash table should be built in a way to handle such scenarios. Usually it's done by putting all values whose keys have the same hash in a list of key-value pairs. That way, the caller hashes the key, finds the list in O(1), then iterates through the list and compares keys to find the correct value.

In a best-case scenario, all lists have only 1 element in them and lookup is constant-time. In a worst-case scenario, all elements have the same hash, get put in the same list, and the time complexity for lookups goes up to O(n). But, as long as everything is designed properly, that is rarely the case, and we can assume O(1) amortized.

TL;DR: you should never assume that the hash function will not create collisions.
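For illustration, a minimal separate-chaining table along the lines described above (a sketch with made-up names, using XXH32 to pick the bucket and a key comparison to resolve collisions):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <xxhash.h>

#define N_BUCKETS 1024  /* illustrative fixed size */

struct node {
    char *key;
    int value;
    struct node *next;
};

static struct node *buckets[N_BUCKETS];

static size_t bucket_of(const char *key)
{
    return XXH32(key, strlen(key), 0) % N_BUCKETS;
}

static void put(const char *key, int value)
{
    size_t b = bucket_of(key);
    /* Update in place if the key is already in this chain. */
    for (struct node *n = buckets[b]; n; n = n->next) {
        if (strcmp(n->key, key) == 0) { n->value = value; return; }
    }
    struct node *n = malloc(sizeof *n);
    n->key = malloc(strlen(key) + 1);
    strcpy(n->key, key);
    n->value = value;
    n->next = buckets[b];
    buckets[b] = n;
}

static int *get(const char *key)
{
    /* Colliding keys land in the same chain; comparing the stored key
     * disambiguates them, exactly as described above. */
    for (struct node *n = buckets[bucket_of(key)]; n; n = n->next) {
        if (strcmp(n->key, key) == 0) return &n->value;
    }
    return NULL;
}

int main(void)
{
    put("apple", 1);
    put("banana", 2);
    int *v = get("apple");
    printf("apple -> %d\n", v ? *v : -1);
    return 0;
}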

amralieg commented 4 months ago

We found problematic behavior of xxhash64 in pyspark, but I guess it uses the same implementation as the C one. The problem is that it is easier to get a hash collision when the input string contains the | character; if the user removes the | from the input string, it works just fine. Here are some examples:

select xxhash64('9462828989|000005|01|0208788308|000005|010') as a
union all select xxhash64('9494469836|000003|01|0240526368|000003|951') as a
union all select xxhash64('9787246960|000001|01|0534194761|000001|954') as a
union all select xxhash64('9508754997|000001|01|0254837498|000001|955') as a

These will return the following:

a
-4901753543805146088
-4901753543805146088
-5701086820182410381
-5701086820182410381
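For anyone who wants to cross-check against the reference C implementation, here is a small sketch; it assumes Spark's xxhash64() hashes the UTF-8 bytes of the string with seed 42 and prints the result as a signed 64-bit integer (both assumptions, not verified here):

#include <stdio.h>
#include <string.h>
#include <inttypes.h>
#include <xxhash.h>

int main(void)
{
    /* The four strings reported above. */
    const char *inputs[] = {
        "9462828989|000005|01|0208788308|000005|010",
        "9494469836|000003|01|0240526368|000003|951",
        "9787246960|000001|01|0534194761|000001|954",
        "9508754997|000001|01|0254837498|000001|955",
    };
    for (int i = 0; i < 4; i++) {
        /* Seed 42 is assumed to match Spark's xxhash64() default. */
        XXH64_hash_t h = XXH64(inputs[i], strlen(inputs[i]), 42);
        /* Spark reports the hash as a signed 64-bit value. */
        printf("%s -> %" PRId64 "\n", inputs[i], (int64_t)h);
    }
    return 0;
}

Matching signed values for the first two and for the last two inputs would confirm the collisions come from XXH64 itself rather than from Spark's wrapper.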