Cyan4973 / xxHash

Extremely fast non-cryptographic hash algorithm
http://www.xxhash.com/

General discussion about XXH3 #175

Closed easyaspi314 closed 3 years ago

easyaspi314 commented 5 years ago

This is going to be a tracker for discussion, questions, feedback, and analyses about the new XXH3 hashes, found in the xxh3 branch.

@Cyan4973's comments (from xxhash.h):

XXH3 is a new hash algorithm, featuring vastly improved speed performance for both small and large inputs.

A full speed analysis will be published, it requires a lot more space than this comment can handle.

In general, expect XXH3 to run about ~2x faster on large inputs, and >3x faster on small ones, though the exact difference depends on the platform.

The algorithm is portable and will generate the same hash on all platforms. It benefits greatly from vectorization units, but does not require them.

XXH3 offers 2 variants, _64bits and _128bits. The first 64-bit field of the _128bits variant is the same as the _64bits result. However, if only 64 bits are needed, prefer calling the _64bits variant: it reduces the amount of mixing, resulting in faster speed on small inputs.

The XXH3 algorithm is still considered experimental. It's possible to use it for ephemeral data, but avoid storing long-term values for later re-use. While labelled experimental, the produced result can still change between versions.

The API currently supports one-shot hashing only. The full version will include streaming capability and canonical representation. Long-term optional features may include custom secret keys and secret key generation.

There are still a number of open questions that the community can influence during the experimental period. I'm trying to list a few of them below, though don't consider this list complete.
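For concreteness, here is a minimal sketch of the one-shot API described above. Function names follow the xxh3 branch; the header name and the XXH128_hash_t field names are assumptions (the members were only renamed low64/high64 later in this very thread):

    /* Minimal sketch of the one-shot API (xxh3 branch); names assumed. */
    #include <stdio.h>
    #include <string.h>
    #include "xxh3.h"

    int main(void)
    {
        const char* data = "hello world";
        size_t const len = strlen(data);

        XXH64_hash_t  const h64  = XXH3_64bits(data, len);
        XXH128_hash_t const h128 = XXH3_128bits(data, len);

        /* Per the notes above, one 64-bit half of the 128-bit result
         * equals the 64-bit variant's result. */
        printf("XXH3_64bits : %016llx\n", (unsigned long long)h64);
        printf("XXH3_128bits: %016llx%016llx\n",
               (unsigned long long)h128.high64,
               (unsigned long long)h128.low64);
        return 0;
    }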

easyaspi314 commented 5 years ago

Canonical representation : for the 64-bit variant, canonical representation is the same as XXH64() (aka big-endian).

Nothing is wrong with big-endian output. I suggest that for XXH3_64b it should look like this (big endian):

0123456789abcdef-3  file.c

and XXH3_128b, this:

00112233445566778899aabbccddeeff-3  file.c

I say 2x big endian (two big-endian 64-bit halves). That would be the easiest to parse (in the case of a text file):

fscanf(hashfile, "%16llx%16llx-3", &hashLo, &hashHi);  /* scanf field widths take no '0' flag */

and the easiest to output:

printf("%016llx%016llx-3\n", hashLo, hashHi);

Similar to how my XXH32a's canonical representation ended in -a, I think it makes sense for XXH3 to end in -3.
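A self-contained round-trip of the proposed text format, for illustration. The file name and values are made up, and the argument order follows the snippets above (which 64-bit half comes first was still being debated further down this thread):

    /* Hedged sketch: write and re-read the proposed "-3"-suffixed format. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long hashLo = 0x8899aabbccddeeffULL;
        unsigned long long hashHi = 0x0011223344556677ULL;
        unsigned long long lo, hi;

        FILE* const f = fopen("hashes.txt", "w+");
        if (f == NULL) return 1;

        /* emit, as in the printf above */
        fprintf(f, "%016llx%016llx-3  file.c\n", hashLo, hashHi);
        rewind(f);

        /* parse back; scanf widths take no '0' flag, hence %16llx */
        if (fscanf(f, "%16llx%16llx-3", &lo, &hi) == 2)
            printf("parsed: %016llx %016llx\n", lo, hi);
        fclose(f);
        return 0;
    }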

easyaspi314 commented 5 years ago

/* xxhash.h */

#ifdef __cplusplus
static inline bool operator==(const XXH128_hash_t &a, const XXH128_hash_t &b) { return XXH128_hash_cmp(&a, &b) == 0; }
static inline bool operator!=(const XXH128_hash_t &a, const XXH128_hash_t &b) { return XXH128_hash_cmp(&a, &b) != 0; }
static inline bool operator> (const XXH128_hash_t &a, const XXH128_hash_t &b) { return XXH128_hash_cmp(&a, &b) >  0; }
static inline bool operator< (const XXH128_hash_t &a, const XXH128_hash_t &b) { return XXH128_hash_cmp(&a, &b) <  0; }
static inline bool operator<=(const XXH128_hash_t &a, const XXH128_hash_t &b) { return XXH128_hash_cmp(&a, &b) <= 0; }
static inline bool operator>=(const XXH128_hash_t &a, const XXH128_hash_t &b) { return XXH128_hash_cmp(&a, &b) >= 0; }
#endif /* __cplusplus */

Cyan4973 commented 5 years ago

should we perhaps give a saturated subtraction for sorting

That's a great idea!

pointer or value?

I prefer by value when the structure is of limited size. That's the case here: XXH128_hash_t is just 16 bytes long.

42Bastian commented 5 years ago

Regarding the length == 0 return: I'd go for returning the seed (even if it is 0) instead of always returning 0. If one wants to differentiate the hashes of the same input data by providing a seed, one still gets different hashes even when length == 0.
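A hedged sketch of this suggestion (the function name is hypothetical): derive the empty-input result from the seed rather than returning a constant, so different seeds still yield different hashes for length == 0.

    #include <stdint.h>

    /* Illustrative only: mix the seed through an XXH64-style avalanche so
     * that len == 0 still reflects the seed. The constants are the XXH64
     * primes; this is not the branch's actual code. */
    static uint64_t XXH3_64bits_emptyInput(uint64_t seed)
    {
        uint64_t h = seed + 0x9E3779B185EBCA87ULL;
        h ^= h >> 33;
        h *= 0xC2B2AE3D27D4EB4FULL;
        h ^= h >> 29;
        return h;
    }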

42Bastian commented 5 years ago

Pointer or value: I'd vote for passing the XXH128_hash_t structure by pointer. Passing by value implies copying on 32-bit systems. Even on 64-bit systems, passing two structs might not be done in registers.

BTW: why "ll1" and "ll2"? IMHO they should be suffixed with "h" and "l".

42Bastian commented 5 years ago

Regarding static U64 XXH3_mul128(U64 ll1, U64 ll2): the AArch32 code seems to be over-complicated for returning only the lower 64 bits. GCC produces elegant and, as it looks like, correct code:

    mul   r3, r0, r3
    mla   r3, r2, r1, r3
    umull r0, r1, r0, r2
    add   r1, r3, r1

easyaspi314 commented 5 years ago

It doesn't return the lower 64 bits; it adds the low bits to the high bits. This is pretty much the best way of doing it.

You can see it in the GCC extension code:

    __uint128_t lll = (__uint128_t)ll1 * ll2;
    return (U64)lll + (U64)(lll >> 64);

or if we break it up:

    __uint128_t ll1_u128 = (__uint128_t) ll1; // cast to __uint128_t
    __uint128_t ll2_u128 = (__uint128_t) ll2; // promoted
    __uint128_t lll = ll1_u128 * ll2_u128; // long multiply
    U64 lll_hi = (U64) (lll >> 64); // high bits
    U64 lll_lo = (U64) (lll & 0xFFFFFFFFFFFFFFFFULL); // low bits
    return lll_hi + lll_lo; // 64-bit add together

In x86_64 assembly, we get this:

_mul128:
        mov     rax, rsi
        mul     rdi    # rdx:rax = rax * rdi; rax holds the low bits, rdx the high bits
        add     rax, rdx
        ret

The reason I use inline assembly is that neither GCC nor Clang will emit the right code (they prefer to spew out nonsense), and the __uint128_t type is not available on 32-bit targets.

On 32-bit ARM, the code is quite efficient:

        umull   r12, lr, r0, r2
        mov     r5, #0
        mov     r4, #0
        umaal   r5, lr, r1, r2
        umaal   r5, r4, r0, r3
        umaal   lr, r4, r1, r3
        adds    r0, lr, r12
        adc     r1, r4, r5

TBH, XXH3_mul128 should be renamed to something like XXH3_foldedMult64to128, matching XXH_mult32to64, to make it clear that this is a "folding" multiply and not just a superoptimized 64->128-bit multiply.
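For targets without __uint128_t, the same folded multiply can be written in portable C on 32-bit halves. A hedged sketch using the standard schoolbook decomposition (not necessarily the branch's exact code):

    #include <stdint.h>

    /* 32x32->64 multiply, in the spirit of XXH_mult32to64. */
    static uint64_t mult32to64(uint32_t a, uint32_t b)
    {
        return (uint64_t)a * (uint64_t)b;
    }

    /* 64x64->128 schoolbook multiply, folded: returns low64 + high64. */
    static uint64_t foldedMult64to128_portable(uint64_t lhs, uint64_t rhs)
    {
        uint64_t const lo_lo = mult32to64((uint32_t)lhs,         (uint32_t)rhs);
        uint64_t const hi_lo = mult32to64((uint32_t)(lhs >> 32), (uint32_t)rhs);
        uint64_t const lo_hi = mult32to64((uint32_t)lhs,         (uint32_t)(rhs >> 32));
        uint64_t const hi_hi = mult32to64((uint32_t)(lhs >> 32), (uint32_t)(rhs >> 32));

        /* sum the three partial products landing on bit 32, tracking carry */
        uint64_t const cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;
        uint64_t const upper = (hi_lo >> 32) + (cross >> 32) + hi_hi;
        uint64_t const lower = (cross << 32) | (uint32_t)lo_lo;

        return lower + upper;  /* the fold: add high 64 bits into low 64 bits */
    }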

easyaspi314 commented 5 years ago

I also kind of agree about the pointer thing. On 32-bit, those two structs will take up 8 registers. ARM, for example, can only pass up to 4 registers' worth of arguments in a function call before it has to spill to the stack. And x86 only has 7 usable registers.

easyaspi314 commented 5 years ago

Actually, I benchmarked it via qsort, and it seems like passing by value is in fact better on both 32-bit and 64-bit (for x86; I haven't tested ARM yet):

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t ll1;
    uint64_t ll2;
} XXH128_hash_t;

__attribute__((__noinline__)) int
XXH128_hash_cmp_ptr(const XXH128_hash_t *a, const XXH128_hash_t *b)
{
    if (a->ll2 == b->ll2) {
        if (a->ll1 == b->ll1) {
            return 0;
        } else if (a->ll1 < b->ll1) {
            return -1;
        } else {
             return 1;
         }
    }
    if (a->ll2 < b->ll2) {
        return -1;
     } else {
         return 1;
     }
}

__attribute__((__noinline__)) int
XXH128_hash_cmp(const XXH128_hash_t a, const XXH128_hash_t b)
{
    if (a.ll2 == b.ll2) {
        if (a.ll1 == b.ll1) {
            return 0;
        } else if (a.ll1 < b.ll1) {
            return -1;
        } else {
             return 1;
         }
    }
    if (a.ll2 < b.ll2) {
        return -1;
     } else {
         return 1;
     }
}

/* Note: arr_p is an array of pointers, so qsort hands this function
 * XXH128_hash_t** arguments; treating them as XXH128_hash_t* is a bug,
 * corrected in the follow-up below. */
int XXH128_hash_cmp_ptr_qsort(const void *start, const void *end)
{
    return XXH128_hash_cmp_ptr((const XXH128_hash_t *)start, (const XXH128_hash_t *)end);
}

int XXH128_hash_cmp_qsort(const void *start, const void *end)
{
    return XXH128_hash_cmp(*(const XXH128_hash_t *)start, *(const XXH128_hash_t *)end);
}

#define ROUNDS 1000000
static XXH128_hash_t arr[ROUNDS];
static XXH128_hash_t *arr_p[ROUNDS];

int main()
{
    srand(time(NULL));
    printf("arch: %zu-bit\n", sizeof(void *) * 8);
    for (int i = 0; i < ROUNDS; i++) {
        arr[i].ll1 = rand() | (uint64_t)rand() << 32;
        arr[i].ll2 = rand() | (uint64_t)rand() << 32;
        arr_p[i] = malloc(sizeof(XXH128_hash_t));
        memcpy(arr_p[i], &arr[i], sizeof(XXH128_hash_t));
    }
    double start, end;
    start = (double)clock();
    qsort(arr, ROUNDS, sizeof(XXH128_hash_t), XXH128_hash_cmp_qsort);
    end = (double)clock();
    printf("pass by value: %lf\n", (end - start) / CLOCKS_PER_SEC);

    start = (double)clock();
    qsort(arr_p, ROUNDS, sizeof(XXH128_hash_t *), XXH128_hash_cmp_ptr_qsort);
    end = (double)clock();
    printf("pass by pointer: %lf\n", (end - start) / CLOCKS_PER_SEC);
}
arch: 32-bit
pass by value: 0.352646
pass by pointer: 0.526951

arch: 64-bit
pass by value: 0.198776
pass by pointer: 0.449999

(note: compiled with clang 7.0.1 on a 2.0 GHz 2nd Gen Core i7, -O3)

GCC 8.3:

arch: 32-bit
pass by value: 0.442356
pass by pointer: 0.547503

arch: 64-bit
pass by value: 0.175436
pass by pointer: 0.360158

42Bastian commented 5 years ago

On ARM64 there are enough registers in the ABI to pass it by value (r0..r7), but for 32-bit ARM the pointer version is better, as there are only 4 argument registers. So it highly depends on the CPU and the ABI. Edit: I checked, the x64 GCC ABI uses four registers, so just enough to pass the two structs by value.

easyaspi314 commented 5 years ago

Nevermind that, this is the way to do it:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t ll1;
    uint64_t ll2;
} XXH128_hash_t;

__attribute__((__noinline__)) int
XXH128_hash_cmp_ptr(const XXH128_hash_t *a, const XXH128_hash_t *b)
{
    if (a->ll2 == b->ll2) {
        if (a->ll1 == b->ll1) {
            return 0;
        } else if (a->ll1 < b->ll1) {
            return -1;
        } else {
             return 1;
         }
    }
    if (a->ll2 < b->ll2) {
        return -1;
     } else {
         return 1;
     }
}

__attribute__((__noinline__)) int
XXH128_hash_cmp(const XXH128_hash_t a, const XXH128_hash_t b)
{
    if (a.ll2 == b.ll2) {
        if (a.ll1 == b.ll1) {
            return 0;
        } else if (a.ll1 < b.ll1) {
            return -1;
        } else {
             return 1;
         }
    }
    if (a.ll2 < b.ll2) {
        return -1;
     } else {
         return 1;
     }
}

int XXH128_hash_cmp_ptr_qsort(const void *start, const void *end)
{
    return XXH128_hash_cmp_ptr(*(const XXH128_hash_t **)start, *(const XXH128_hash_t **)end);
}

int XXH128_hash_cmp_qsort(const void *start, const void *end)
{
    return XXH128_hash_cmp(*(const XXH128_hash_t *)start, *(const XXH128_hash_t *)end);
}

#define ROUNDS 10000000
static XXH128_hash_t arr[ROUNDS];
static XXH128_hash_t *arr_p[ROUNDS];
static XXH128_hash_t arr_p_alt[ROUNDS];

int main()
{
    srand(time(NULL));
    printf("arch: %zu-bit\n", sizeof(void *) * 8);
    for (int i = 0; i < ROUNDS; i++) {
        arr[i].ll1 = rand() | (uint64_t)rand() << 32;
        arr[i].ll2 = rand() | (uint64_t)rand() << 32;
        arr_p[i] = malloc(sizeof(XXH128_hash_t));
        memcpy(arr_p[i], &arr[i], sizeof(XXH128_hash_t));
        memcpy(&arr_p_alt[i], &arr[i], sizeof(XXH128_hash_t));
    }
    double start, end;
    start = (double)clock();
    qsort(arr, ROUNDS, sizeof(XXH128_hash_t), XXH128_hash_cmp_qsort);
    end = (double)clock();
    printf("pass by value: %lf\n", (end - start) / CLOCKS_PER_SEC);

    start = (double)clock();
    qsort(arr_p, ROUNDS, sizeof(XXH128_hash_t *), XXH128_hash_cmp_ptr_qsort);
    end = (double)clock();
    printf("pass by pointer: %lf\n", (end - start) / CLOCKS_PER_SEC);

    start = (double)clock();
    qsort(arr_p_alt, ROUNDS, sizeof(XXH128_hash_t), (int (*)(const void *, const void *)) XXH128_hash_cmp_ptr);
    end = (double)clock();
    printf("direct pass to qsort: %lf\n", (end - start) / CLOCKS_PER_SEC);
}

(I increased the count to make it more noticeable.)

arch: 32-bit
pass by value: 4.095442
pass by pointer: 5.924909
direct pass to qsort: 3.096856

arch: 64-bit
pass by value: 2.489569
pass by pointer: 4.165021
direct pass to qsort: 2.178582

Use the pointer version, and to use it with qsort, document it like this:

/* To use it as a comparator for qsort, bsearch, or the like, the recommended way to use this is to
 * cast to int (*)(const void*, const void*). This has the best performance.
 *      qsort(array, size, sizeof(XXH128_hash_t), (int (*)(const void*, const void*)) XXH128_hash_cmp)
 * This assumes an array of XXH128_hash_t values. */

This is because a compare function taking pointers is going to be the most important usage of this, as I presume it will be used with things like qsort and bsearch.

easyaspi314 commented 5 years ago

As for the C++ version, we are going to want to inline the comparisons, as it makes std::sort blazing fast:

static inline bool operator ==(const XXH128_hash_t &a, const XXH128_hash_t &b)
{
    return (a.ll1 == b.ll1) && (a.ll2 == b.ll2);
}
static inline bool operator <(const XXH128_hash_t &a, const XXH128_hash_t &b)
{
    if (a.ll2 == b.ll2) {
        return a.ll1 < b.ll1;
    } else {
        return a.ll2 < b.ll2;
    }
}
static inline bool operator !=(const XXH128_hash_t &a, const XXH128_hash_t &b)
{
    return !(a == b);
}
static inline bool operator >(const XXH128_hash_t& a, const XXH128_hash_t& b)
{
    return b < a;
}
static inline bool operator <=(const XXH128_hash_t& a, const XXH128_hash_t& b)
{
    return !(a > b);
}
static inline bool operator >=(const XXH128_hash_t& a, const XXH128_hash_t& b)
{
    return !(a < b);
}

And by passing by reference (&), we are still effectively passing by pointer; it is just syntactic sugar.

arch: 32-bit
pass by value: 4.056218
pass by pointer: 5.928958
direct pass to qsort: 3.117807
std::sort: 2.620486
std::sort inlined: 2.486056

arch: 64-bit
pass by value: 2.375233
pass by pointer: 4.089312
direct pass to qsort: 2.440806
std::sort: 1.826079
std::sort inlined: 1.583563

Edit: the first std::sort uses this:

static inline bool operator<(const XXH128_hash_t &a, const XXH128_hash_t &b) {
    return XXH128_hash_cmp_ptr(&a, &b) < 0;
}
easyaspi314 commented 5 years ago

Even better: this is all C++11-compatible constexpr, which, while not that important, seems to help optimization a bit.

int
XXH128_hash_cmp_ptr(const XXH128_hash_t *a, const XXH128_hash_t *b)
{
    if (a->ll2 == b->ll2) {
        return (a->ll1 < b->ll1) ? -1 : (a->ll1 > b->ll1);
    }
    if (a->ll2 < b->ll2) {
        return -1;
    } else {
         return 1;
    }
}
static inline constexpr bool
operator ==(const XXH128_hash_t &a, const XXH128_hash_t &b)
{
    return (a.ll1 == b.ll1) && (a.ll2 == b.ll2);
}
static inline constexpr bool
operator<(const XXH128_hash_t &a, const XXH128_hash_t &b)
{
#if (defined(_WIN32) || defined(__LITTLE_ENDIAN__)) && (defined(__SIZEOF_INT128__) || (defined(_INTEGRAL_MAX_BITS) && _INTEGRAL_MAX_BITS >= 128))
    /* convert to uint128_t and compare */
    return (a.ll1 | (static_cast<__uint128_t>(a.ll2) << 64))
         < (b.ll1 | (static_cast<__uint128_t>(b.ll2) << 64));
#else
    return (a.ll2 == b.ll2) ? (a.ll1 < b.ll1) : (a.ll2 < b.ll2);
#endif
}
static inline constexpr bool
operator !=(const XXH128_hash_t &a, const XXH128_hash_t &b)
{
    return !(a == b);
}
static inline constexpr bool
operator >(const XXH128_hash_t& a, const XXH128_hash_t& b)
{
    return b < a;
}
static inline constexpr bool
operator <=(const XXH128_hash_t& a, const XXH128_hash_t& b)
{
    return !(a > b);
}
static inline constexpr bool
operator >=(const XXH128_hash_t& a, const XXH128_hash_t& b)
{
    return !(a < b);
}

It is incredibly fast. Unfortunately, the uint128_t compare for the C-style sorting isn't as good as the other method.

arch: 32-bit
pass by value: 4.012220
direct pass to qsort: 3.027623
std::sort: 2.359985
std::sort by value: 2.335475
std::sort new: 2.096199

arch: 64-bit
pass by value: 2.185738
pass to qsort, u128_t compare: 2.127638
direct pass to qsort: 2.044731
std::sort: 1.629353
std::sort by value: 1.629171
std::sort new: 1.201221

On x86_64 with clang, operator < compiles to this:

__ZltRK13XXH128_hash_tS1_:              ## @operator<(XXH128_hash_t const&, XXH128_hash_t const&)
        mov     rax, qword ptr [rdi]
        mov     rcx, qword ptr [rdi + 8]
        cmp     rax, qword ptr [rsi]
        sbb     rcx, qword ptr [rsi + 8]
        setb    al
        ret

and on i386 it compiles to this:

__ZltRK13XXH128_hash_tS1_:              ## @operator<(XXH128_hash_t const&, XXH128_hash_t const&)
        push    ebp
        push    ebx
        push    edi
        push    esi
        mov     esi, dword ptr [esp + 24]
        mov     ecx, dword ptr [esp + 20]
        mov     eax, dword ptr [ecx + 8]
        mov     edx, dword ptr [ecx + 12]
        mov     ebx, dword ptr [esi + 8]
        mov     edi, dword ptr [esi + 12]
        mov     ebp, edx
        xor     ebp, edi
        mov     esi, eax
        xor     esi, ebx
        or      esi, ebp
        jne     LBB5_2
        mov     eax, dword ptr [ecx]
        mov     ecx, dword ptr [ecx + 4]
        mov     edx, dword ptr [esp + 24]
        cmp     eax, dword ptr [edx]
        sbb     ecx, dword ptr [edx + 4]
        jmp     LBB5_3
LBB5_2:
        cmp     eax, ebx
        sbb     edx, edi
LBB5_3:
        setb    al
        pop     esi
        pop     edi
        pop     ebx
        pop     ebp
        ret

Edit: GCC 8.3

arch: 32-bit
pass by value: 4.906234
direct pass to qsort: 3.260699
std::sort: 2.868732
std::sort by value: 2.862184
pass by pointer: 1.626228

arch: 64-bit
pass by value: 2.009833
pass to qsort, u128_t compare: 2.098546
direct pass to qsort: 1.861579
std::sort: 1.432066
std::sort by value: 1.489022
pass by pointer: 1.113134

Cyan4973 commented 5 years ago

Regarding static U64 XXH3_mul128(U64 ll1, U64 ll2): the AArch32 code seems to be over-complicated for returning only the lower 64 bits.

As mentioned by @easyaspi314, it's a bit more than just returning the lower bits. As the high bits get folded into the low bits, it might be possible to write this operation more compactly and not need a "real" 128-bit multiplication. However, I have not figured out the math so far.

I'm fine with a rewrite of the function if it can lead to better performance. If some gcc assembly can help reach this purpose, it's all fine.

XXH3_mul128 should be renamed to something like XXH3_foldedMult64to128

Good point

Cyan4973 commented 5 years ago

Consequences of implementing a comparator :

In saying that one hash value is "bigger" than the other, XXH128_hash_t becomes more than a bag of bits. It is now necessary to distinguish its components, as they are no longer interchangeable: one must be "high", and the other "low". And now the order matters. Which one is high, which one is low?

Possible answer: keeping the convention that the internal components of XXH128_hash_t are XXH64_hash_t, for which the canonical convention is big-endian, it feels consistent to continue with the same convention, making the first component "high" and the second "low". Thoughts?

Why "ll1" and "ll2", IMHO it should be suffixed by "h" and "l".

Yes, I agree. If we are implementing a comparator function, components must be distinguished as "high" or "low".

42Bastian commented 5 years ago

In keeping the convention that internal components of XXH128_hash_t are XXH64_hash_t, for which the convention is big-endian, it feels consistent to continue using the same convention, making the first component "high", and the second "low". Thoughts ?

I'd go for this. This would be compatible with a natural uint128_t if it exists.

Why "ll1" and "ll2", IMHO it should be suffixed by "h" and "l".

Yes, I agree. If we are implementing a comparator function, components must be distinguished as "high" or "low".

This also applies if you consider the struct as a replacement for uint128_t.

easyaspi314 commented 5 years ago

a natural uint128_t if it exists

// 0x00112233445566778899AABBCCDDEEFF
const __uint128_t x = ((__uint128_t)0x0011223344556677ULL << 64) | 0x8899AABBCCDDEEFFULL;
$ clang -O3 -c test.c
$ hexdump -C test.o
...
00000180  ff ee dd cc bb aa 99 88  77 66 55 44 33 22 11 00  |........wfUD3"..|
...

For the uint128_t 0x00112233445566778899AABBCCDDEEFF, the "natural" in-memory byte order on most (little-endian) machines is ff ee dd cc bb aa 99 88 77 66 55 44 33 22 11 00, as the hexdump shows.

Doing it this way would theoretically allow someone to do this on a 64-bit little-endian machine:

XXH128_hash_t hash = XXH3_128b(...);
__uint128_t val;
memcpy(&val, &hash, sizeof(__uint128_t));

Also if you consider the struct as replacement for uint128_t

It is supposed to be like one, and if the C standard had 128-bit integers, we would definitely use them. But it doesn't, and the only platforms that have __uint128_t are 64-bit.

Cyan4973 commented 5 years ago

Yes, that's indeed the problem at the binary level.

At the canonical level, I believe it makes sense to preserve the "natural" byte order, as if a __uint128_t could be displayed directly on screen. That makes "big endian" a natural choice, and it follows the convention already established by XXH64_hash_t.

But the binary and canonical levels do not have to match. Later on, a pair of conversion functions, such as XXH128_canonicalFromHash(), will ensure that the representation is always correct, whatever the platform.

So let's concentrate on the binary level now. It would seem a matter of selecting which platform is more important, little- or big-endian. And these days, little-endian is pretty much the winner.

That being said, even that does not matter much: __uint128_t is not an official type, and it's doubtful that it will become one in the foreseeable future. So binary compatibility (which would be limited to little-endian platforms) seems of doubtful usefulness.

Consistency is also a topic. { hi, lo } feels like it matches the "natural" order expected by humans; that's how we write numbers! I'm not opposed to { lo, hi }, I just suspect a few people will find it strange to read low before high.

But at the end of the day, both solutions work. As long as low and high are clearly labelled, it should be usable. It's mostly a matter of picking one and sticking to it.

I'm fine with both. Just trying to find a reason to "prefer" one over the other.

Cyan4973 commented 5 years ago

I went ahead and renamed the XXH128_hash_t members to low64 and high64. This will hopefully make it clearer for comparison purposes.

I used the little-endian binary convention, hence low first. I still don't have a strong reason to prefer one over the other, so I went with the easiest one.

I also believe it's not opposed to using big-endian for the canonical representation.
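To illustrate how the two conventions coexist, here is a hedged sketch of the canonical conversion mentioned earlier (XXH128_canonical_t and the helper are assumptions, not the final library code): the in-memory struct stays { low64, high64 }, while the canonical form serializes big-endian, high half first.

    #include <stdint.h>

    typedef struct { uint64_t low64; uint64_t high64; } XXH128_hash_t;
    typedef struct { unsigned char digest[16]; } XXH128_canonical_t;

    static void writeBE64(unsigned char* dst, uint64_t v)
    {
        int i;
        for (i = 0; i < 8; i++)
            dst[i] = (unsigned char)(v >> (56 - 8*i));  /* MSB first */
    }

    static void canonicalFromHash_sketch(XXH128_canonical_t* dst, XXH128_hash_t hash)
    {
        writeBE64(dst->digest,     hash.high64);  /* high half printed first */
        writeBE64(dst->digest + 8, hash.low64);
    }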

Cyan4973 commented 5 years ago

"by value" vs "by reference" :

I would really not feel too concerned about a memcpy(,,16) operation. From a performance perspective, it should be unnoticeable. I have actually read reports that passing by value is better from a cache-locality perspective, as pointers into different memory areas keep more cache lines active, thus degrading system performance.

Moreover, whenever the comparator function can be inlined, it all becomes moot. One can expect the comparison to be performed directly on source data. Btw, maybe it's important to offer an inlinable comparator.

In contrast, I would feel more concerned about branches, which can be costly when they mispredict. Hopefully, in this case, since 64-bit fields are compared, the likelihood of finding hashes with equal high64 fields is very low, making the branch extremely predictable.

If we do not constrain the comparator to return { -1, 0, 1 }, but more generally { <0, ==0, >0 }, we may be able to reduce the number of branches in the comparator.

Cyan4973 commented 5 years ago

In this godbolt example, there is a comparison between a classical "branchy" comparator and a branchless one.

On 64-bit, the branchless variant looks good. It uses fewer instructions than the "branchy" one and avoids a cmov. I would expect it to run faster.

There is also a fully branchless comparator proposed at compare128_1(). It uses the same idea.

However, as mentioned before, it's unlikely that the high 64 bits of two XXH128_hash_t values are identical; if that happens, it probably means the hashed objects were effectively identical. As predictable branches are extremely cheap, adding one to reduce the amount of work is probably a good idea. compare128_2() uses that fact to introduce a single branch, between the high and low comparisons.

Now it's a matter of comparing these implementations...
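For reference, a hedged reconstruction of the two styles being compared (the actual compare128_1()/compare128_2() live in the godbolt link; field names follow the low64/high64 renaming):

    #include <stdint.h>

    typedef struct { uint64_t low64; uint64_t high64; } XXH128_hash_t;

    /* Fully branchless: (a > b) - (a < b) yields { -1, 0, 1 } without jumps. */
    static int cmp128_branchless(const XXH128_hash_t* a, const XXH128_hash_t* b)
    {
        int const hc = (a->high64 > b->high64) - (a->high64 < b->high64);
        int const lc = (a->low64  > b->low64)  - (a->low64  < b->low64);
        return hc ? hc : lc;  /* typically lowered to a cmov */
    }

    /* One highly predictable branch between the high and low comparisons,
     * exploiting the fact that equal high64 fields are rare. */
    static int cmp128_onebranch(const XXH128_hash_t* a, const XXH128_hash_t* b)
    {
        if (a->high64 != b->high64)
            return (a->high64 > b->high64) ? 1 : -1;
        return (a->low64 > b->low64) - (a->low64 < b->low64);
    }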

Cyan4973 commented 5 years ago

I've used @easyaspi314's qsort benchmark script and adapted it to measure differences between comparators. Here is the modified source code: https://gist.github.com/Cyan4973/e1baa1a9a32b690508df76be64b34082

Some conclusions: I couldn't find any significant difference between the comparators. They all end up within noise margin. I guess the comparator is only a small part of the sorting operation itself, so it doesn't matter much.

One indirection vs. two indirections makes a difference, though a moderate one (~10%). Indeed, invoking the byPtr comparator directly is faster than invoking a wrapper taking pointer arguments which itself calls the byValue comparator (with inlining disabled). If that sounds like a win for byPtr, it's only because the test case mandates the use of byPtr; when the byValue function can be inlined into the wrapper, all differences disappear.

So, the byPtr variant is a better fit for qsort, but is that an isolated example ?

I still believe that passing arguments by value feels more natural, and that inlining is actually the most important key to performance. There's an argument, though, for providing a comparator function "ready to use" with qsort. Even if isolated, it's still an important use case by itself.

easyaspi314 commented 5 years ago
typedef union {
  struct {
    uint64_t ll1;
    uint64_t ll2;
  };
  struct {
    /* don't use it directly! */
    size_t __val[(sizeof(uint64_t)*2)/sizeof(size_t)];
  };
} XXH128_hash_t;

int
XXH128_hash_cmp_ptr(const XXH128_hash_t *a, const XXH128_hash_t *b)
{
#if defined(__LITTLE_ENDIAN__) || defined(_WIN32)
    /* Compare word-sized chunks from most to least significant.
     * This only works on little endian, where higher-index words
     * are more significant. */
    size_t i = (sizeof(a->__val)/sizeof(a->__val[0]));
    while (i-- > 0) {
       int const cmp = (a->__val[i] > b->__val[i]) - (a->__val[i] < b->__val[i]);
       if (cmp) {
           return cmp;
       }
    }
    return 0;
#else
    /* normal comparison, don't really care; filled in so the sketch compiles */
    if (a->ll2 != b->ll2) return (a->ll2 > b->ll2) ? 1 : -1;
    if (a->ll1 != b->ll1) return (a->ll1 > b->ll1) ? 1 : -1;
    return 0;
#endif
}

That has the best performance on 32-bit.

XXH128_hash_cmp_ptr_alt:
        mov     rax, qword ptr [rsi+0x8]
        cmp     qword ptr [rdi+0x8], rax
        je      L7
        sbb     eax, eax
        or      eax, 0x1
L1:
        ret
L7:
        mov     rdx, qword ptr [rsi]
        mov     eax, 0x1
        cmp     qword ptr [rdi], rdx
        ja      L1
        sbb     eax, eax
        ret

That clever bit of assembly is the best I've seen on 64-bit, and it is what GCC emits for the naive implementation:

__attribute__((__noinline__)) int
XXH128_hash_cmp_ptr(const XXH128_hash_t *a, const XXH128_hash_t *b)
{
    if (a->high64 == b->high64) {
        if (a->low64 > b->low64) {
            return 1;
        } else if (a->low64 < b->low64) {
            return -1;
        } else {
             return 0;
         }
    }
    if (a->high64 < b->high64) {
        return -1;
     } else {
         return 1;
     }
}

42Bastian commented 5 years ago

As for qsort: I guess the biggest impact comes from swapping the 128-bit values when the sort does not use an array of pointers. So for machines with enough memory to hold thousands of hashes, the advantage of one or the other way of calling the compare function disappears into the noise.

sergeevabc commented 5 years ago

Have you compared it with SeaHash (sources, design)?

ifduyue commented 5 years ago

Will there be createState/update/freeState APIs for XXH3?

Cyan4973 commented 5 years ago

Will there be createState/update/freeState APIs for XXH3?

Yes, that's basically next step.
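The XXH3 streaming entry points were not published yet at this point; presumably they would mirror the existing XXH64 streaming API, sketched here for reference (the XXH64_* calls below are the real ones from xxhash.h):

    #include <stdio.h>
    #include "xxhash.h"

    /* Hash a file incrementally with the existing XXH64 streaming API. */
    static unsigned long long hash_file_streaming(FILE* f)
    {
        XXH64_state_t* const state = XXH64_createState();
        char buf[4096];
        size_t n;
        unsigned long long h;

        XXH64_reset(state, 0 /* seed */);
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
            XXH64_update(state, buf, n);

        h = XXH64_digest(state);
        XXH64_freeState(state);
        return h;
    }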

Cyan4973 commented 5 years ago

Have you compared it with SeaHash

I wasn't aware of this hash, thanks for the link! Its description says that it's a bit faster than XXH64, and that it's aimed at checksumming; hence it's probably not optimized for small inputs. Also, 10% faster than XXH64 is not enough to compete on speed with XXH3 on large inputs.

easyaspi314 commented 5 years ago

https://github.com/Cyan4973/xxHash/issues/180

There is currently a huge bug in the implementation, so some parts need to be redesigned.

Cyan4973 commented 5 years ago

I have to disagree with the mention of "huge bug" here. I think we do not share the same view of the scope of a non-cryptographic hash function.

I maintain that a non-cryptographic hash should not even try to "resist" attacks by adding layers of obfuscation that will just be defeated later on. In the meantime, there will always be a temptation to think "I don't understand what's going on in this algorithm; sure, it's labelled non-cryptographic, but (wink) it seems it's almost as good", then it gets used in places where it should not be, and a bit later we've got an effective attack ready to roll against all these brittle implementations.

A non-cryptographic algorithm should not hide the fact that it's non-cryptographic, not even in its implementation. In the case of XXH3, it's trivial to figure out how to intentionally generate collisions, provided one knows the secret key and the seed. So sure, if one targets the default key and the default seed, it's just plain simple. For other algorithms, it may take a study to get there, but once the method is known, it's only a few clicks away, and infinite collision generation is a matter of a simple script.

The objective of a non-cryptographic hash algorithm is just to quickly generate a bunch of bits, to be used in a hash table, a bloom filter, etc. The changes I'm implementing are aimed at the space-reduction problem, which is far from severe. They will, as a side effect, make it a bit more difficult to generate intentional collisions, but they certainly do not make the algorithm any more secure, and that is not the goal. From a user perspective, the impact of this space-reduction improvement is going to be minimal, if not completely unnoticeable. The way I see it, we are tuning (and hopefully improving) the quality/performance trade-off, which is precisely the point of this test period.

Zhentar commented 5 years ago

I commented on the announcement blog post, but this seems a more appropriate place to put this... I'm attempting a C# port here: https://github.com/Zhentar/xxHash3.NET/

I'm unfortunately finding it unexpectedly difficult just to match the performance of xxHash64 for large keys - even the SSE2 version is only on par with xxHash32

Cyan4973 commented 5 years ago

Unfortunately, C# is not my strong point. So I can't provide straightforward hints.

I wouldn't be surprised if there is a need to use some kind of "unsafe" interface, in order to remove a bunch of automated silent checks, which end up being quite significant in a hot loop. How to do it is outside of my sphere of competence. Maybe looking at the source code of a few other xxHash projects in C# might help.

Zhentar commented 5 years ago

Figured out how to get MSVC to spit out an optimized build of xxHash... maybe it's my hardware (i7-6600U), not my implementation:

XXH32               :    1000000 ->     5928 it/s ( 5653.5 MB/s)
XXH32 unaligned     :    1000000 ->     6058 it/s ( 5777.1 MB/s)
XXH64               :    1000000 ->    11924 it/s (11372.1 MB/s)
XXH64 unaligned     :    1000000 ->    11917 it/s (11365.2 MB/s)
XXH3_64bits         :    1000000 ->     4778 it/s ( 4556.4 MB/s)
XXH3_64b unaligned  :    1000000 ->     4082 it/s ( 3892.5 MB/s)
42Bastian commented 5 years ago

Just my 2 cents: I come from safety, not security, so I am looking for non-cryptographic hashes which at best outperform CRC32 in both speed and collision-freeness. It does not matter if one can generate collisions by knowing the seed or the key, but I want to be sure that a bunch of changed bits will not still lead to a good hash.

sergeevabc commented 5 years ago

Replacing CRC32(C), which is currently used to verify file integrity, is my goal too.

Cyan4973 commented 5 years ago

@Zhentar: In my experience, clang is way better at optimizing vectorized code; gcc is a distant second, generally not bad at SSE2 but definitely worse for AVX2 and NEON; then msvc, which lags by a lot.

I don't have an i7-6600U platform to test on these days. If there is a possibility of remote access, I'll be glad to investigate. In the meantime, I would recommend testing clang. On Windows, the way I would proceed is to install the msys2 + mingw64 subsystem, then use the integrated package manager to install clang.

I will try to post a Windows binary of xxhsum in the release tab. With regards to XXH3, the only thing this Windows binary will offer is the benchmark module. Hopefully, it might be enough to get a sense of the target performance.

Zhentar commented 5 years ago

A known good reference build would be very helpful (even if it's a Linux build, though I would prefer Windows).

Unfortunately, the other C# xxHash ports did not have much to offer me in regards to optimization tips...

Method                   ByteLength             Mean     Throughput
Zhent_xxHash32                   15         10.61 ns   1,348.5 MB/s
xxHashSharp_xxHash32             15         10.96 ns   1,305.1 MB/s
Zhent_xxHash64                   15         13.55 ns   1,056.1 MB/s
NeoSmart_xxHash64                15        114.19 ns     125.3 MB/s
yyproj_xxHash64                  15        199.12 ns      71.8 MB/s
HashFunctions_xxHash64           15        317.77 ns      45.0 MB/s
core20_xxHash64                  15        531.48 ns      26.9 MB/s
Zhent_xxHash64              1000000     82,298.71 ns  11,588.0 MB/s
Zhent_xxHash32              1000000    144,697.68 ns   6,590.8 MB/s
yyproj_xxHash64             1000000    587,337.64 ns   1,623.7 MB/s
HashFunctions_xxHash64      1000000  1,093,433.89 ns     872.2 MB/s
xxHashSharp_xxHash32        1000000  1,118,007.87 ns     853.0 MB/s
NeoSmart_xxHash64           1000000  2,135,509.68 ns     446.6 MB/s
core20_xxHash64             1000000  5,708,470.65 ns     167.1 MB/s

(Note that this is a different machine than in my previous post; I don't think my xxHash32 is actually meaningfully faster than yours.)

Cyan4973 commented 5 years ago

I added 2 Windows binaries to the release tab, one gcc + SSE2, and one clang + AVX2.

XXH3 is accessible through the benchmark module (-b); it will only provide speed numbers, not the hash values themselves.

Zhentar commented 5 years ago

Thanks! That definitely helps. My initial impression is that clang's speed (at least with AVX2) comes from completely unrolling & inlining the block stripe accumulation loops, so that all of the key indexing can use immediate offsets.

I wonder, how does reducing the keyset size to 32 influence things for you? 8 stripes per block instead of 16 seems like it could be more friendly to less aggressive compilers.

Cyan4973 commented 5 years ago

I tried this suggestion of reducing the key size to reduce the loop size. In my experience it made gcc-8 slightly better when using -march=native (but not -mavx2), but clang became worse. As a side effect, both compilers ended up with similar performance, though clang was still leading by a small margin.

So it's not entirely win-win nor lose-lose, but since it reduces top performance by ~7%, I guess the current key size and associated loop size still seem preferable.

clang's speed (at least with AVX2) comes from completely unrolling & inlining the block stripe accumulation loops, so that all of the key indexing can use immediate offsets.

Did you check that in the assembly?

Zhentar commented 5 years ago

Yep. It's pretty impressive - all of XXH3_hashLong has just two dozen scalar instructions, with only three in the block loop.

But I managed to get my loop pretty close, with just six!

Method       ByteLength      Mean     Throughput
XxHash3AVX2      524288  16.25 us  30,772.7 MB/s

(clang's: XXH3_64b unaligned : 102400 -> 385931 it/s (37688.5 MB/s))

I think I know how to shrink that gap a touch, but it's going to make some already hideous code look even worse 😩

easyaspi314 commented 5 years ago
Build FAILED.

"/Users/user/xxHash3.NET/xxHash3.NET.sln" (default target) (1) ->
"/Users/user/xxHash3.NET/xxHash3/xxHash3.csproj" (default target) (2) ->
"/Users/user/xxHash3.NET/xxHash3/xxHash3.csproj" (Build target) (2:2) ->
(CoreCompile target) ->
  xxHash3_SSE2.cs(3,22): error CS0234: The type or namespace name 'Intrinsics' does not exist in the namespace 'System.Runtime' (are you missing an assembly reference?) [/Users/user/xxHash3.NET/xxHash3/xxHash3.csproj]
  xxHash3_SSE2.cs(4,22): error CS0234: The type or namespace name 'Intrinsics' does not exist in the namespace 'System.Runtime' (are you missing an assembly reference?) [/Users/user/xxHash3.NET/xxHash3/xxHash3.csproj]

Well, TIL, Mono doesn't have System.Runtime.Intrinsics. 😡

Zhentar commented 5 years ago

Hah, yeah, .NET Core 3.0-preview only at present. Did I screw up my #if define checks?

easyaspi314 commented 5 years ago

I had to fiddle with the config files so idk

either way, doesn't look like I'm trying it on my Mac today…

Zhentar commented 5 years ago

.NET Core is cross platform, macOS included 😃 You can grab the 3.0 preview here: https://dotnet.microsoft.com/download/dotnet-core/3.0 and then it should just be dotnet build -c Release to run a CLI build.

easyaspi314 commented 5 years ago
~/xxHash3.NET/xxHash3.Benchmarks/bin/Release/netcoreapp3.0 $ sudo ./xxHash3.Benchmarks
Password:
// Validating benchmarks:
// ***** BenchmarkRunner: Start   *****
// ***** Found 1 benchmark(s) in total *****
// ***** Building 1 exe(s) in Parallel: Start   *****
// ***** Done, took 00:00:00 (0.05 sec)   *****
// Found 1 benchmarks:
//   LongKeyTests.XxHash3AVX2: InProcess(MaxRelativeError=0.05, EnvironmentVariables=COMPlus_TieredCompilation=0, Toolchain=InProcessToolchain) [ByteLength=524288]

// **************************
// Benchmark: LongKeyTests.XxHash3AVX2: InProcess(MaxRelativeError=0.05, EnvironmentVariables=COMPlus_TieredCompilation=0, Toolchain=InProcessToolchain) [ByteLength=524288]
// *** Execute ***
// Launch: 1 / 1

// Benchmark Process Environment Information:
// Runtime=.NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT
// GC=Concurrent Workstation
// Job: InProcess(MaxRelativeError=0.05, EnvironmentVariables=COMPlus_TieredCompilation=0, Toolchain=InProcessToolchain)

OverheadJitting  1: 1 op, 415857.00 ns, 415.8570 us/op

Unhandled Exception: System.InvalidOperationException: Benchmark LongKeyTests.XxHash3AVX2: InProcess(MaxRelativeError=0.05, EnvironmentVariables=COMPlus_TieredCompilation=0, Toolchain=InProcessToolchain) [ByteLength=524288] takes to long to run. Prefer to use out-of-process toolchains for long-running benchmarks.
   at BenchmarkDotNet.Toolchains.InProcess.InProcessExecutor.Execute(ExecuteParameters executeParameters)
   at BenchmarkDotNet.Running.BenchmarkRunnerClean.RunExecute(ILogger logger, BenchmarkCase benchmarkCase, BenchmarkId benchmarkId, IToolchain toolchain, BuildResult buildResult, IResolver resolver, IDiagnoser diagnoser, Boolean& success)
   at BenchmarkDotNet.Running.BenchmarkRunnerClean.Execute(ILogger logger, BenchmarkCase benchmarkCase, BenchmarkId benchmarkId, IToolchain toolchain, BuildResult buildResult, IResolver resolver)
   at BenchmarkDotNet.Running.BenchmarkRunnerClean.RunCore(BenchmarkCase benchmarkCase, BenchmarkId benchmarkId, ILogger logger, IResolver resolver, BuildResult buildResult)
   at BenchmarkDotNet.Running.BenchmarkRunnerClean.Run(BenchmarkRunInfo benchmarkRunInfo, Dictionary`2 buildResults, IResolver resolver, ILogger logger, List`1 artifactsToCleanup, String rootArtifactsFolderPath, StartedClock& runChronometer)
   at BenchmarkDotNet.Running.BenchmarkRunnerClean.Run(BenchmarkRunInfo[] benchmarkRunInfos)
   at BenchmarkDotNet.Running.BenchmarkRunner.RunWithDirtyAssemblyResolveHelper(Type type, IConfig config)
   at BenchmarkDotNet.Running.BenchmarkRunner.Run[T](IConfig config)
   at xxHash3.Program.Main() in /Users/user/xxHash3.NET/xxHash3.Benchmarks/Program.cs:line 31
Abort trap: 6

🤔

I don't have AVX2 (I have Sandy Bridge), but I changed it to UseSse2

It's not like it's doing anything; it's only using about 3% CPU.

Zhentar commented 5 years ago

Managed to close the gap quite a bit! I manually mimicked some of the optimizations clang made, along with realizing that using a 512 KB test buffer on a processor with a 512 KB L2 cache was maybe not without downsides.

Method       ByteLength       Mean      Error     StdDev     Throughput
XxHash3AVX2      102400   2.686 us  0.0337 us  0.0281 us  36,354.0 MB/s
XxHash3AVX2      524288  15.269 us  0.2233 us  0.2089 us  32,745.5 MB/s

I also realized Clang did a bit more than I thought: it completely de-shingled the keys array to get all the key loads aligned, and even pre-computed the key shuffle in the accumulator scramble.

@easyaspi314
Hmmm. The next line you're supposed to see is this:

OverheadJitting  1: 1 op, 314000.00 ns, 314.0000 us/op
WorkloadJitting  1: 1 op, 14012400.00 ns, 14.0124 ms/op

Which means it's freezing while trying to JIT compile the code 😬 I haven't transferred any of my AVX2 learnings to the SSE2 code yet though, so that part isn't so fast yet anyway.

easyaspi314 commented 5 years ago

Clang is pretty clever. Usually. GCC too (although it has its weaknesses).

MSVC stands true to its name: MSVC Sucks at Vectorizing Code

A lot of the time when I am optimizing SIMD code, I use the compiler output from GCC and Clang targeting different arches and apply their tricks back and forth. So, I get it.

314000.00 ns, 314.0000 us/op

I see what you did there 😆

I'm no C# developer, so I don't know how much help I can be.

easyaspi314 commented 5 years ago

Any news?

Cyan4973 commented 5 years ago

I'm completing a 64-bit collision analyzer. It's able to generate and detect collisions using enough hashes to provide meaningful results, and therefore to judge the quality of any 64-bit hash algorithm. The default test generates 24 billion hashes, and needs at least 64 GB of RAM to run properly.

The tool will make it possible to check that XXH3_64bits() has indeed the collision resistance of an "ideal" 64-bit hash algorithm.

It will also be used later in the refactoring of XXH3_128bits(), to prove that it has better collision resistance than the 64-bit variant.

Unfortunately, it will not be able to directly measure the collision rate for 128 bits hashes, as the amount of resources needed is, well, unreasonable.
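A quick birthday-bound estimate puts these numbers in context. For an ideal $b$-bit hash over $n$ samples, the expected number of collisions is

$$\mathbb{E}[\text{collisions}] \approx \binom{n}{2}\, 2^{-b} \approx \frac{n^2}{2^{b+1}}.$$

With $n = 24 \times 10^9$ and $b = 64$, that gives $(2.4 \times 10^{10})^2 / 2^{65} \approx 15.6$, so the analyzer should observe on the order of 16 collisions if XXH3_64bits behaves ideally. For $b = 128$, even a single expected collision requires $n \approx 2^{64.5} \approx 2.6 \times 10^{19}$ hashes, which is why direct measurement of the 128-bit collision rate is out of reach.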