General discussion about XXH3

easyaspi314 commented 5 years ago

This is going to be a tracker for discussion, questions, feedback, and analyses about the new XXH3 hashes, found in the xxh3 branch.

@Cyan4973's comments (from xxhash.h):

XXH3 is a new hash algorithm, featuring vastly improved speed performance for both small and large inputs.

A full speed analysis will be published, it requires a lot more space than this comment can handle.

In general, expect XXH3 to run about ~2x faster on large inputs, and >3x faster on small ones, though exact difference depend on platform.

The algorithm is portable, will generate the same hash on all platforms. It benefits greatly from vectorization units, but does not require it.

XXH3 offers 2 variants, _64bits and _128bits. The first 64-bits field of the _128bits variant is the same as _64bits result. However, if only 64-bits are needed, prefer calling the _64bits variant. It reduces the amount of mixing, resulting in faster speed on small inputs.

The XXH3 algorithm is still considered experimental. It's possible to use it for ephemeral data, but avoid storing long-term values for later re-use. While labelled experimental, the produced result can still change between versions.

The API currently supports one-shot hashing only. The full version will include streaming capability, and canonical representation Long term optional feature may include custom secret keys, and secret key generation.

There are still a number of opened questions that community can influence during the experimental period. I'm trying to list a few of them below, though don't consider this list as complete.

128-bits output type : currently defined as a structure of 2 64-bits fields.
- That's because 128-bit values do not exist in C standard.
- Note that it means that, at byte level, result is not identical depending on endianess.
- However, at field level, they are identical on all platforms.
- The canonical representation will solve the issue of identical byte-level representation across platforms, which is necessary for serialization.
- Would there be a better representation for a 128-bit hash result ?
- Are the names of the inner 64-bit fields important ? Should they be changed ?
Canonical representation : for the 64-bit variant, canonical representation is the same as XXH64() (aka big-endian).
- What should it be for the 128-bit variant ?
- Since it's no longer a scalar value, big-endian representation is no longer an obvious choice.
- One possibility : represent it as the concatenation of two 64-bits canonical representation (aka 2x big-endian)
- Another one : represent it in the same order as natural order in the struct for little-endian platforms.
- Less consistent with existing convention for XXH32/XXH64, but may be more natural for little-endian platforms.
Associated functions for 128-bit hash : simple things, such as checking if 2 hashes are equal, become more difficult with struct.
- Granted, it's not terribly difficult to create a comparator, but it's still a workload.
- Would it be beneficial to declare and define a comparator function for XXH128_hash_t ?
- Are there other operations on XXH128_hash_t which would be desirable ?
Variant compatibility : The first 64-bit field of the _128bits variant is the same as the result of _64bits.
- This is not a compulsory behavior. It just felt that it "wouldn't hurt", and might even help in some (unidentified) cases.
- But it might influence the design of XXH128_hash_t, in ways which may block other possibilities.
- Good idea, bad idea ?
Seed type for 128-bits variant : currently, it's a single 64-bit value, like the 64-bit variant.
- It could be argued that it's more logical to offer a 128-bit seed input parameter for a 128-bit hash.
- Although it's also more difficult to use, since it requires to declare and pass a structure instead of a value.
- It would either replace current choice, or add a new one.
- Farmhash, for example, offers both variants (the 128-bits seed variant is called doubleSeed).
- If both 64-bit and 128-bit seeds are possible, which variant should be called XXH128?
Result for len==0 : Currently, the result of hashing a zero-length input is the seed.
- This mimics the behavior of a crc : in which case, a seed is effectively an accumulator, so it's not updated if input is empty.
- Consequently, by default, when no seed specified, it returns zero. That part seems okay (it used to be a request for XXH32/XXH64).
- But is it still fine to return the seed when the seed is non-zero ?
- Are there use case which would depend on this behavior, or would prefer a mixing of the seed ?

Cyan4973 commented 4 years ago

reroll patch merged

aras-p commented 4 years ago

Google Play Store guidelines require a feature test for ARMv7 NEON support

Depends, I wouldn't say that NEON vs non-NEON dispatcher is "absolutely required". E.g. at Unity we've been requiring NEON support for Android builds since ~2016, so for us the dispatcher there is not necessary.

and as for the AVX2/SSE2 paths, if anyone wants to use the AVX2 path, it is impractical if you ever plan to distribute without a dispatcher

Same here, if someone is using xxHash for some in-house tooling or an application specifically targeted at high-end PC users (e.g. visualization/modeling/scientific etc.), their software might already require AVX2 or similar, in which case the dispatcher is not necessary.

Of course there are classes of users that would need or prefer to have a runtime dispatcher. But it should be optional IMHO.

aras-p commented 4 years ago

XXH3_128bits() is not, and that's my next priority

I guess it's still expected the the hash values of 128 bit will change before the "final" release? e.g. today it still has some discrepancies compared to 64 bit one (e.g. hash of zero-sized buffer is full zeroes, unlike for 64 bit)

Cyan4973 commented 4 years ago

Yes, it will be updated, and follow the new model

easyaspi314 commented 4 years ago

Adapted the implementation to Rust, and made it pretty fast. (this is with Clang 8 and the XXH_VECTOR=0 path, because it isn't fair to compare intrinsics to autovec)

https://gist.github.com/easyaspi314/b01f0f68d2e2aaaacb4f1a345a552dd6

./xxhsum 0.7.0 (64-bits x86_64 + SSE2 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    47738 it/s ( 4661.9 MB/s)
XXH32 unaligned     :     102400 ->    40817 it/s ( 3986.0 MB/s)
XXH64               :     102400 ->    65199 it/s ( 6367.1 MB/s)
XXH64 unaligned     :     102400 ->    62238 it/s ( 6077.9 MB/s)
XXH3_64bits         :     102400 ->   111937 it/s (10931.4 MB/s)
XXH3_64b unaligned  :     102400 ->   105455 it/s (10298.3 MB/s)
XXH3_64b_RS         :     102400 ->    94732 it/s ( 9251.2 MB/s)
XXH3_64b_RS unalign :     102400 ->    90630 it/s ( 8850.6 MB/s)
XXH128              :     102400 ->   104704 it/s (10225.0 MB/s)
XXH128 unaligned    :     102400 ->    89104 it/s ( 8701.6 MB/s)

easyaspi314 commented 4 years ago

My new phone came!

Google Pixel 2 XL 2.46 GHz Qualcomm Snapdragon 835 (aarch64) Clang 8.0.0 Termux

./xxhsum 0.7.0 (64-bits aarch64 little endian), Clang 8.0.0 (tags/RELEASE_800/final), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    31663 it/s ( 3092.0 MB/s)
XXH32 unaligned     :     102400 ->    27540 it/s ( 2689.4 MB/s)
XXH64               :     102400 ->    31554 it/s ( 3081.5 MB/s)
XXH64 unaligned     :     102400 ->    31567 it/s ( 3082.7 MB/s)
XXH3_64bits         :     102400 ->    61226 it/s ( 5979.1 MB/s)
XXH3_64b unaligned  :     102400 ->    57185 it/s ( 5584.5 MB/s)
XXH128              :     102400 ->    61255 it/s ( 5982.0 MB/s)
XXH128 unaligned    :     102400 ->    56957 it/s ( 5562.2 MB/s)

easyaspi314 commented 4 years ago

Was brainstorming some ideas for size optimizations. Right now, even with XXH_REROLL, xxHash is still a rather hefty .o file.

I feel we should have something like XXH_TINY which enables aggressive size optimizations (at the cost of performance).

Use the streaming methods for single shot. This will make code a lot smaller because it doesn't need so many copies.
Define XXH_FORCE_INLINE to simply static and let the compiler decide.

Make the XXH3 entry logic more like this:

static XXH64_hash_t XXH3_64_hashInternal(const void *data, size_t len, const void *secret, size_t secretSize)
{
XXH64_hash_t ret;
if (len == 0) return 0;
if (len <= 16) ret = ...
else if (len <= 128) ret = ...
return XXH3_avalanche(ret);
}
XXH_PUBLIC_API XXH64_hash_t XXH3_64bits(const void *data, size_t len)
{
return XXH3_64_hashInternal(data, len, kSecret, size of(kSecret));
}
XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSeed(const void *data, size_t len, XXH64_hash_t seed)
{
char secret[XXH_DEFAULT_SECRET_SIZE];
XXH3_initKeySeed(secret, seed);
return XXH3_64_hashInternal(data, len, secret, sizeof(secret));
}
...

This also reduces the chance of code duplication memes. As you can see, I moved the avalanche to the end and the zero return early. Not sure whether it makes a diff. There is probably a sweetspot

Define XXH_VECTOR to 0 in tiny mode. Kinda obvious optimization.
Reroll more loops. I am mainly looking at the XXH3_mix16B loops, which have some rather cryptic logic. Compilers are usually good at inlining and unrolling, but they can't always identify how to reroll something.

Try to rearrange the logic to reduce duplication. Example:

XXH_FORCE_INLINE void XXH32_hashLong(U32 v[4], const void *data, size_t len)
{
 do {
       XXH32_round(...);
       XXH32_round(...);
       XXH32_round(...);
       XXH32_round(...);
 } while (...);
}

For states, perhaps have a constant default state and memcpy. MIPS, ARM and aarch64 tend to use twice as many bytes for large integer literals, as they have to piece them together 16 bits at a time:
```
uint64_t x = 0x123456789ABCDEF0;
```
aarch64 code might be:
```
mov  x0, #0xDEF0
movk x0, #0x9ABC, lsl #16
movk x0, #0x5678, lsl #32
movk x0, #0x1234, lsl #48
```

Additionally, I think we need to work more on cleanliness for next release. Some things to clean up:

Cast to BYTE * earlier. This makes things easier to port to other languages which don't play well with void pointers.
I think we should put the original primes in hex. Languages without a native unsigned type need to negate them.
Enforce a coding style. East coast, west const, etc.
Try to reduce nested blocks. TBH, we can probably mix declarations for the 64-bit functions. C89 doesn't support long long, and so we can just write C++/C99 compatible code. Depends on what you want the conformance level to be.

42Bastian commented 4 years ago

My 2cents about size optimization: I don't think it is really needed. At least not for the 64 or 128 bit hashes. What is the target CPU of these? Likely 64 bit systems aren't to memory constraint to need a sized optimized hash. Even most Cortex-M systems have plenty of Flash.

Cyan4973 commented 4 years ago

Additionally, I think we need to work more on cleanliness for next release.

Agreed

Cast to BYTE * earlier.

I disagree. void* is the only "universal" type which makes it possible to feed anything to the function. Even char* or BYTE* must be casted into whenever input memory area carries another type.

This makes things easier to port to other languages which don't play well with void pointers.

I don't understand that part. "other languages" access the reference library through binders, which are restrained to invoke the *.h interface only, so they don't care about the internals of the library, which remains C compiled.

I think we should put the original primes in hex.

Agreed

Enforce a coding style. East coast, west const

Sure, though I don't have a feeling that there is no coding style. There is one, even if it's not specified. And it can certainly be updated / improved.

we can probably mix declarations for the 64-bit functions

I disagree.

I get the idea : long long is not strict ansi c89 anyway, so this part of the code could be written in c99 instead. However, that's not a good enough reason to start having 2 different coding styles depending on the section of the source file. Besides, long long is only "official" since c99, but effectively, a lot of compilers have been c89+, meaning c89 plus a subset of c99 (and their own extensions), among which long long was very well supported, while mix declaration was not. Visual Studio is an excellent example. So these 2 worlds don't overlap.

Now, if we were writing a new library, starting with requiring c99 support could be a reasonable discussion. But xxhash is an existing library which is already widely deployed, hence it's not a good idea to introduce a potential build break for the benefit of reduced nesting.

Finally, this is more subtle, but there is actually some benefit at scoping variables to their most limited nest level : it's not just about opening, it's also about closing. Ending a variable scope is a good way to keep the rest of the code clean, making it easier and safer to evolve / refactor. This is an under-appreciated property.

Cyan4973 commented 4 years ago

In #231, there is a near-final proposal for XXH128(). It's not "completely finished" because NEON and VSX accumulator code paths need to be updated (cc @easyaspi314).

This proposal supports the same level of functionality as XXH3_64bits(), aka it can accept a custom secret of size >= minimal size, and a streaming interface is available, following the same API, and using the same state.

In term of design, we are back to original design of v0.7.0, where 8 lanes of 64 bits are combined at the end to produce a 128 bits hash. In order for each input to impact 128-bit of accumulator, the final operation is performed on the nearby lane, instead of its own lane, as in 64-bit case. This only adds one shuffle operation in the critical loop, but costs nonetheless ~15% performance. I tried other variants, but they came with outsized impact on performance or flexibility.

There are still parts of this design which can be discussed.

For example, the hash tries to be bijective at 128-bits input. This comes at a speed cost for inputs <= 16 bytes, though not a big one. I presumed that this would be a nice property, although I have no strong use case to declare this property a "must".

The 128-bit hash sometimes manages to produce the same value in its low64 field as the 64-bit one when implementation-convenient, but not always. A big question is, is that an important property ? If it is, maybe it should be generalized ? In which case, several more changes should be done, notably on the 64-bit side, in order to ensure they produce the same value. This will come at a (moderate but non null) speed cost. But here too, I'm not sure if this is an "important" enough property, since I have no use case which would depend on it.

Cyan4973 commented 4 years ago

I think we should put the original primes in hex.

Done

Cyan4973 commented 4 years ago

I don't see anything else to add, so I guess it's probably time for a new release ...

easyaspi314 commented 4 years ago

Has VSX been updated?

Also, we should do this for Clang.

#ifndef __has_builtin
#   define __has_builtin(x) 0
#endif 

#if __has_builtin(__builtin_rotateleft32) && __has_builtin(__builtin_rotateleft64)
#  define XXH_rotl32 __builtin_rotateleft32
#  define XXH_rotl64 __builtin_rotateleft64
#else ...

I just found yet another weird Clang bug, which affects literally every Clang version. Long story short, LLVM's instcombine is very likely to convert a multiply -> rotate into two multiplies and a shift->insert.

So for code like this:

U32 mulrot(U32 a) {
    a *= 0x90000007;          // imul eax, edi, 0x90000007
    a = (a << 4) | (a >> 28); // rol eax, 4
    return a;
}

it emits this:

U32 mulrot(U32 a) {
    U32 left = a * 0x90000007;         // imul eax, edi
    U32 right = a * (0x90000007 << 4); // imul ecx, edi, 0x70
    left >>= 28;                       // shr eax, 28
    left |= right;                     // or eax, ecx
    return left;                       // ret
}

It seems to negatively affect XXH32 and XXH64's round functions, and was likely the reason why the finalization stage was so ginormous.

https://godbolt.org/z/kSo7LN

Cyan4973 commented 4 years ago

These are good points @easyaspi314, they can be added for next release.

easyaspi314 commented 4 years ago

Sorry about the silence, I had something come up.

I think you are right, we should be targeting POWER8 instead of POWER9, as POWER9 is only 2 years old. Additionally, instead of __VSX__, we should check for _ARCH_PWR8 and/or __POWER8_VECTOR__. It completely slipped over my head because apparently QEMU doesn't pick up on illegal instructions, and the POWER9 server is really finicky (Why is CentOS 7.6 shipping glibc 2.17?). I might try one of the POWER8 servers.

Anyways.

vec_xxpermdi is the correct name for GCC, but xl calls it vec_permi. Confusing.

I think we need a normal vperm for the byteswap, but I can't get it to not emit garbage preswaps.

However, what is really frustrating is that Clang in ppc64eb is really happy to turn XXH3 into this:

Screen Shot 2019-08-16 at 9 47 27 PM

Hopefully I can have this POWER8 compatible and make big endian code a bit nicer.

I'll try to get everything finalized this weekend.

easyaspi314 commented 4 years ago

OOOOOOH I just discovered vpermxor

Purpose

Applies a permute and exclusive-OR operation on two byte vectors.

Prototype
__vector unsigned char __vpermxor(__vector unsigned char a, __vector unsigned char b, __vector unsigned char mask);
Result

For each i (0 <= i < 16), let indexA be bits 0 - 3 and indexB be bits 4 - 7 of byte element i of mask.

Byte element i of the result is set to the exclusive-OR of byte elements indexA of a and indexB of b.

(note that xlC uses __vpermxor, gcc uses vec_permxor)

So instead of doing this:

    load data_vec
    swap data_vec
    load key_vec
    swap key_vec
    xor key_vec, data_vec

we can do

    load data_vec
    swap data_vec
    load key_vec
    vpermxor data_vec, key_vec, { 0x07, 0x16, 0x25, 0x34, ... }

easyaspi314 commented 4 years ago

I just pushed. I was really disappointed because I realized that the only big endian POWER8 machine was running 32-bit... :(

However, little endian is working on POWER8, and big endian works in qemu-ppc64-static 2.12.0 with -mcpu=power8.

cpu     : POWER8E (raw), altivec supported

./xxhsum-clang 0.7.1 (64-bits ppc64 + POWER8 vector little endian), Clang 7.0.0 (tags/RELEASE_700/final), by Yann Collet
Sample of 100 KB...
XXH32               :     102400 ->    49152 it/s ( 4800.0 MB/s)
XXH32 unaligned     :     102400 ->    43463 it/s ( 4244.4 MB/s)
XXH64               :     102400 ->    98446 it/s ( 9613.8 MB/s)
XXH64 unaligned     :     102400 ->    93068 it/s ( 9088.7 MB/s)
XXH3_64bits         :     102400 ->   101830 it/s ( 9944.3 MB/s)
XXH3_64b unaligned  :     102400 ->    83841 it/s ( 8187.5 MB/s)
XXH128              :     102400 ->    90658 it/s ( 8853.3 MB/s)
XXH128 unaligned    :     102400 ->    82686 it/s ( 8074.8 MB/s)

easyaspi314 commented 4 years ago

I think I found the right manual 128-bit multiply which obsoletes the ugly ARM assembly version and avoids confusing carry tracking.

Ready:

U64 Multiply128(U64 a, U64 b, U64 *high)
{
    U64 lo_lo = (a & 0xFFFFFFFF) * (b & 0xFFFFFFFF);
    U64 hi_lo = (a >> 32)        * (b & 0xFFFFFFFF);
    U64 lo_hi = (a & 0xFFFFFFFF) * (b >> 32);
    U64 hi_hi = (a >> 32)        * (b >> 32);

    /* Add two half products to a full product. This caps each
     * operation at ULLONG_MAX, and encourages ARM compilers to
     * emit the powerful UMAAL instruction which was specifically
     * designed for this purpose.
     * TODO: Better names. */
    U64 cross = (lo_lo >> 32)
              + (hi_lo & 0xFFFFFFFF)
              + lo_hi;
    U64 top = (hi_lo >> 32)
            + (cross >> 32)
            + hi_hi;
    *high = top;
    return (cross << 32) | (lo_lo & 0xFFFFFFFF);
}

clang -O3 -S --target=arm-none-eabi -march=ARMv7-a -fomit-frame-pointer -fno-unwind-tables -fno-asynchronous-unwind-tables

Multiply128:
        push    {r4, r5, r11, lr}
        umull   r12, r5, r2, r0
        umull   lr, r4, r3, r0
        ldr     r0, [sp, #16]
        umaal   lr, r5, r2, r1
        umaal   r4, r5, r3, r1
        mov     r1, lr
        strd    r4, r5, [r0]
        mov     r0, r12
        pop     {r4, r5, r11, pc}

This change saves 2 cycles and 4-8 bytes a pop because there are no more mov instructions.

As for the algorithm, it is literally just grade school cross multiplication. No complex carries, as they all cap at ULLONG_MAX.

  23
  57
----
  21
 140
 150
1000
----
1311

lo_lo = 3 * 7 = 21
hi_lo = 2 * 7 = 14
lo_hi = (3 * 5 = 15)
      + (floor(lo_lo / 10) = 2)
      + (hi_lo % 10 = 4)
      = 21
high = (2 * 5 = 10)
         + (floor(hi_lo / 10) = 1)
         + (floor(lo_hi / 10) = 2)
         = 13
low  = ((lo_lo % 10) = 1)
         + (((lo_hi * 10) % 100) = 10)
         = 11
result = low + (high * 100) = 1311

I have not tested the codegen on other compilers, but Clang clearly sees it, and it is arguably more understandable (once we format it right) than the current mess.

This avoids all carry flag messes on ARM

Additionally, it seems that for a better long multiply on MSVC x86, you can do (U64)(U32)(x). I need to investigate.

easyaspi314 commented 4 years ago

gcc-9 -O3 -S -m32 -mno-sse2

Multiply128:
    push    ebp
    push    edi
    push    esi
    push    ebx
    xor     ebx, ebx
    sub     esp, 12
    mov     esi, dword ptr[esp + 40]
    mov     ebp, dword ptr[esp + 44]
    mov     eax, esi
    mul     dword ptr[esp + 32]
    mov     dword ptr[esp], eax
    mov     eax, esi
    mov     dword ptr[esp + 4], edx
    mul     dword ptr[esp + 36]
    mov     edi, edx
    mov     edx, dword ptr[esp + 4]
    mov     esi, eax
    mov     ecx, esi
    mov     esi, edi
    mov     eax, edx
    xor     edx, edx
    add     ecx, eax
    mov     eax, ebp
    adc     ebx, edx
    mul     dword ptr[32 + esp]
    add     ecx, eax
    mov     eax, ebp
    adc     ebx, edx
    xor     edi, edi
    mul     dword ptr[esp + 36]
    add     esi, eax
    mov     eax, dword ptr[esp + 48]
    adc     edi, edx
    xor     edx, edx
    add     esi, ebx
    adc     edi, edx
    mov     edx, ecx
    mov     dword ptr[eax], esi
    mov     dword ptr[eax + 4], edi
    mov     eax, dword ptr[esp]
    add     esp, 12
    pop     ebx
    pop     esi
    pop     edi
    pop     ebp
    ret

GCC 9 i386's code gen isn't bad as long as it has SSE2 disabled (it still tries to autovectorize).

Cyan4973 commented 4 years ago

That looks great @easyaspi314 !

GCC 9 i386's code gen isn't bad as long as it has SSE2 disabled (it still tries to autovectorize).

Is it possible to disable it at function-level granularity ?

Cyan4973 commented 4 years ago

I have a new ARM chip to test, and I found the results interesting :

System details : Qualcomm aarch64 SM8150, termux Linux 4.14.78-16154136, clang v8.0.1, -O3.

	speed
XXH32	5.2 GB/s
XXH64	3.6 GB/s
XXH3	12 GB/s
XXH128	10.5 GB/s

Even though it's supposed to be a modern aarch64 cpu, its speed is worse for XXH64, presumably due to high usage of 64-bit multiplications.

The good news is that XXH3 works well, thanks to @easyaspi314 's NEON implementation.

easyaspi314 commented 4 years ago

Hmm. Is Clang maybe vectorizing XXH64?

Edit:

Is it possible to disable it at function-level granularity ?

Yes. We already do that.

Cyan4973 commented 4 years ago

Is Clang maybe vectorizing XXH64?

I thought that XXH32 was at risk of auto-vectorization, but XXH64 wasn't, at least not on x64 cpus. But the situation might be different on aarch64 ?

Anyway, I suppose it doesn't auto-vectorize : I benchmarked xxhsum with -O2, which I interpret as "will not auto-vectorize", and the resulting speed was the same. I even tried -O2 -fno-slp-vectorize -disable-loop-vectorization -disable-vectorization -fno-vectorize -fno-tree-vectorize just to be sure, same results.

42Bastian commented 4 years ago

I have a new ARM chip to test, and I found the results interesting :

System details : Qualcomm aarch64 SM8150, termux Linux 4.14.78-16154136, clang v8.0.1, -O3. speed XXH32 5.2 GB/s XXH64 3.6 GB/s XXH3 12 GB/s XXH128 10.5 GB/s

Even though it's supposed to be a modern aarch64 cpu, its speed is worse for XXH64, presumably due to high usage of 64-bit multiplications.

I always use pure memcpy() if in doubt to get the raw memory performance. So maybe the DDRAM is slow.

AArch64 has special load instructions that might speed up things also: "The load-store non-temporal pair instructions provide a hint to the memory system that an access is “non-temporal” or “streaming” and unlikely to be accessed again in the near future so need not be retained in data caches. However depending on the memory type they may permit memory reads to be preloaded and memory writes to be gathered, in order to accelerate bulk memory transfers. "

Cyan4973 commented 4 years ago

Okay, I believe there are enough updates to create a new release, v0.7.2. I'm actually in need of a recent version for an internal project.

This is not the end of the story, XXH3 is still labelled experimental, and we can still bring improvements to it when needed or useful enough.

Cyan4973 commented 4 years ago

ZrHa

I've been studying @svpv's ZrHa kernel, a clever evolution of XXH128's kernel for long inputs, which gets rid of the secret. And I do like it.

ZrHa core insight is the realization that if the secret's main contribution is to maintain entropy (in presence of non-random input), then the accumulator itself can endorse that role assuming it remains random enough. There are some subtle tricks to avoid the situation of "vanishing" accumulators, which are pretty well explained by @svpv in this thread.

From a quality perspective, it seems to work fine : it passes SMHasher's test, @svpv provided an avalanche diagram, and a large-scale collision analysis was recently completed, and it scores great : 306 collisions for 100Bi hashes, right in the "ideal" distribution area.

Removing the secret has many little advantages, which I believe are worth considering :

Reduction in memory requirement and instruction size
- By a few hundred bytes, but that's still significant for a small hash code
Saves a few instructions and some loading right in the hot loop
The design is inherently 128-bit
Simpler code
- No need for a scrambler stage
- No need for a 64-128 variants

The downsides are :

Creates a dependency between rounds : the accumulator must be fully updated before re-use
- The latency is large enough to castrate performance when using only 8 accumulators (see below)
All architecture variants (NEON, VSX, etc.) will have to be re-implemented
Removing the secret also impacts the small inputs algorithms, which were using it as mask
- can probably be replaced with seed and constants
Potential negative speed impact on the scalar code path
- though the kernel is still 32-bit friendly.

By far the biggest issue is the latency. In tests, keeping the current 8-accumulators model (64-bytes), SSE speed reaches 20 GB/s. That's not bad : it's faster than XXH64, though slower than current XXH3. Perhaps more importantly, AVX speed is also 20 GB/s, proving that latency is the limiting factor.

By enlarging accumulators to 128-bytes, speed is unleashed : SSE climbs to 30 GB/s, while AVX reaches 40 GB/s ! Now, that's in the right ballpark, performance becomes similar to current XXH3. (It's actually slightly faster for SSE, and slightly slower for AVX).

We could do more, and go to 256-bytes accumulators. This is by the way the accumulators' size of meowhash, and one of the reasons for its high speed. And indeed, AVX jumps to 60 GB/s ! At this speed, it's faster than anything else, even hashes using specialized hardware (AES, CRC32C, etc.) can't keep up. However, SSE remains stuck at 30 GB/s. Furthermore, performance is degraded for "small" inputs.

Indeed, more accumulators is not free : on top of larger RAM / registers budget, is also makes the final merging stage longer, the impacts of which is more sensible for smaller inputs.

On top of that, this kind of speed is in the "too much" territory, since it's already way faster than RAM, and the variant which matters most is the SSE one, which is also is more indicative of potential performance for other architectures.

So I believe that sticking to 128-bytes is a better trade-off.

Such a kernel change would represent a large refactoring of the code. Even API will be impacted. And as long as it's not "done", there is always a risk that something bad will show up, preventing the change from happening.

But I believe it can be worth it.

For discussion.

easyaspi314 commented 4 years ago

Those shuffles aren't going to be cheap on NEON, though. I posted details in the thread.

Cyan4973 commented 4 years ago

It seems that the _MM_SHUFFLE(0, 1, 2, 3) operation can be emulated with a combination of vext + vrev ?

easyaspi314 commented 4 years ago

Don't forget the multiply setup.

Now we're at 4 shuffles, and we can't take advantage of in-place vzip on v7 anymore without a temp. The shuffling is already an annoying bottleneck on the current algorithm, and we could end up at 4 GB/s (I'd have to test), the current speed of scalar. 🙁

easyaspi314 commented 4 years ago

Yeah this is looking pretty messy.

Needing a DCBA, AC, and a BD all at the same time is really painful for Neon.

pdillinger commented 4 years ago

My feedback regarding XXH3 on empty string (len==0, length==0, zero length, etc. for Ctrl+f):

Zero is a problematic value for some Bloom filter implementations, like this one, this one, and this one if it occurs more often than its "natural" frequency. Since the empty string can obviously occur with higher frequency than 1 / 2^64 (or 1 / 2^32, etc.), that is a problem for some applications.

Any other value, such as 1, would be fine for algorithms like this. For most applications, it should be OK that hash of empty is not dependent on seed--because what matters for Bloom hashing or table lookup is correlation. If all other inputs depend on the seed, there's no correlation problem.

I could, however, imagine wanting to encode entries from multiple "keyspaces" into one Bloom filter, using seed to select a "keyspace". In this case, the hash of empty string would need to depend on seed. But literally returning the seed is problematic, because naive choice of seeds, which should be perfectly fine, leads to naive correlation between empty string hashes in different keyspaces. For example, I might only need 256 different keyspaces and therefore use seeds 0-255.

Performance? I'm having trouble imagining a legitimate concern over the performance of seeded zero-length hashing. ;)

stati64 commented 4 years ago

How about shipping multiple accumulator widths?

Cyan4973 commented 4 years ago

How about shipping multiple accumulator widths?

Not compatible with a simple POD type state for the streaming mode
Save some additional conditions and limitations, running the algorithm on multiple accumulator widths result in multiple different return values. This increases confusion.

easyaspi314 commented 4 years ago

How does this look, @Cyan4973?

/* *********************************************
 * XXH3 is a new hash algorithm, featuring improved performance for both 
 * small and large inputs.
 * See full speed analysis here:
 *
 *    http://fastcompression.blogspot.com/2019/03/presenting-xxh3.html
 *
 * This algorithm aims to make the most out of common hardware, instead of
 * relying on non-standard or difficult to emulate extensions such as
 * CRC32, AES-NI, or ugly shuffling.
 * 
 * The algorithm itself is written in portable C90 (with long long) and runs
 * great as-is, however, it is able to take advantage of common SIMD extensions
 * if these are availble.
 *
 * These SIMD implementations are also designed to be as portable as possible,
 * and the SSE2 and NEON implementations are (almost) guaranteed on all desktop
 * and mobile targets. There are also AVX2 and POWER8 versions, however, these
 * don't have the portability guarantee of the first two.
 *
 * Unlike most modern hashes, this hash function is designed for BOTH 32-bit
 * and 64-bit targets. It can take full advantage of 64-bit bandwidth, however,
 * almost all of the 64-bit arithmetic in the main loop is designed to be
 * reasonably easy to expand on a 32-bit machine.
 *
 * In order to be reasonably fast, the base requirements are a CPU that can
 * run XXH32 with the additional requirement of a 32-bit long multiply.
 *
 * There are no known CPUs which can run XXH32 without XXH3, but ARM CPUs
 * need to either be running in ARM or Thumb-2 mode for UMULL family of
 * instructions. There will be a compiler warning if this is compiled in
 * Thumb-1 mode if ARM instructions are available.
 *
 * On 32-bit scalar targets, XXH3 will usually run at roughly the speed of
 * XXH32, usually slightly faster. Considering that this is a true 64-bit
 * and 128-bit hash function, it is significantly beneficial to use it
 * over XXH32.
 *
 * 64-bit scalar targets usually see performance halfway between XXH32 and
 * XXH64, however, most of these have guaranteed SIMD support.
 *
 * SIMD versions can run ridiculously fast, and you can expect speeds 2-3x
 * faster than XXH32 or XXH64.
 *
 * XXH3 offers a 64-bit and a 128-bit version.
 * If the extra strength is not required, it is recommended to use the
 * 64-bit version due to the higher speed and easier return type 
 * manipulation. 
 *
 * The XXH3 algorithm is still considered experimental.
 * 
 * Produced results can -- and will -- change between versions.
 * Results produced by v0.7.x will NOT be comparable with results from v0.7.y.
 *
 * It's nonetheless possible to use XXH3 for temporary uses like hash
 * tables, however, avoid long-term storage until the algorithm is finalized.
 *
 * .......
 */

Cyan4973 commented 4 years ago

The content looks good @easyaspi314 !

In term of tone, I believe some of these sentences can be refactored to feel more "neutral" while effectively preserving their meaning. But since it's a matter of taste, I'll rather do the modifications directly, this is more efficient that bugging you at each minor difference.

easyaspi314 commented 4 years ago

It's fine lol, I'm used to being bugged. I have siblings. 😛

I'm going to work on commenting the XXH3 subroutines in the meantime.

easyaspi314 commented 4 years ago

Should we officially deprecate XXH32 once XXH3 is finalized?

Considering that XXH3 works on 32-bit, there isn't really any excuse to use XXH32 aside from maybe code size.

Cyan4973 commented 4 years ago

It's too soon for such decision, and there is no hurry.

easyaspi314 commented 4 years ago

Was toying with LLVM's compiler runtime, and it turns out that their 128-bit multiply is much better on x86, but garbage on any RISC CPU.

This is their algorithm demangled:

uint128 __mulddi3(uint64_t a, uint64_t b) {
  uint128 r;
  result.low = (a & 0xffffffff) * (b & 0xffffffff);
  uint64_t temp = result.low >> 32;
  result.low &= 0xffffffff;

  temp += (a >> 32) * (b & 0xffffffff);
  result.low += (temp & 0xffffffff) << 32;
  result.high = temp >> 32;
  temp = result.low >> 32;
  result.low &= 0xffffffff;

  temp += (b >> 32) * (a & 0xffffffff);
  result.low += (temp & 0xffffffff) << 32;
  result.high += temp >> 32;

  result.high += (a >> 32) * (b >> 32);
  return result;
}

Looks more like x86 assembly than C tbh.

But I noticed that it decreased the number of temporaries.

I toyed with the algorithm, and made it use only one temporary, and it compiles to only 39 instructions on i686 with clang, and zero writes to the stack. Meanwhile, Clang still generates umaal on ARMv6.

Still working on readability and haven't tested GCC yet, but this is what I have been testing.

uint128 __mulddi3(uint64_t a, uint64_t b) {
  uint128 result;
  uint64_t cross = (a >> 32) * (b & 0xffffffff);
  result.high = (a >> 32) * (b >> 32);
  result.high += cross >> 32;
  cross &= 0xffffffff;
  cross += (a & 0xffffffff) * (b >> 32);
  result.low = (a & 0xffffffff) * (b & 0xffffffff);
  cross += (result.low >> 32);
  result.high += (cross >> 32);
  result.low &= 0xffffffff;
  result.low += (cross << 32);
  return result;
}

This is the generated ASM with Clang 9.0.1:

__mulddi3:
        push    ebp
        push    ebx
        push    edi
        push    esi
        mov     esi, dword ptr [esp + 36]
        mov     ecx, dword ptr [esp + 28]
        mov     eax, esi
        mul     dword ptr [esp + 24]
        mov     edi, edx
        mov     ebx, eax
        mov     eax, esi
        mul     ecx
        mov     esi, edx
        mov     ebp, eax
        add     ebp, edi
        adc     esi, 0
        mov     eax, dword ptr [esp + 32]
        mul     ecx
        mov     edi, edx
        mov     ecx, eax
        add     ecx, ebx
        adc     edi, 0
        mov     eax, dword ptr [esp + 32]
        mul     dword ptr [esp + 24]
        add     edx, ecx
        adc     edi, 0
        add     edi, ebp
        mov     ecx, dword ptr [esp + 20]
        mov     dword ptr [ecx], eax
        mov     dword ptr [ecx + 4], edx
        mov     dword ptr [ecx + 8], edi
        adc     esi, 0
        mov     dword ptr [ecx + 12], esi
        mov     eax, ecx
        pop     esi
        pop     edi
        pop     ebx
        pop     ebp
        ret     4

This is the original algorithm we currently use

__mulddi3:
        push    ebp
        push    ebx
        push    edi
        push    esi
        push    eax
        mov     ecx, dword ptr [esp + 36]
        mov     ebx, dword ptr [esp + 40]
        mov     esi, dword ptr [esp + 28]
        mov     eax, ecx
        mul     esi
        mov     edi, edx
        mov     dword ptr [esp], eax    # 4-byte Spill
        mov     eax, ecx
        mul     dword ptr [esp + 32]
        mov     ecx, edx
        mov     ebp, eax
        mov     eax, ebx
        mul     esi
        mov     ebx, edx
        mov     esi, eax
        mov     eax, dword ptr [esp + 40]
        mul     dword ptr [esp + 32]
        add     esi, edi
        adc     ebx, 0
        add     esi, ebp
        adc     ebx, 0
        add     eax, ecx
        adc     edx, 0
        add     eax, ebx
        mov     ecx, dword ptr [esp + 24]
        mov     edi, dword ptr [esp]    # 4-byte Reload
        mov     dword ptr [ecx], edi
        mov     dword ptr [ecx + 4], esi
        mov     dword ptr [ecx + 8], eax
        adc     edx, 0
        mov     dword ptr [ecx + 12], edx
        mov     eax, ecx
        add     esp, 4
        pop     esi
        pop     edi
        pop     ebx
        pop     ebp
        ret     4

__mulddi3:
        push    {r4, lr}
        umull   lr, r12, r2, r1
        umull   r2, r4, r2, r0
        umaal   r4, lr, r3, r0
        umaal   r12, lr, r3, r1
        mov     r0, r2
        mov     r1, r4
        mov     r2, r12
        mov     r3, lr
        pop     {r4, pc}

I did note that Clang makes some poor decisions as to which registers it is using on ARM in this version, causing it to push r5 and r11, but that doesn't matter too much when inlined.

Also, the original compiler-rt code exposes a codegen derp:

__mulddi3:
        push    {r4, r5, r6, r7, r11, lr}
        umull   lr, r4, r3, r1
        umull   r5, r6, r2, r1
        umull   r12, r1, r2, r0
        adds    r5, r1, r5
        adcs    r6, lr, r6
        mov     r1, r5
        mov     r2, r6
        adc     lr, r4, #0
        umull   r4, r7, r3, r0  // r4,r7 = (u64)r3 * r0 
        umlal   r1, r2, r3, r0  // r1,r2 += (u64)r3 * r0 <<< but you literally just multiplied r3 and r0..
        adds    r0, r5, r4
        adcs    r0, r6, r7
        adc     r3, lr, #0 // you did all that just to check the carry...?
        mov     r0, r12 // and they don't even use r0
        pop     {r4, r5, r6, r7, r11, lr}
        bx      lr

Cyan4973 commented 4 years ago

I will spend the next sprint at bringing a new release of xxHash together. If there are suggestions that should make it into this release, I'm all ears. I've already planned changes for corner cases with srcSize <= 8.

edit : s/spring/sprint

aras-p commented 4 years ago

I will spend the next spring

You mean spring of 2021?

easyaspi314 commented 4 years ago

You mean spring of 2021?

No, spring of 2430.

easyaspi314 commented 4 years ago

Ok, I finally redid unified NEON.

Currently working on some documentation/cleanup of the internals.

easyaspi314 commented 4 years ago

Toying with this for the short 128-bit hashes to highlight the parallels between the two halves.

I don't know if it helps readability or not.

    {   xxh_u32 input_lo = XXH_readLE32(input);
        xxh_u32 input_hi = XXH_readLE32(input + len - 4);
        xxh_u64 acc_lo = input_lo | ((xxh_u64)input_hi << 32);
        xxh_u64 acc_hi = XXH_swap64(acc_lo);
        acc_lo ^= XXH_readLE64(secret)     + seed;
        acc_hi ^= XXH_readLE64(secret + 8) - seed;

        acc_lo ^= acc_lo >> 51;  acc_hi ^= acc_hi >> 47;
        acc_lo *= PRIME32_1;     acc_hi *= PRIME64_1;
        acc_lo += len;           acc_hi -= len;
        acc_lo ^= acc_lo >> 47;  acc_hi ^= acc_hi >> 43;
        acc_lo *= PRIME64_2;     acc_hi *= PRIME64_4;
        {   XXH128_hash_t const h128 = { XXH3_avalanche(acc_lo) /*low64*/, XXH3_avalanche(acc_hi) /*high64*/ };
            return h128;
    }   }

easyaspi314 commented 4 years ago

@Cyan4973 Is there any specific reason that s390x hasn't yet been merged into dev?

Cyan4973 commented 4 years ago

Do you mean this PR : https://github.com/Cyan4973/xxHash/pull/285 ? It has already been merged.

easyaspi314 commented 4 years ago

Cyan4973 merged 11 commits into Cyan4973:s390x from easyaspi314:s390x

Not into dev.

Cyan4973 commented 4 years ago

Indeed, you are right.

I can't remember why the patch was merged into its own feature branch ? Maybe to run independent tests ?

Anyway, it was probably planned to be merged into dev just after, but this subtlety was lost, and it remained left alone. Trying to merge it now into dev now, it produces a non negligible amount of conflicts.

Onto it.

easyaspi314 commented 4 years ago

Was testing things on different compilers and testing minor tweaks.

MSVC 19.24 for x86 generates SSE2 code that is significantly slower than it is in x64, and checking the assembly shows that it doesn't seem to be unrolling the loop.

xxhsum.exe 0.7.2 (32-bits i386 + SSE2 little endian), MSVC 19.24.28314.00, by Yann Collet
Sample of 100 KB...
XXH3_64b                 :     102400 ->   168817 it/s (16486.1 MB/s)
XXH3_64b unaligned       :     102400 ->   168815 it/s (16485.8 MB/s)
XXH3_64b seeded          :     102400 ->   159139 it/s (15540.9 MB/s)
XXH3_64b seeded unaligne :     102400 ->   159025 it/s (15529.8 MB/s)
XXH128                   :     102400 ->   167321 it/s (16339.9 MB/s)
XXH128 unaligned         :     102400 ->   167321 it/s (16339.9 MB/s)
XXH128 seeded            :     102400 ->   158050 it/s (15434.6 MB/s)
XXH128 seeded unaligned  :     102400 ->   158058 it/s (15435.4 MB/s)

xxhsum 0.7.2 (64-bits x86_64 + SSE2 little endian), MSVC 19.24.28314.00, by Yann Collet
Sample of 100 KB...
XXH3_64b                 :     102400 ->   243984 it/s (23826.5 MB/s)
XXH3_64b unaligned       :     102400 ->   240665 it/s (23502.4 MB/s)
XXH3_64b seeded          :     102400 ->   250037 it/s (24417.7 MB/s)
XXH3_64b seeded unaligne :     102400 ->   246639 it/s (24085.8 MB/s)
XXH128                   :     102400 ->   239926 it/s (23430.3 MB/s)
XXH128 unaligned         :     102400 ->   240151 it/s (23452.2 MB/s)
XXH128 seeded            :     102400 ->   245940 it/s (24017.6 MB/s)
XXH128 seeded unaligned  :     102400 ->   247279 it/s (24148.4 MB/s)

If I manually unroll the SSE2 code 4 times, I get something much faster:

xxhsum.exe 0.7.2 (32-bits i386 + SSE2 little endian), MSVC 19.24.28314.00, by Yann Collet
Sample of 100 KB...
XXH3_64b                 :     102400 ->   266229 it/s (25998.9 MB/s)
XXH3_64b unaligned       :     102400 ->   266228 it/s (25998.8 MB/s)
XXH3_64b seeded          :     102400 ->   258108 it/s (25205.9 MB/s)
XXH3_64b seeded unaligne :     102400 ->   257416 it/s (25138.3 MB/s)
XXH128                   :     102400 ->   239253 it/s (23364.6 MB/s)
XXH128 unaligned         :     102400 ->   239837 it/s (23421.6 MB/s)
XXH128 seeded            :     102400 ->   233737 it/s (22825.9 MB/s)
XXH128 seeded unaligned  :     102400 ->   233045 it/s (22758.3 MB/s)

xxhsum 0.7.2 (64-bits x86_64 + SSE2 little endian), MSVC 19.24.28314.00, by Yann Collet
Sample of 100 KB...
XXH3_64b                 :     102400 ->   321710 it/s (31417.0 MB/s)
XXH3_64b unaligned       :     102400 ->   320824 it/s (31330.5 MB/s)
XXH3_64b seeded          :     102400 ->   314594 it/s (30722.0 MB/s)
XXH3_64b seeded unaligne :     102400 ->   314304 it/s (30693.8 MB/s)
XXH128                   :     102400 ->   274966 it/s (26852.1 MB/s)
XXH128 unaligned         :     102400 ->   275961 it/s (26949.4 MB/s)
XXH128 seeded            :     102400 ->   279273 it/s (27272.8 MB/s)
XXH128 seeded unaligned  :     102400 ->   277759 it/s (27124.9 MB/s)

GCC i686 scalar is significantly faster than Clang:

xxhsum.exe 0.7.2 (32-bits i386 little endian), GCC 9.2.0, by Yann Collet
Sample of 100 KB...
XXH3_64b                 :     102400 ->    94618 it/s ( 9240.0 MB/s)
XXH3_64b unaligned       :     102400 ->    94337 it/s ( 9212.6 MB/s)
XXH3_64b seeded          :     102400 ->    96110 it/s ( 9385.8 MB/s)
XXH3_64b seeded unaligne :     102400 ->    94710 it/s ( 9249.0 MB/s)
XXH128                   :     102400 ->    76141 it/s ( 7435.7 MB/s)
XXH128 unaligned         :     102400 ->    73731 it/s ( 7200.3 MB/s)
XXH128 seeded            :     102400 ->    75995 it/s ( 7421.4 MB/s)
XXH128 seeded unaligned  :     102400 ->    73064 it/s ( 7135.1 MB/s)

xxhsum.exe 0.7.2 (32-bits i386 little endian), Clang 9.0.0 (https://github.com/msys2/MINGW-packages.git fdafa4d8c4022588676c8ec0985dafaf834258ae), by Yann Collet
Sample of 100 KB...
XXH3_64b                 :     102400 ->    57035 it/s ( 5569.8 MB/s)
XXH3_64b unaligned       :     102400 ->    56740 it/s ( 5541.0 MB/s)
XXH3_64b seeded          :     102400 ->    56848 it/s ( 5551.5 MB/s)
XXH3_64b seeded unaligne :     102400 ->    56687 it/s ( 5535.9 MB/s)
XXH128                   :     102400 ->    47836 it/s ( 4671.5 MB/s)
XXH128 unaligned         :     102400 ->    47582 it/s ( 4646.7 MB/s)
XXH128 seeded            :     102400 ->    47834 it/s ( 4671.2 MB/s)
XXH128 seeded unaligned  :     102400 ->    47129 it/s ( 4602.4 MB/s)

Re-adding XXH_FORCE_INLINE to XXH3_hashLong_internal_loop is slightly better, but still not great:

xxhsum.exe 0.7.2 (32-bits i386 little endian), Clang 9.0.0 (https://github.com/msys2/MINGW-packages.git fdafa4d8c4022588676c8ec0985dafaf834258ae), by Yann Collet
Sample of 100 KB...
XXH3_64b                 :     102400 ->    78483 it/s ( 7664.4 MB/s)
XXH3_64b unaligned       :     102400 ->    78141 it/s ( 7631.0 MB/s)
XXH3_64b seeded          :     102400 ->    77701 it/s ( 7588.0 MB/s)
XXH3_64b seeded unaligne :     102400 ->    76999 it/s ( 7519.4 MB/s)
XXH128                   :     102400 ->    61315 it/s ( 5987.8 MB/s)
XXH128 unaligned         :     102400 ->    61124 it/s ( 5969.2 MB/s)
XXH128 seeded            :     102400 ->    60948 it/s ( 5951.9 MB/s)
XXH128 seeded unaligned  :     102400 ->    60633 it/s ( 5921.2 MB/s)

Clang generates slightly better SSE2 code than GCC.
On AVX2, we might want to consider _mm256_srli_epi64 instead of _mm256_shuffle_epi32: Aside from 1st/2nd gen Ryzen's emulated AVX2, it seems that vpsrlq is equal or faster than vpshufd.
GCC's AVX2 performance is solely due to how it handles _mm256_loadu_si256

Apparently, with only -mavx2, GCC will split _mm256_loadu_si256 to _mm_loadu_si128 + _mm256_inserti128_si256, an optimization which only applies to Sandy and Ivy (which don't even have AVX2)

gcc -mavx2 -O3

xxhsum.exe 0.7.2 (64-bits x86_64 + AVX2 little endian), GCC 9.2.0, by Yann Collet
Sample of 100 KB...
XXH3_64b                 :     102400 ->   357210 it/s (34883.8 MB/s)
XXH3_64b unaligned       :     102400 ->   349488 it/s (34129.7 MB/s)
XXH3_64b seeded          :     102400 ->   355145 it/s (34682.1 MB/s)
XXH3_64b seeded unaligne :     102400 ->   349091 it/s (34090.9 MB/s)
XXH128                   :     102400 ->   400618 it/s (39122.8 MB/s)
XXH128 unaligned         :     102400 ->   400281 it/s (39089.9 MB/s)
XXH128 seeded            :     102400 ->   400000 it/s (39062.5 MB/s)
XXH128 seeded unaligned  :     102400 ->   396731 it/s (38743.2 MB/s)

gcc -mavx2 -O3 -march=haswell (or -mno-avx256-split-unaligned-load) which disables load splitting:

xxhsum.exe 0.7.2 (64-bits x86_64 + AVX2 little endian), GCC 9.2.0, by Yann Collet
Sample of 100 KB...
XXH3_64b                 :     102400 ->   538004 it/s (52539.4 MB/s)
XXH3_64b unaligned       :     102400 ->   433722 it/s (42355.6 MB/s)
XXH3_64b seeded          :     102400 ->   517874 it/s (50573.7 MB/s)
XXH3_64b seeded unaligne :     102400 ->   427260 it/s (41724.6 MB/s)
XXH128                   :     102400 ->   575281 it/s (56179.8 MB/s)
XXH128 unaligned         :     102400 ->   487777 it/s (47634.4 MB/s)
XXH128 seeded            :     102400 ->   576360 it/s (56285.2 MB/s)
XXH128 seeded unaligned  :     102400 ->   489298 it/s (47783.0 MB/s)

However, after checking the assembly, the real problem is that GCC is going waaaay too ham on the unrolling with -O3, unrolling it 16 times [!].

Compiling with -O2 gives something better:

xxhsum.exe 0.7.2 (64-bits x86_64 + AVX2 little endian), GCC 9.2.0, by Yann Collet
Sample of 100 KB...
XXH3_64b                 :     102400 ->   613364 it/s (59898.8 MB/s)
XXH3_64b unaligned       :     102400 ->   491900 it/s (48037.1 MB/s)
XXH3_64b seeded          :     102400 ->   614400 it/s (60000.0 MB/s)
XXH3_64b seeded unaligne :     102400 ->   491520 it/s (48000.0 MB/s)
XXH128                   :     102400 ->   575335 it/s (56185.1 MB/s)
XXH128 unaligned         :     102400 ->   491520 it/s (48000.0 MB/s)
XXH128 seeded            :     102400 ->   574206 it/s (56074.8 MB/s)
XXH128 seeded unaligned  :     102400 ->   491520 it/s (48000.0 MB/s)

easyaspi314 commented 4 years ago

This seems to fix GCC's codegen for AVX2.

/*
 * UGLY HACK:
 * GCC usually generates the best code with -O3 for xxHash,
 * except for AVX2 where it is overzealous in its unrolling 
 * resulting in code roughly 3/4 the speed of Clang.
 *
 * There are other issues, such as GCC splitting _mm256_loadu_si256
 * into _mm_loadu_si128 + _mm256_inserti128_si256 which is an
 * optimization which only applies to Sandy and Ivy Bridge... which
 * don't even support AVX2.
 *
 * That is why when compiling the AVX2 version, it is recommended
 * to use either
 *   -O2 -mavx2 -march=haswell
 * or
 *   -O2 -mavx2 -mno-avx256-split-unaligned-load
 * for decent performance, or just use Clang instead.
 *
 * Fortunately, we can control the first one with a pragma
 * that forces GCC into -O2, but the other one we can't without
 * "failed to inline always inline function due to target mismatch" 
 * warnings.
 */
#if XXH_VECTOR == XXH_AVX2 /* AVX2 */ \
  && defined(__GNUC__) && !defined(__clang__) /* GCC, not Clang */ \
  && defined(__OPTIMIZE__) && !defined(__OPTIMIZE_SIZE__) /* respect -O0 and -Os */
#  pragma GCC push_options
#  pragma GCC optimize("-O2")
#endif 

/* body of xxh3.h */

/* Pop our optimization override from above */
#if XXH_VECTOR == XXH_AVX2 /* AVX2 */ \
  && defined(__GNUC__) && !defined(__clang__) /* GCC, not Clang */ \
  && defined(__OPTIMIZE__) && !defined(__OPTIMIZE_SIZE__) /* respect -O0 and -Os */
#  pragma GCC pop_options
#endif

Cyan4973 / xxHash

General discussion about XXH3 #175

Purpose

Prototype

Result