Thanks for the notification @pitrou, we probably need an additional test in AppVeyor CI to catch this issue.
Actually, the AppVeyor CI found the issues, as you mentioned:
https://ci.appveyor.com/project/YannCollet/xxhash/build/job/todic2778t438b7u#L59
but they were classified as silent warnings instead of blocking errors. So what we are lacking is the ability to make the test fail when it generates compilation warnings.
Passing -DCMAKE_C_FLAGS="/WX"
in the Visual Studio CMake build should do the trick
Thanks @wesm, this command works!
edit: well, it almost worked. If I pass -DCMAKE_C_FLAGS="/WX" after the project was already created and compiled properly, then it works. But if I start from an empty directory, the script seems to skip generating xxhash.lib, which makes the linking stage fail afterwards...
Hi @pitrou,
I was looking at the ARROW repository recently, and noticed the following comment:
> Use xxh3 instead of custom hashing code for non-tiny strings (...) Specialize for small hash strings (...) Even XXH3 isn't quite as fast.
The next lines develop a custom algorithm for inputs <= 16 bytes, using the same read access method as XXH3, but with a different mixing algorithm. It suggests the custom method is considered faster for this use case. There is of course nothing wrong in doing so, I'm just curious.
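To make "same read access method" concrete, here is an illustrative sketch (my own rewording, not Arrow's actual code): the short-input path reads the first and last word of the input, which overlap for short lengths so every byte is covered without a byte-by-byte loop, and feeds them to a mixer, here a hypothetical stand-in:

```c
/* Illustrative sketch only, not Arrow's actual code: XXH3-style reads for
 * inputs <= 16 bytes, combined with a hypothetical stand-in mixer. */
#include <stdint.h>
#include <string.h>

static uint64_t read64(const void* p) { uint64_t v; memcpy(&v, p, sizeof v); return v; }
static uint32_t read32(const void* p) { uint32_t v; memcpy(&v, p, sizeof v); return v; }

/* stand-in mixing step (an arbitrary xorshift-multiply, not XXH3's) */
static uint64_t mix(uint64_t x)
{
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 29;
    return x;
}

static uint64_t tiny_hash(const uint8_t* p, size_t len)  /* len <= 16 */
{
    if (len > 8)   /* 9..16: first 8 and last 8 bytes (possibly overlapping) */
        return mix(read64(p) ^ (read64(p + len - 8) + len));
    if (len >= 4)  /* 4..8: first 4 and last 4 bytes (possibly overlapping) */
        return mix((((uint64_t)read32(p) << 32) | read32(p + len - 4)) + len);
    if (len > 0)   /* 1..3: first, middle and last byte, plus the length */
        return mix(((uint64_t)len << 24) | ((uint64_t)p[0] << 16)
                   | ((uint64_t)p[len >> 1] << 8) | p[len - 1]);
    return 0;      /* len == 0 */
}
```

The point is only the access pattern: two overlapping loads cover any length within each branch's range.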
XXH3 was developed with the hope that it could cover most hash / checksum situations, including hashing small inputs to generate random bits for indexing hash tables. The goal is to simplify hashing for user applications, by offering a one-stop shop for all needs. I see that it's apparently not the case for ARROW, and I'm wondering if there's something important that could be learned from it, eventually allowing XXH3 to improve and get closer to the intended goal.
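As a concrete example of that use case (a sketch, not code from either project), deriving a bucket index for a power-of-two table is a one-liner with the one-shot API:

```c
/* Sketch: using XXH3's one-shot API to index a power-of-two hash table.
 * Assumes a xxHash version exposing XXH3 via xxhash.h; names are hypothetical. */
#define XXH_INLINE_ALL
#include "xxhash.h"

size_t bucket_of(const void* key, size_t len, size_t table_size /* power of two */)
{
    return (size_t)XXH3_64bits(key, len) & (table_size - 1);
}
```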
edit: fixed project name: ARROW, without a final S
Well to be fair, XXH3's short hash is not as trivial as it seems, at least on A64.
Here is my best (read: not that good lol) handwritten asm version of XXH3_len_0to16_64b:
```asm
.type XXH3_len_0to16_64b, @function
XXH3_len_0to16_64b:
cmp x1, #8
b.gt .LXXH3_len_9to16_64b // TAIL CALL
cmp x1, #4
b.ge .LXXH3_len_4to8_64b // TAIL CALL
cbnz x1, .LXXH3_len_1to3_64b // TAIL CALL
mov x0, xzr
ret
.LXXH3_len_1to3_64b:
ldrb w4, [x0]
lsr x5, x1, #1
ldrb w5, [x0, x5]
orr w4, w4, w5, lsl #8
sub x5, x1, #1
ldrb w5, [x0, x5]
orr w4, w4, w5, lsl #16
orr w4, w4, w1, lsl #24
ldr w8, [x2]
add x3, x3, x8
eor x3, x3, x4
movl x8, PRIME64_1
mul x0, x3, x8
// FALLTHROUGH
.type XXH3_avalanche, @function
XXH3_avalanche:
movl x1, PRIME64_3
eor x0, x0, x0, lsr #37
mul x0, x0, x1
eor x0, x0, x0, lsr #32
ret
.LXXH3_len_4to8_64b:
ldr w8, [x0]
sub x5, x1, #4
movl x6, PRIME64_2
ldr w5, [x0, x5]
orr x8, x8, x5, lsl #32
ldr x4, [x2]
add x4, x4, x3
eor x4, x4, x8
eor x4, x4, x4, lsr #51
movl w3, PRIME32_1
madd x0, x4, x3, x1
eor x0, x0, x0, lsr #47
mul x0, x0, x6
b XXH3_avalanche // TAIL CALL
.LXXH3_len_9to16_64b:
ldr x8, [x0]
ldr x4, [x2]
add x4, x4, x3
eor x8, x8, x4 // x8 = ll1
sub x6, x1, #8
ldr x6, [x0, x6]
ldr x2, [x2, #8]
sub x2, x2, x3
eor x3, x2, x6 // x3 = ll2
mul x4, x3, x8
umulh x5, x3, x8
eor x0, x5, x4
add x8, x8, x3
add x0, x0, x1
add x0, x0, x8
b XXH3_avalanche // TAIL CALL
```
(movl expands to mov/movk)
Granted, most of that is loading values, which can't really be changed.
@Cyan4973
(Nitpick: the project name is "Arrow", not "Arrows" :-))
> The next lines develop a custom algorithm for inputs <= 16 bytes, using the same read access method as XXH3, but with a different mixing algorithm.
Well, that's an interesting coincidence, because the "custom algorithm for inputs <= 16 bytes" was developed before switching to xxh3. So I'm glad to learn the code I wrote makes sense :-)
We're using this benchmark for string hashing: https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/hashing_benchmark.cc#L80-L112
With the custom algorithm in place, I get the following results (gcc 7.4.0, Ryzen 7 1700, Ubuntu 18.04):
```
HashSmallStrings 59455 ns 59440 ns 47088 bytes_per_second=3.42346G/s items_per_second=336.475M/s
HashMediumStrings 171256 ns 171211 ns 16345 bytes_per_second=7.57206G/s items_per_second=116.815M/s
HashLargeStrings 131766 ns 131732 ns 21437 bytes_per_second=14.5461G/s items_per_second=15.1824M/s
```
If I disable the custom algorithm (thus always using xxh3 regardless of length), the results are as follows:
```
HashSmallStrings 77020 ns 76999 ns 36339 bytes_per_second=2.64276G/s items_per_second=259.743M/s
HashMediumStrings 165283 ns 165236 ns 16950 bytes_per_second=7.84587G/s items_per_second=121.039M/s
HashLargeStrings 131967 ns 131932 ns 21421 bytes_per_second=14.524G/s items_per_second=15.1593M/s
```
The results are similar on less micro-level benchmarks that rely on a hash table. With the custom algorithm:
```
BuildStringDictionaryArray 1446292457 ns 1445849113 ns 1 bytes_per_second=236.218M/s
UniqueString10bytes/4194304/1024 71590401 ns 71559956 ns 10 bytes_per_second=558.972M/s
UniqueString10bytes/4194304/10240 104562043 ns 104519279 ns 7 bytes_per_second=382.705M/s
UniqueString100bytes/4194304/1024 182032838 ns 181926192 ns 4 bytes_per_second=2.14716G/s
UniqueString100bytes/4194304/10240 250706644 ns 250534700 ns 3 bytes_per_second=1.55917G/s
```
Without custom algorithm:
```
BuildStringDictionaryArray 1555853820 ns 1555344106 ns 1 bytes_per_second=219.588M/s
UniqueString10bytes/4194304/1024 81888786 ns 81862158 ns 8 bytes_per_second=488.626M/s
UniqueString10bytes/4194304/10240 112586934 ns 112535413 ns 6 bytes_per_second=355.444M/s
UniqueString100bytes/4194304/1024 188276504 ns 188173700 ns 4 bytes_per_second=2.07587G/s
UniqueString100bytes/4194304/10240 254715046 ns 254569574 ns 3 bytes_per_second=1.53445G/s
```
So, there is a noticeable slowdown, though far from catastrophic. Perhaps it's just this particular compiler. If you want, I can check tomorrow with clang?
Finally, to be honest, if we had used xxh3 from the start, I'm not sure I'd have bothered with a custom algorithm for tiny strings, because xxh3 is very fast already. But our pre-xxh3 hashing algorithm wasn't very good on tiny strings, so the custom algorithm was developed at the time.
Great ASM investigation @easyaspi314!
Note that I would expect some of these operations to be folded, depending on the exact variant.

The "full" variant must load the key values from the provided custom secret, and add the seed.
The "seed" variant uses the default secret: in this case, I would expect the compiler to embed the secret values directly into instructions, which saves the corresponding load instruction.
The "default" variant does not even need to add the seed to the default secret, since it's 0, saving yet another operation.

Therefore, the "default" variant is supposed to be the fastest one, and seems the most appropriate for hash tables.
At least, that would be my expectation. Obviously, it's compiler driven, and different compilers, flags and even versions may result in different outcomes.
Note also that providing XXH3 with XXH_INLINE_ALL and a custom secret bundled together should allow the compiler to make the same kind of optimization when using the "full" variant, resulting in the same instructions as the "default" variant, just with different constants.
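For illustration, here is a rough sketch (not code from either project) of how these three variants map to the one-shot API. It assumes a xxHash version where XXH3 is exposed via xxhash.h (older releases ship it in a separate xxh3.h), and my_secret is a hypothetical application-provided secret:

```c
/* Sketch only: how the three variants discussed above map to the one-shot API.
 * "my_secret" is a hypothetical application-provided secret, large enough to
 * satisfy xxHash's minimum secret size. */
#define XXH_INLINE_ALL
#include "xxhash.h"

static const unsigned char my_secret[192] = { 0x42 /* ..., application-chosen bytes */ };

/* "default" variant: default secret, implicit seed of 0;
 * the compiler can embed the secret words directly as immediates. */
XXH64_hash_t hash_default(const void* p, size_t n)
{
    return XXH3_64bits(p, n);
}

/* "seed" variant: default secret, plus a runtime seed added on top. */
XXH64_hash_t hash_seeded(const void* p, size_t n, XXH64_hash_t seed)
{
    return XXH3_64bits_withSeed(p, n, seed);
}

/* "full" variant: the secret is loaded from memory; with XXH_INLINE_ALL and a
 * compile-time-constant secret, those loads may be folded as well. */
XXH64_hash_t hash_custom(const void* p, size_t n)
{
    return XXH3_64bits_withSecret(p, n, my_secret, sizeof(my_secret));
}
```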
Ok, I've tried with clang 7.0. The absolute numbers are different, but the slowdown is similar (~30% slower on the tiny string hashing micro-benchmark, 5-10% slower on the hashtable benchmarks).
I get this on AppVeyor when building with 0.7.1 and XXH_INLINE_ALL: