Thanks for the notification @pitrou, we probably need an additional test in AppVeyor CI to catch this issue.
Actually, the AppVeyor CI found the issues, as you mentioned:
https://ci.appveyor.com/project/YannCollet/xxhash/build/job/todic2778t438b7u#L59
but they were classified as silent warnings instead of blocking errors. So what we are lacking is the ability to make the test fail when it generates compilation warnings.
Passing -DCMAKE_C_FLAGS="/WX"
in the Visual Studio CMake build should do the trick
Thanks @wesm, this command works!
edit: well, it almost worked. If I pass -DCMAKE_C_FLAGS="/WX" after the project was already created and compiled properly, then it works. But if I start from an empty directory, the script seems to skip generating xxhash.lib, which makes the linking stage fail afterwards...
Hi @pitrou,
I was looking at the ARROW repository recently, and noticed the following comment:
> Use xxh3 instead of custom hashing code for non-tiny strings (...) Specialize for small hash strings (...) Even XXH3 isn't quite as fast.
The next lines develop a custom algorithm for inputs <= 16 bytes, using the same read access method as XXH3, but with a different mixing algorithm. It suggests the custom method is considered faster for this use case. There is of course nothing wrong in doing so, I'm just curious.
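To make "same read access method" concrete, here is an illustrative sketch (my own rewording, not Arrow's actual code): the short-input path reads the first and last word of the input, which overlap for short lengths so every byte is covered without a byte-by-byte loop, and feeds them to a mixer, here a hypothetical stand-in:

```c
/* Illustrative sketch only, not Arrow's actual code: XXH3-style reads for
 * inputs <= 16 bytes, combined with a hypothetical stand-in mixer. */
#include <stdint.h>
#include <string.h>

static uint64_t read64(const void* p) { uint64_t v; memcpy(&v, p, sizeof v); return v; }
static uint32_t read32(const void* p) { uint32_t v; memcpy(&v, p, sizeof v); return v; }

/* stand-in mixing step (an arbitrary xorshift-multiply, not XXH3's) */
static uint64_t mix(uint64_t x)
{
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 29;
    return x;
}

static uint64_t tiny_hash(const uint8_t* p, size_t len)  /* len <= 16 */
{
    if (len > 8)   /* 9..16: first 8 and last 8 bytes (possibly overlapping) */
        return mix(read64(p) ^ (read64(p + len - 8) + len));
    if (len >= 4)  /* 4..8: first 4 and last 4 bytes (possibly overlapping) */
        return mix((((uint64_t)read32(p) << 32) | read32(p + len - 4)) + len);
    if (len > 0)   /* 1..3: first, middle and last byte, plus the length */
        return mix(((uint64_t)len << 24) | ((uint64_t)p[0] << 16)
                   | ((uint64_t)p[len >> 1] << 8) | p[len - 1]);
    return 0;      /* len == 0 */
}
```

The point is only the access pattern: two overlapping loads cover any length within each branch's range.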
XXH3 was developed with the hope that it could cover most hash / checksum situations, including hashing small inputs to generate random bits for indexing hash tables. The goal is to simplify hashing for user applications, by offering a one-stop shop for all needs. I see that it's apparently not the case for ARROW, and I'm wondering if there's something important that could be learned from it, eventually allowing XXH3 to improve and get closer to the intended goal.
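As a concrete example of that use case (a sketch, not code from either project), deriving a bucket index for a power-of-two table is a one-liner with the one-shot API:

```c
/* Sketch: using XXH3's one-shot API to index a power-of-two hash table.
 * Assumes a xxHash version exposing XXH3 via xxhash.h; names are hypothetical. */
#define XXH_INLINE_ALL
#include "xxhash.h"

size_t bucket_of(const void* key, size_t len, size_t table_size /* power of two */)
{
    return (size_t)XXH3_64bits(key, len) & (table_size - 1);
}
```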
edit: fixed project name: ARROW, without a final S
Well to be fair, XXH3's short hash is not as trivial as it seems, at least on A64.
Here is my best (read: not that good lol) handwritten asm version of XXH3_len_0to16_64b:
```asm
.type XXH3_len_0to16_64b, @function
XXH3_len_0to16_64b:
cmp x1, #8
b.gt .LXXH3_len_9to16_64b // TAIL CALL
cmp x1, #4
b.ge .LXXH3_len_4to8_64b // TAIL CALL
cbnz x1, .LXXH3_len_1to3_64b // TAIL CALL
mov x0, xzr
ret
.LXXH3_len_1to3_64b:
ldrb w4, [x0]
lsr x5, x1, #1
ldrb w5, [x0, x5]
orr w4, w4, w5, lsl #8
sub x5, x1, #1
ldrb w5, [x0, x5]
orr w4, w4, w5, lsl #16
orr w4, w4, w1, lsl #24
ldr w8, [x2]
add x3, x3, x8
eor x3, x3, x4
movl x8, PRIME64_1
mul x0, x3, x8
// FALLTHROUGH
.type XXH3_avalanche, @function
XXH3_avalanche:
movl x1, PRIME64_3
eor x0, x0, x0, lsr #37
mul x0, x0, x1
eor x0, x0, x0, lsr #32
ret
.LXXH3_len_4to8_64b:
ldr w8, [x0]
sub x5, x1, #4
movl x6, PRIME64_2
ldr w5, [x0, x5]
orr x8, x8, x5, lsl #32
ldr x4, [x2]
add x4, x4, x3
eor x4, x4, x8
eor x4, x4, x4, lsr #51
movl w3, PRIME32_1
madd x0, x4, x3, x1
eor x0, x0, x0, lsr #47
mul x0, x0, x6
b XXH3_avalanche // TAIL CALL
.LXXH3_len_9to16_64b:
ldr x8, [x0]
ldr x4, [x2]
add x4, x4, x3
eor x8, x8, x4 // x8 = ll1
sub x6, x1, #8
ldr x6, [x0, x6]
ldr x2, [x2, #8]
sub x2, x2, x3
eor x3, x2, x6 // x3 = ll2
mul x4, x3, x8
umulh x5, x3, x8
eor x0, x5, x4
add x8, x8, x3
add x0, x0, x1
add x0, x0, x8
b XXH3_avalanche // TAIL CALL
```
(movl expands to mov/movk)
Granted, most of that is loading values, which can't really be changed.
@Cyan4973
(Nitpick: the project name is "Arrow", not "Arrows" :-))
> The next lines develop a custom algorithm for inputs <= 16 bytes, using the same read access method as XXH3, but with a different mixing algorithm.
Well, that's an interesting coincidence, because the "custom algorithm for inputs <= 16 bytes" was developed before switching to xxh3. So I'm glad to learn the code I wrote makes sense :-)
We're using this benchmark for string hashing: https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/hashing_benchmark.cc#L80-L112
With the custom algorithm in place, I get the following results (gcc 7.4.0, Ryzen 7 1700, Ubuntu 18.04):
```
HashSmallStrings 59455 ns 59440 ns 47088 bytes_per_second=3.42346G/s items_per_second=336.475M/s
HashMediumStrings 171256 ns 171211 ns 16345 bytes_per_second=7.57206G/s items_per_second=116.815M/s
HashLargeStrings 131766 ns 131732 ns 21437 bytes_per_second=14.5461G/s items_per_second=15.1824M/s
```
If I disable the custom algorithm (thus always using xxh3 regardless of length), the results are as follows:
```
HashSmallStrings 77020 ns 76999 ns 36339 bytes_per_second=2.64276G/s items_per_second=259.743M/s
HashMediumStrings 165283 ns 165236 ns 16950 bytes_per_second=7.84587G/s items_per_second=121.039M/s
HashLargeStrings 131967 ns 131932 ns 21421 bytes_per_second=14.524G/s items_per_second=15.1593M/s
```
The results are similar on less micro-level benchmarks that rely on a hash table. With the custom algorithm:
```
BuildStringDictionaryArray 1446292457 ns 1445849113 ns 1 bytes_per_second=236.218M/s
UniqueString10bytes/4194304/1024 71590401 ns 71559956 ns 10 bytes_per_second=558.972M/s
UniqueString10bytes/4194304/10240 104562043 ns 104519279 ns 7 bytes_per_second=382.705M/s
UniqueString100bytes/4194304/1024 182032838 ns 181926192 ns 4 bytes_per_second=2.14716G/s
UniqueString100bytes/4194304/10240 250706644 ns 250534700 ns 3 bytes_per_second=1.55917G/s
```
Without custom algorithm:
```
BuildStringDictionaryArray 1555853820 ns 1555344106 ns 1 bytes_per_second=219.588M/s
UniqueString10bytes/4194304/1024 81888786 ns 81862158 ns 8 bytes_per_second=488.626M/s
UniqueString10bytes/4194304/10240 112586934 ns 112535413 ns 6 bytes_per_second=355.444M/s
UniqueString100bytes/4194304/1024 188276504 ns 188173700 ns 4 bytes_per_second=2.07587G/s
UniqueString100bytes/4194304/10240 254715046 ns 254569574 ns 3 bytes_per_second=1.53445G/s
```
So, there is a noticeable slowdown, though far from catastrophic. Perhaps it's just this particular compiler. If you want, I can check tomorrow with clang?
Finally, to be honest, if we had used xxh3 from the start, I'm not sure I'd have bothered with a custom algorithm for tiny strings, because xxh3 is very fast already. But our pre-xxh3 hashing algorithm wasn't very good on tiny strings, so the custom algorithm was developed at the time.
Great ASM investigation @easyaspi314!
Note that I would expect some of these operations to be folded, depending on the exact variant.

The "full" variant must load the key values from the provided custom secret, and add the seed.
The "seed" variant uses the default secret: in this case, I would expect the compiler to embed the secret values directly into instructions, which saves the corresponding load instruction.
The "default" variant does not even need to add the seed to the default secret, since it's 0, saving yet another operation.

Therefore, the "default" variant is supposed to be the fastest one, and seems the most appropriate for hash tables.
At least, that would be my expectation. Obviously, it's compiler driven, and different compilers, flags and even versions may result in different outcomes.
Note also that providing XXH3 with XXH_INLINE_ALL and a custom secret bundled together should allow the compiler to make the same kind of optimization when using the "full" variant, resulting in the same instructions as the "default" variant, just with different constants.
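For illustration, here is a rough sketch (not code from either project) of how these three variants map to the one-shot API. It assumes a xxHash version where XXH3 is exposed via xxhash.h (older releases ship it in a separate xxh3.h), and my_secret is a hypothetical application-provided secret:

```c
/* Sketch only: how the three variants discussed above map to the one-shot API.
 * "my_secret" is a hypothetical application-provided secret, large enough to
 * satisfy xxHash's minimum secret size. */
#define XXH_INLINE_ALL
#include "xxhash.h"

static const unsigned char my_secret[192] = { 0x42 /* ..., application-chosen bytes */ };

/* "default" variant: default secret, implicit seed of 0;
 * the compiler can embed the secret words directly as immediates. */
XXH64_hash_t hash_default(const void* p, size_t n)
{
    return XXH3_64bits(p, n);
}

/* "seed" variant: default secret, plus a runtime seed added on top. */
XXH64_hash_t hash_seeded(const void* p, size_t n, XXH64_hash_t seed)
{
    return XXH3_64bits_withSeed(p, n, seed);
}

/* "full" variant: the secret is loaded from memory; with XXH_INLINE_ALL and a
 * compile-time-constant secret, those loads may be folded as well. */
XXH64_hash_t hash_custom(const void* p, size_t n)
{
    return XXH3_64bits_withSecret(p, n, my_secret, sizeof(my_secret));
}
```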
Ok, I've tried with clang 7.0. The absolute numbers are different, but the slowdown is similar (~30% slower on the tiny string hashing micro-benchmark, 5-10% slower on the hashtable benchmarks).
I get this on AppVeyor when building with 0.7.1 and XXH_INLINE_ALL: