Closed easyaspi314 closed 2 years ago
We can probably make multiple small steps progressively in this direction.
Performance matters, since it's an essential property attached to xxhash.
I'm sure there are several improvements or code simplifications that wouldn't impact performance, or barely,
but if a "dumb" compiler, say MSVC /O2
, see important speed regressions, it matters too.
So, as usual, it's a matter of balance.
Also, this effort might be partially linked to #550 .
but if a "dumb" compiler, say MSVC /O2, see important speed regressions...
(╯°□°)╯︵ ┻━┻
C:\code\xxhash> xxhsum.exe -b1
xxhsum.exe 0.8.1 by Yann Collet
compiled as 32-bit i386 + SSE2 little endian with MSVC 19.29.30137.00
Sample of 100 KB...
1#XXH32 : 102400 -> 48731 it/s ( 4758.9 MB/s)
C:\code\xxhash> xxhsum-outline-reroll.exe -b1
xxhsum-outline-reroll.exe 0.8.1 by Yann Collet
compiled as 32-bit i386 + SSE2 little endian with MSVC 19.29.30137.00
Sample of 100 KB...
1#XXH32 : 102400 -> 26091 it/s ( 2548.0 MB/s)
Why is msvc x86 allergic to unrolling fixed iteration loops?
Edit: Outlining and extracting without rerolling seems to be fine though...
This is what I was thinking. It uses some of the naming styles from XXH3.
It looks good to me
I think for XXH64, we should just use a nested loop for the bulk loop, as long as MSVC x64 unrolls it (but MSVC x64 is more liberal in unrolling anyways)
64-bit arithmetic is going to be hot garbage on MSVC x86 anyways thanks to _allmul
calls, and GCC and Clang know how to unroll it.
Side note: Extracting XXH64's internals in the same way somehow gave a slight boost to ARMv7-a with Clang 13 (1.5GB/s -> 1.7GB/s), even though it was inlined and unrolled just like before. 🤔
Draft at easyaspi314:modern_xxh32_xxh64
. I will make a PR once I do some benchmarking.
I also changed the mem32/mem64 fields to unsigned char arrays which shouldn't break binary ABI.
Should we remove XXH_OLD_NAMES as well?
Should we remove XXH_OLD_NAMES as well?
Let's plan that for v0.9.0
On a side note, I was toying with a mixed NEON/scalar XXH64.
On my Pixel 4a, clang and GCC get the same 2804 MB/s normally, but with half NEON and half scalar, Clang gets 3156 MB/s and GCC gets 2925 MB/s.
Since I already have the code I might as well make ARMv7-A do full NEON, and that actually gets 2704 MB/s on Clang compared to ~1GB/s normally.
However, the implementation is pretty ugly:
Side side note: Would a mixed SIMD/scalar benefit XXH3 as well? The integer pipeline is basically unused during hashLong, and we might benefit from doing a few lanes scalar.
Edit: Holy shit, it does (at least on aarch64). Doing a 6:2 split on the NEON path on clang makes it jump from 8.8 GB/s to 10.2 GB/s.
For XXH64
, I would rather preserve code simplicity, the very minor performance difference seems not worth it,
For XXH3
on the other hand, since we already manage multiple specialized code paths, a ~+15% performance increase is definitely large enough to justify updating the aarch64
implementation. A complex bonus question though is, will it be beneficial (with various degrees) on all arch64
, or beneficial for some, detrimental for others ? Difficult to tell.
It only seems to affect AArch64, but XXH3 runs incredibly with a 6:2 ratio in #632, even (mostly) fixing the lackluster performance from GCC (30% faster, but still slower than clang lol).
XXH64 definitely isn't worth it especially if it still can't beat XXH32.
Is this topic still active ? Should we keep this issue opened ? Referring to the XXH32/XXH64 modernization effort in the title, not later topics appearing in the thread.
Idea: XXH32 and XXH64 could be enhanced like so:
mem32/mem64
not beingunsigned char
Pros:
Cons: