Thanks @aras-p, that's very instructive.

128 bytes is the threshold between the "small input" and "large input" modes. The "large input" mode depends heavily on the performance of vector instructions. It works best with SSE2 and AVX2, which it was designed for. @easyaspi314 created a dedicated variant for NEON, which is very helpful. However, it's difficult to tell whether the NEON code path is correctly triggered in your test. The source code is supposed to auto-detect local conditions, but the build-macro magic may be missing it.

An alternative is to force the implementation onto a specific code path, using the XXH_VECTOR build macro. For NEON, the code is 3, which for gcc gives: -DXXH_VECTOR=3.
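A minimal sketch of that override, assuming the usual single-header include style (the 0/1/2/3 mapping follows the description above):

```c
/* Sketch: force a specific code path instead of relying on auto-detection.
 * Either pass it on the command line:
 *     gcc -O3 -DXXH_VECTOR=3 ...        (3 == NEON in this version)
 * or define it before the xxHash headers are included. */
#define XXH_VECTOR 3   /* 0 = scalar, 1 = SSE2, 2 = AVX2, 3 = NEON */
#include "xxhash.h"    /* assumed include style; adjust to however xxh3.h is pulled in */
```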
WebAssembly is unlikely to be able to generate vector instructions, at least not yet, so it probably relies on the universal "scalar" code. The scalar code wasn't properly optimized in v0.7, relying too much on the compiler to find the right transformations. In the current dev branch it has been rewritten, in a way which is easier for most compilers to optimize, providing generally nice performance improvements (clang is even able to auto-vectorize it...). The rewritten "scalar" mode is expected to perform as well as, or better than, XXH32.
> However, it's difficult to tell if the NEON code path is correctly triggered in your test

It is triggered properly, so 🤷‍♀️
I'll test the dev branch soon.
Tested the dev branch: on WebAssembly, larger inputs do get faster, but not as fast as ye olde xxHash:

On ARM (iPhone SE), it's still about the same speed. Both the NEON and SCALAR code paths have a throughput drop at >128-byte inputs.
Testing on x64 with Visual Studio 2017 (on the current dev branch) -- the SSE path does not kick in, since the code only checks for __SSE2__, which VS does not define out of the box. Extending the check to this instead:
# elif defined(__SSE2__) || defined(_M_AMD64) || defined(_M_X64)
# define XXH_VECTOR XXH_SSE2
makes the SSE2 code path be used automatically on x64, and the results are better. However, at input sizes between 128 and 500 bytes it still loses to xxHash64.
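For context, the full detection chain with this MSVC-friendly check added would look roughly like the sketch below; the exact surrounding lines and ordering in xxh3.h may differ:

```c
/* Sketch of the vector-path auto-detection with the MSVC-friendly SSE2 check. */
#ifndef XXH_VECTOR
#  if defined(__AVX2__)
#    define XXH_VECTOR XXH_AVX2
#  elif defined(__SSE2__) || defined(_M_AMD64) || defined(_M_X64)
#    define XXH_VECTOR XXH_SSE2
#  elif defined(__ARM_NEON__) || defined(__ARM_NEON)
#    define XXH_VECTOR XXH_NEON
#  else
#    define XXH_VECTOR XXH_SCALAR
#  endif
#endif
```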
Thanks for the macro suggestion @aras-p, it will be very useful for Visual. Indeed, SSE2 is guaranteed to be present on any x64 CPU, so it makes sense to detect that property through the more common build macros.
I will look again at ARM performance; it may well be disappointing at this stage. One issue with ARM is that performance is so different from one chip to another, and from one compiler to another, that it's much more difficult to draw conclusions from a single test, which complicates development.
Regarding Wasm: thanks for this very interesting test! We know that the scalar path is slower than XXH64 on 64-bit instruction sets. It's only designed to be faster than XXH32 on both 32- and 64-bit instruction sets, which means it's much faster than XXH64 on 32-bit instruction sets. The surprising part here is that I was expecting Wasm to use a 32-bit instruction set, in which case XXH3 should dominate XXH64. That's clearly not the case in your test. So a question is: does Wasm generate 64-bit executables now?
Regarding the performance gap: yes, it's known that there is a gap on reaching the "long mode". This mode features some non-trivial startup cost, which takes time to recoup. It's expected that by approximately 512 bytes, performance should be back on par, though exact mileage varies, depending on CPU type and compiler version.

To reduce this gap, one possibility could be to extend the short mode to larger inputs. So far, this solution was avoided out of fear that the core hash function becomes too large, which can be detrimental to the instruction cache. Instruction cache pollution cannot be measured with a synthetic hash test, because it only matters in mixed workloads (when the hash is just a part of a longer pipeline). A non-unrolled loop might help to get there without this side effect. It will mess with the branch predictor a bit, but I suspect this effect is only important for short inputs, hence medium-size inputs can likely afford it.

I'll look at this possibility, and see if it can be integrated smoothly into the current design.
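To make the shape of that change concrete, here is a hedged sketch of the kind of size dispatch being discussed; the helper names and the threshold value are placeholders, not the actual xxh3.h internals:

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder helpers standing in for the real implementations. */
uint64_t hash_short(const void* data, size_t len);   /* current <= 128-byte path    */
uint64_t hash_mid  (const void* data, size_t len);   /* proposed mid-size extension */
uint64_t hash_long (const void* data, size_t len);   /* vectorized "long" mode      */

#define MIDSIZE_MAX 384   /* the threshold under discussion (see below) */

uint64_t xxh3_dispatch_sketch(const void* data, size_t len)
{
    if (len <= 128)         return hash_short(data, len);
    if (len <= MIDSIZE_MAX) return hash_mid(data, len);   /* extends the cheap path */
    return hash_long(data, len);   /* startup cost, recouped on large inputs */
}
```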
> does Wasm generate 64-bit executables now?
My understanding is that Wasm has native 64-bit types (see https://github.com/WebAssembly/design/blob/master/Semantics.md#64-bit-integer-operators) that probably map almost 1:1 to hardware instructions, but "pointer sizes" are 32-bit (see https://github.com/WebAssembly/design/blob/master/Semantics.md#addressing), so in that respect it's a 32-bit platform. But I think the 64-bit operations should be "fast" when running on an actual 64-bit CPU.
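A tiny probe illustrates both sides of this (assuming an Emscripten-style wasm32 build, where it should report 4-byte pointers next to 8-byte 64-bit integers):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 64-bit integer math maps to native i64 instructions in Wasm... */
    uint64_t x = 0x9E3779B185EBCA87ULL;
    uint64_t y = x * 0xC2B2AE3D27D4EB4FULL;   /* a single i64.mul */

    /* ...while the address space (and therefore pointers/size_t) is 32-bit. */
    printf("sizeof(void*)=%u sizeof(uint64_t)=%u y=%016llx\n",
           (unsigned)sizeof(void*), (unsigned)sizeof(uint64_t),
           (unsigned long long)y);
    return 0;
}
```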
So it would be somewhat equivalent to the x32 mode in Linux?

Interesting, that's a subtle difference. And indeed, it makes wasm much more 64-bit friendly.
By the way, it seems that even SSE2 is available to Wasm: https://emscripten.org/docs/porting/simd.html

> To enable SIMD, pass the -msimd128 flag at compile time
Oooh, turns out Xcode by default uses the -Os optimization level when compiling for iOS, and I did not notice that. The inner loop inside XXH3_accumulate_512 was not being unrolled. Compiling with -O2 makes a much better picture:

Sorry for the confusion! The MSVC and WebAssembly points still stand, since they are compiled with /O2 and -O2 respectively.
> The MSVC and WebAssembly points still stand, since they are compiled with /O2 and -O2 respectively.

Does it sound reasonable to attempt enabling the SIMD flag for WebAssembly?
> Does it sound reasonable to attempt enabling the SIMD flag for WebAssembly?

Maybe, but without some additional work it won't do magic, since it requires the code to not use SSE2 or NEON intrinsics, but rather Clang's special "vector types" extension (which looks very much like shader languages -- e.g. float4 etc.). I'm not sure what the browser support for that SIMD feature is either, since today it still looks to be "experimental". Maybe I'll play around with it someday soon, stay tuned!
Wow, good point: it's not about enabling Intel's intrinsics at all! Obvious, in retrospect...

Hence, it completely relies on the scalar code path (or on a new one written specifically for the clang vector extension).
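For reference, such a clang-vector-extension path could look roughly like the sketch below; the types and the mixing step are illustrative only, not the actual xxh3 kernel:

```c
/* Illustrative only: one 128-bit lane of an accumulation step written with
 * clang/GCC "vector extension" types instead of SSE2 or NEON intrinsics.
 * The compiler lowers these operations to whatever SIMD the target offers
 * (SSE2, NEON, or wasm SIMD once -msimd128 is usable). */
typedef unsigned long long u64x2 __attribute__((vector_size(16)));
typedef unsigned int       u32x4 __attribute__((vector_size(16)));

static void accumulate_lane(u64x2* acc, u32x4 input, u32x4 key)
{
    u32x4 const mixed = input ^ key;            /* element-wise xor        */
    u64x2 const prod  = {                       /* widening 32x32 products */
        (unsigned long long)mixed[0] * mixed[1],
        (unsigned long long)mixed[2] * mixed[3]
    };
    *acc += prod;                               /* element-wise add        */
}
```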
There is an interesting note about auto-vectorization:

> This will also turn on LLVM's autovectorization passes, so no source modifications are necessary to benefit from SIMD.

I mention it because I remember being impressed by LLVM auto-vectorization with recent versions of XXH3 (dev branch). It's not as good as the SSE2 code path, but it still gets a hefty performance boost.
| XXH_VECTOR | Apple llvm 10.0 | gcc v8.3.0 |
|---|---|---|
| 0 "scalar" | 24 GB/s | 11 GB/s |
| 1 "sse2" | 30 GB/s | 27 GB/s |
| 2 "avx2" | 55 GB/s | 45 GB/s |
Notice the huge speed difference for the scalar path, which I interpret as "llvm was able to auto-vectorize the scalar path".
Updated SSE2 detection macro in dev branch for Visual, following your suggestion.
As suggested by your graph, it might be preferable to extend the "short mode" to longer input lengths. In the midsize feature branch, I'm trying to find an acceptable threshold for this extension.

It seems there won't be a "good size for everybody". The "ideal" threshold varies a lot depending on target and compiler version. For the time being, this feature branch settles on 384 bytes. It might be a bit high, since I can measure a loss of performance in the 250-384 range on my macOS laptop when using the default compiler. The situation is different though when using gcc on the same laptop. And I guess different CPUs or compilers can get different results.

Anyway, I would appreciate any feedback on this topic, should you feel that the threshold would be better positioned at some different value.
It really does depend on the compiler. Clang excels at this code, but I found both MSVC and GCC struggling to generate decent code. IDK why, it is pretty damn obvious what code we want... :confused:

I have considered doing a manual partial unroll like we do for XXH32/64. Two iterations per branch seems to be beneficial on most platforms, and anything larger seems to be inconsistent. x86, however, seems to prefer the entire thing to be unrolled. :thinking:
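As a sketch, a 2x partial unroll could look roughly like this (the stripe size and helper name are stand-ins, not the real xxh3.h symbols):

```c
#include <stddef.h>
#include <stdint.h>

#define STRIPE_LEN 64   /* stand-in for the real stripe size */

/* stand-in for the real per-stripe accumulation routine */
void accumulate_512(uint64_t* acc, const uint8_t* input, const uint8_t* secret);

static void accumulate_block(uint64_t* acc, const uint8_t* input,
                             const uint8_t* secret, size_t nbStripes)
{
    size_t n = 0;
    /* two stripes per iteration: enough to keep the pipeline busy
     * without blowing up code size */
    for (; n + 2 <= nbStripes; n += 2) {
        accumulate_512(acc, input + (n + 0) * STRIPE_LEN, secret + (n + 0) * 8);
        accumulate_512(acc, input + (n + 1) * STRIPE_LEN, secret + (n + 1) * 8);
    }
    if (n < nbStripes)   /* odd stripe count: handle the last one separately */
        accumulate_512(acc, input + n * STRIPE_LEN, secret + n * 8);
}
```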
I'll try to get back to work on this next week or so. I still have a few classes to finish up.
As for the threshold, I think there is nothing wrong with bumping it up a bit.
OK, so I've been running a long series of performance tests over the weekend, to pin down a more precise "threshold" between the short and long modes.

With gcc v8.3.0, the tested 384-byte threshold feels relatively good: it falls right in the section where the performance of the short and long modes overlaps (it's not a straight line, there is no single point where they become equal).

For clang v8.0, 384 is too large though. Speed losses (when using the "short" mode) start before reaching 200 bytes, and only increase afterwards.

For Visual, 384 is way too small! That's because the performance cliff on reaching the "long mode" (using SSE2) is extreme: there is a 4x factor when moving from "short" to "long" mode! I must say I wasn't expecting such a huge impact, since I mostly use gcc and clang on a daily basis. Visual needs almost 1 KB of input before the short and long mode curves start to overlap.

So, based on these figures, I'm currently considering updating the threshold to something like 288 bytes, leaving only a small section where performance is detrimental for both gcc and clang, in opposite directions, and where the difference is "reasonable enough".

For Visual, compared to the current threshold (128 bytes), this will only help a little, by delaying the point at which the performance cliff happens, and making it smaller (<2x). But it will still be there.
I suspect that Visual requires a dedicated investigation, in order to tame, or solve, the performance cliff observed on reaching the SSE2 mode.
To give an idea of the performance gap between compilers: at 300-byte inputs, presuming the long mode (as now) using SSE2, Visual generates ~26M hashes/sec (throughput, fixed size), gcc-8 is at ~42M hashes/sec, and clang v8 reaches ~58M hashes/sec. So it's a pretty large difference.

Note: in contrast, performance differences for the "short" mode are much smaller: still presuming 300-byte inputs, throughput tests put gcc-8 at ~49 MH/s, while both Visual and clang are at ~46 MH/s. clang is way better for vector code, and the hope is that this is the long-term tendency: compilers will improve and, sometime in the future, eventually become as good as clang is today.
Great investigation! BTW which MSVC version did you use for the tests? My tests were on the VS2017 15.9 version; looking at release notes they supposedly improved SIMD-related optimizations quite a bit in VS2019 (but I haven't checked that myself).
And yeah 384 bytes cutoff for long mode sounds like a good balance.
I was using Visual Studio 2017, though I did not check the exact version.

I think I have another VM with Visual Studio 2019, so I could check. The problem with this VM is that it's unsuitable for precise benchmarks, due to environmental noise, but it should be decent enough to evaluate whether vector performance has improved or not.
edit: from what I can tell, Visual 2019 doesn't help much with the performance impact of switching to SSE2 mode. I find the same roughly 4x impact when switching from "short" to "long" mode, using the current cut-off threshold (128-129 bytes).
I've found the MS blog where they announce vector code optimizations for VS2019: https://devblogs.microsoft.com/cppblog/game-performance-and-compilation-time-improvements-in-visual-studio-2019/
I couldn't compare VS2017 vs VS2019 directly. It might be that 2019 is indeed better. It's just still pretty bad in terms of startup time. The "steady" performance is decent though (22 GB/s, vs 26 GB/s for gcc and 28 GB/s for clang).
Hi @aras-p, what would you recommend for testing Wasm performance?
For information, in the dev branch, the code has been updated in preparation for the upcoming streaming mode. This step was required to allow users to provide a "secret" of any kind, to alter the hash result.

A side effect of this change is that speed on Visual Studio 2019 has greatly increased when enabling AVX2 code generation. On my laptop, it now reaches ~50 GB/s, which is >2x faster than the SSE2 version, for some reason. The SSE2 path doesn't benefit as much, gaining less than 10%, maybe because it was already in decent shape.

I still have to integrate the "midsize" patch. I must first ensure this extension doesn't introduce poor-quality hashes on the affected range.
> what would you recommend for testing Wasm performance?

My little testbed is here: https://github.com/aras-p/HashFunctionsTest -- I can try the latest dev branch and see whether anything has changed on Wasm.
The scalar code has not changed much since the last test, so results are expected to be similar.

One objective would be to introduce the -msimd128 flag somewhere in the build chain, to see if the clang compiler is able to auto-vectorize the scalar code path in wasm, the same way it can auto-vectorize native x64 assembly.
Hmm interesting, on latest dev (ac8c5b2) it's mostly the same, then starts to be slightly slower past 128 bytes and up to 1 KB, and then there's a perf drop beyond 1 KB input size, compared to the previous version. I see this in Wasm both on Chrome 75 and Firefox 67.
I tried using the -msimd128 argument for Emscripten, but eventually the result has a "SIMD is used, but not supported in WASM mode yet" error in it. I'm using the latest stable toolchain, as far as I can tell, so maybe the whole SIMD ordeal is still "too early" at this point on Wasm.
The drop at 1 KB is really strange, and really strong. The one thing that comes to mind is that it may correspond to the "scrambler" stage, which happens every KB when using all default parameters. This stage, while important to ensure good mixing of the input, is supposed to be insignificant in terms of CPU usage. It shouldn't be visible in the graph. It's unclear why it nonetheless shows up. It must have appeared with a recent modification, since it wasn't the case in previous measurements. I'll investigate.
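To make the suspicion concrete, the long-mode structure being described is roughly the following; names and constants are placeholders, the point is just that a scramble step runs once per ~1 KB block:

```c
#include <stddef.h>
#include <stdint.h>

#define STRIPE_LEN        64
#define STRIPES_PER_BLOCK 16   /* 16 * 64 = 1024 bytes per block */

/* stand-ins for the real routines */
void accumulate_512(uint64_t* acc, const uint8_t* input, const uint8_t* secret);
void scramble_acc  (uint64_t* acc, const uint8_t* secret);

static void hash_long_sketch(uint64_t* acc, const uint8_t* input,
                             const uint8_t* secret, size_t nbBlocks)
{
    for (size_t b = 0; b < nbBlocks; b++) {
        for (size_t s = 0; s < STRIPES_PER_BLOCK; s++)
            accumulate_512(acc, input + b * 1024 + s * STRIPE_LEN, secret + s * 8);
        /* runs once per KB; supposed to be negligible, which is why a visible
         * drop at 1 KB in the Wasm build is surprising */
        scramble_acc(acc, secret);
    }
}
```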
So, for information, I noticed a strange behavior, explained in code comments at this line: https://github.com/Cyan4973/xxHash/blob/dev/xxh3.h#L688

It's one situation where it's difficult to have it all. clang does a great job at auto-vectorizing SSE2 (26 GB/s) if this function is static, but completely fails if it is FORCE_INLINE (13 GB/s). On the other hand, if AVX2 is available, clang will auto-vectorize very well (45 GB/s) if this function is inlined, while if it's not, it will basically provide the same result as SSE2 (26 GB/s).

On another front, Visual Studio clearly has a preference for FORCE_INLINE, featuring a big benefit for AVX2 (45 GB/s), and a more minor benefit for SSE2 (24 GB/s).

I haven't been able to test Wasm performance (both systems where I installed emscripten failed, for different environmental reasons). But I suspect the strange drop at 1 KB might be related, either to the same function, or possibly to inlining or not inlining this other one.
Life would have been too simple if there was one setting good for everyone...
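A minimal sketch of the dilemma, using made-up names; the point is only that the same function body, declared plain static or force-inlined, gets vectorized differently depending on compiler and target:

```c
#include <stdint.h>
#include <string.h>

#if defined(_MSC_VER)
#  define FORCE_INLINE static __forceinline
#else
#  define FORCE_INLINE static inline __attribute__((always_inline))
#endif

/* Toggle to experiment: per the observations above, clang+SSE2 auto-vectorizes
 * the plain-static form better, while MSVC and clang+AVX2 prefer force-inlining. */
#if defined(PREFER_FORCE_INLINE)
FORCE_INLINE
#else
static
#endif
void accumulate_sketch(uint64_t acc[8], const uint8_t* input, const uint8_t* secret)
{
    for (size_t i = 0; i < 8; i++) {
        uint64_t v, s;
        memcpy(&v, input  + 8 * i, sizeof v);   /* unaligned-safe loads */
        memcpy(&s, secret + 8 * i, sizeof s);
        acc[i] += (v ^ s) * ((uint32_t)(v ^ s) | 1);   /* placeholder mixing step */
    }
}
```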
For Wasm, inlining vs static vs non-inlining does not seem to affect things much there. What does seem to help, as in no more performance dip for >1 KB data sizes, is removing this from XXH3_accumulate, i.e. no longer forcing the unroll:
#if defined(__clang__) && !defined(__OPTIMIZE_SIZE__) && !defined(__ARM_ARCH)
# pragma clang loop unroll(enable)
#endif
That makes sense, less code to translate.

OK, I guess we could special-case Wasm, and remove the unrolling statement in this case. It's unexpected though that this is the reason, since this unrolling statement has been there for a while now, was already present in v0.7.0, and it was not producing such a big drop at 1 KB back then. Why it would now be a problem is mysterious. Maybe some combination of FORCE_INLINE and unrolling?...
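For reference, the special case could be as small as adding one more exclusion to the existing guard, placed just before the accumulate loop as before (a sketch only; either __wasm__ or __EMSCRIPTEN__ could serve as the key):

```c
#if defined(__clang__) && !defined(__OPTIMIZE_SIZE__) \
    && !defined(__ARM_ARCH) && !defined(__EMSCRIPTEN__)
#  pragma clang loop unroll(enable)
#endif
```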
Strangely enough, for wasm, I'm unable to observe either the performance drop at 129 bytes (well, there is one, but it is small) or the drop at 1 KB (almost non-existent) on my test platform. Not sure what's happening, maybe some difference in exact tooling?...

I will nonetheless proceed with extending the "short mode" to longer lengths, since the performance drop at 129 bytes is clearly visible on native code.
With the latest updates of the 128-bit variant merged, XXH3 is more or less ready for a new release. It integrates the mid-size strategy discussed here, to reduce the performance gap after 128 bytes. Streaming mode for long inputs is available, for both the 64- and 128-bit variants. Performance of both variants should follow a roughly similar shape, with the 64-bit one being generally ~15% faster.

If you wish @aras-p, you may be interested in benchmarking this latest version (in the dev branch) before it becomes a release, so that it gets a chance to incorporate potential feedback.
Ok so here's a comparison on various platforms/compilers; I've only tested the 128-bit variant since that's what I'm interested in :) In all the graphs I've included the previous xxHash dev branch variant from June 6 (0117f18), as well as the current dev branch from July 26 (f012d4e). The x64 PC tests also include Meow hash (the one that uses AES-NI instructions), and as expected it runs circles around other general-purpose hashes, going up to 35 GB/s or so. My little testbed project is at the same place as before: https://github.com/aras-p/HashFunctionsTest
MacBookPro 2018 (Core i9 2.9GHz), Xcode 10.2.1, x64. The perf dip moved to larger input sizes, and at larger sizes the new XXH3 seems to be a bit slower than before.
PC (AMD TR 1950X 3.4GHz), Visual Studio 2017 (15.9), x64. Similar to Mac, the perf dip moved to larger sizes, and the new XXH3 seems to be a bit slower than before, up until even larger sizes where it becomes faster.
Same as above, just with Visual Studio 2019 (16.2). XXH3 results are improved a bit across whole range of input sizes, I suspect due to SIMD optimizer codegen improvements in 2019. General performance profile similar.
iPhone/ARM (Apple A9), ARM64. The perf dip at midsizes is gone, but at larger input sizes hashing is slower than before.
Now for 32-bit-ish platforms. WebAssembly, on 2018 MacBookPro, Chrome 75. General perf profile similar, hashing slightly slower at larger input sizes.
PC (AMD TR 1950X 3.4GHz), Visual Studio 2017 (15.9), x86 i.e. 32-bit. Something in XXH3 has changed that makes it much faster now at larger sizes, nice!
Same as above, just with Visual Studio 2019 (16.2). Very similar results.
Thanks for this very thorough and insightful analysis @aras-p.

Some thoughts:
Speed of new version: I expect the new XXH128 to be ~10% slower than the previous one. This is the consequence of altering the design to answer a grievance about the previous version, namely, that a combination of 64-bit mixers doesn't make a 128-bit hash. While I would agree if the goal was a cryptographic-level hash, I'm not convinced it's as important for a fast hash merely offering good dispersion qualities, because it takes a very coordinated scenario to degrade the 128-bit hash into a 64-bit one. Yet, there is some polarization on the topic, and I don't want to waste energy on it. Therefore, now, every input impacts 128 bits of accumulator immediately. This translates into some additional runtime cost, which feels reasonable, but is still a cost. I'm also surprised that it can sometimes be faster than the previous version, but maybe something is better "streamlined" in the new version.
Performance dip: The 128-bit final mixer has a larger fixed cost than the 64-bit one, as expected. Therefore, it takes more time to dilute that cost, meaning the mid-size performance dip is larger. But the new mid-size mode, introduced for XXH3_64bit() and re-employed for XXH3_128bit(), still uses the same size threshold as the 64-bit variant. The goal was to make it compatible with the existing XXH3_state_t object used in streaming mode, and to share most of the streaming code. This can be discussed. If this is considered a sub-optimal choice, it could be updated.
Comparison with AES-based algorithm: SSE2 is not enough to face special-purpose hardware. XXH3 is designed to leverage AVX2, when available, in order to reach similar performance territory. For example, on my laptop, it reaches ~45 GB/s on "large" data. Yet, while SSE2 is a guaranteed capability of any x64 CPU, later SIMD versions are not. I suspect that's the reason why AVX2 wasn't benchmarked.
Wasm performance: yep, we still have the issue that, at its core, XXH3 is a vectorizable algorithm using 32-bit operations, and it's comparatively less efficient when facing a 64-bit instruction set without vectorization capability. This generally does not happen (x64 necessarily has SSE2, arm64 necessarily has NEON, etc.), but wasm is one of those cases where it does. I'm afraid I don't have a great solution for that yet...
Visual Studio x86 performance: I presume your new detection macro makes it possible for Visual to pick up the SSE2 mode; that's likely the reason for this great speed-up :) Thanks!
Yup, it all makes sense, thanks for the detailed explanation! And indeed the comparison with Meow hash is not fair, especially since I haven't compiled XXH3 with AVX2 either.
Hmm, so I was checking the wasm assembly.

For starters, let's force the manual 128-bit multiply on wasm. Clang defines __uint128_t on wasm, but silently emits a call to __multi3. This not only has the overhead of a function call, but it also results in two redundant 64->64 multiplies by zero. While this only wastes a few cycles on 64-bit targets, you are probably looking at 25+ cycles per multiply on a 32-bit browser.
Obviously, benchmarks will tell the truth. @aras-p, can you try adding !defined(__wasm__) && to the __SIZEOF_INT128__ check and run the bench again?

Also, try disabling the unroll pragma; __wasm__ should be checked there as well.
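In other words, the suggested guard amounts to roughly the following (a sketch with placeholder names, not the exact xxh3.h code; the fallback is the usual 32x32->64 schoolbook multiply):

```c
#include <stdint.h>

/* Avoid the native 128-bit path on Wasm, where __uint128_t silently
 * lowers to a __multi3 library call. */
static uint64_t mul128_fold64_sketch(uint64_t lhs, uint64_t rhs)
{
#if defined(__SIZEOF_INT128__) && !defined(__wasm__)
    __uint128_t const product = (__uint128_t)lhs * (__uint128_t)rhs;
    return (uint64_t)product ^ (uint64_t)(product >> 64);
#else
    /* portable 32x32->64 decomposition: four small multiplies, no __multi3 */
    uint64_t const lo_lo = (lhs & 0xFFFFFFFF) * (rhs & 0xFFFFFFFF);
    uint64_t const hi_lo = (lhs >> 32)        * (rhs & 0xFFFFFFFF);
    uint64_t const lo_hi = (lhs & 0xFFFFFFFF) * (rhs >> 32);
    uint64_t const hi_hi = (lhs >> 32)        * (rhs >> 32);
    uint64_t const cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi;
    uint64_t const upper = (hi_lo >> 32) + (cross >> 32)        + hi_hi;
    uint64_t const lower = (cross << 32) | (lo_lo & 0xFFFFFFFF);
    return lower ^ upper;
#endif
}
```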
> can you try adding !defined(__wasm__) && to the __SIZEOF_INT128__ check and run the bench again?
I've tested this suggestion, and it seems to work well, at least on my laptop.
Performance seems improved by ~10%.
This improvement is only valid for small inputs, which use the 128-bit multiply, from size 9 to 240.
That's still welcome. I can certainly add it to xxh3.h.

update: done
> try disabling the unroll pragma. __wasm__ should be checked as well.

It's supposed to be done already in this version: https://github.com/Cyan4973/xxHash/blob/dev/xxh3.h#L763, though the check uses the __EMSCRIPTEN__ build macro.
I guess this topic can be closed now, since the introduction of the "middle-range" algorithm for len <= 240 has improved the situation. Of course, it can be re-opened if need be.
I was playing around with XXH3 from the 0.7.0 release, and there's a curious performance drop with data sizes >128 bytes on Arm64 and on WebAssembly, but not on x64.

Here's x64 (Xcode 10.1 Clang, on a MacBook Pro), things look great:

However, on Arm64 (iPhone SE, compiled with Xcode 10.1 Clang) -- notice the big drop after 128 bytes input size:

On WebAssembly (Firefox 67, running on the same MacBook Pro), there's a drop after 128 bytes too, where XXH3 becomes much slower than previous xxHash: