Thanks @aras-p, that's very instructive.

128 bytes is the threshold between the "small input" and "large input" modes. The "large input" mode depends heavily on the performance of vector instructions. It works best with SSE2 and AVX2, which it was designed for. @easyaspi314 created a dedicated variant for NEON, which is very helpful. However, it's difficult to tell whether the NEON code path is correctly triggered in your test. The source code is supposed to auto-detect local conditions, but the build-macro magic may be missing it.

An alternative is to force the implementation onto a specific code path, using the XXH_VECTOR build macro. For NEON, the code is 3, which for gcc gives: -DXXH_VECTOR=3.
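A minimal sketch of that override, assuming the usual single-header include style (the 0/1/2/3 mapping follows the description above):

```c
/* Sketch: force a specific code path instead of relying on auto-detection.
 * Either pass it on the command line:
 *     gcc -O3 -DXXH_VECTOR=3 ...        (3 == NEON in this version)
 * or define it before the xxHash headers are included. */
#define XXH_VECTOR 3   /* 0 = scalar, 1 = SSE2, 2 = AVX2, 3 = NEON */
#include "xxhash.h"    /* assumed include style; adjust to however xxh3.h is pulled in */
```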
WebAssembly is unlikely to be able to generate vector instructions, at least not yet, so it probably relies on the universal "scalar" code. The scalar code wasn't properly optimized in v0.7, relying too much on the compiler to find the right transformations. In the current dev branch it has been rewritten, in a way which is easier for most compilers to optimize, providing generally nice performance improvements (clang is even able to auto-vectorize it...). The rewritten "scalar" mode is expected to perform as well as, or better than, XXH32.
> However, it's difficult to tell if the NEON code path is correctly triggered in your test

It is triggered properly, so 🤷‍♀️
I'll test the dev branch soon.
Tested the dev branch: on WebAssembly, larger inputs do get faster, but not as fast as ye olde xxHash:

On ARM (iPhone SE), it's still about the same speed. Both the NEON and SCALAR code paths have a throughput drop at >128-byte inputs.
Testing on x64 with Visual Studio 2017 (on the current dev branch) -- the SSE path does not kick in, since the code only checks for __SSE2__, which VS does not define out of the box. Extending the check to this instead:
# elif defined(__SSE2__) || defined(_M_AMD64) || defined(_M_X64)
# define XXH_VECTOR XXH_SSE2
makes the SSE2 code path be used automatically on x64, and the results are better. However, at input sizes between 128 and 500 bytes it still loses to xxHash64.
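For context, the full detection chain with this MSVC-friendly check added would look roughly like the sketch below; the exact surrounding lines and ordering in xxh3.h may differ:

```c
/* Sketch of the vector-path auto-detection with the MSVC-friendly SSE2 check. */
#ifndef XXH_VECTOR
#  if defined(__AVX2__)
#    define XXH_VECTOR XXH_AVX2
#  elif defined(__SSE2__) || defined(_M_AMD64) || defined(_M_X64)
#    define XXH_VECTOR XXH_SSE2
#  elif defined(__ARM_NEON__) || defined(__ARM_NEON)
#    define XXH_VECTOR XXH_NEON
#  else
#    define XXH_VECTOR XXH_SCALAR
#  endif
#endif
```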
Thanks for the macro suggestion @aras-p, it will be very useful for Visual. Indeed, SSE2 is guaranteed to be present on any x64 CPU, so it makes sense to detect that property through the more common build macros.
I will look again at ARM performance; it may well be disappointing at this stage. One issue with ARM is that performance is so different from one chip to another, and from one compiler to another, that it's much more difficult to draw conclusions from a single test, which complicates development.
Regarding Wasm: thanks for this very interesting test! We know that the scalar path is slower than XXH64 on 64-bit instruction sets. It's only designed to be faster than XXH32 on both 32- and 64-bit instruction sets, which means it's much faster than XXH64 on 32-bit instruction sets. The surprising part here is that I was expecting Wasm to use a 32-bit instruction set, in which case XXH3 should dominate XXH64. That's clearly not the case in your test. So a question is: does Wasm generate 64-bit executables now?
Regarding the performance gap: yes, it's known that there is a gap on reaching the "long mode". This mode features some non-trivial startup cost, which takes time to recoup. It's expected that by approximately 512 bytes, performance should be back on par, though exact mileage varies, depending on CPU type and compiler version.

To reduce this gap, one possibility could be to extend the short mode to larger inputs. So far, this solution was avoided out of fear that the core hash function becomes too large, which can be detrimental to the instruction cache. Instruction cache pollution cannot be measured with a synthetic hash test, because it only matters in mixed workloads (when the hash is just a part of a longer pipeline). A non-unrolled loop might help to get there without this side effect. It will mess with the branch predictor a bit, but I suspect this effect is only important for short inputs, hence medium-size inputs can likely afford it.

I'll look at this possibility, and see if it can be integrated smoothly into the current design.
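To make the shape of that change concrete, here is a hedged sketch of the kind of size dispatch being discussed; the helper names and the threshold value are placeholders, not the actual xxh3.h internals:

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder helpers standing in for the real implementations. */
uint64_t hash_short(const void* data, size_t len);   /* current <= 128-byte path    */
uint64_t hash_mid  (const void* data, size_t len);   /* proposed mid-size extension */
uint64_t hash_long (const void* data, size_t len);   /* vectorized "long" mode      */

#define MIDSIZE_MAX 384   /* the threshold under discussion (see below) */

uint64_t xxh3_dispatch_sketch(const void* data, size_t len)
{
    if (len <= 128)         return hash_short(data, len);
    if (len <= MIDSIZE_MAX) return hash_mid(data, len);   /* extends the cheap path */
    return hash_long(data, len);   /* startup cost, recouped on large inputs */
}
```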
> does Wasm generate 64-bit executables now?
My understanding is that Wasm has native 64-bit types (see https://github.com/WebAssembly/design/blob/master/Semantics.md#64-bit-integer-operators) that probably map almost 1:1 to hardware instructions, but "pointer sizes" are 32-bit (see https://github.com/WebAssembly/design/blob/master/Semantics.md#addressing), so in that respect it's a 32-bit platform. But I think the 64-bit operations should be "fast" when running on an actual 64-bit CPU.
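A tiny probe illustrates both sides of this (assuming an Emscripten-style wasm32 build, where it should report 4-byte pointers next to 8-byte 64-bit integers):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 64-bit integer math maps to native i64 instructions in Wasm... */
    uint64_t x = 0x9E3779B185EBCA87ULL;
    uint64_t y = x * 0xC2B2AE3D27D4EB4FULL;   /* a single i64.mul */

    /* ...while the address space (and therefore pointers/size_t) is 32-bit. */
    printf("sizeof(void*)=%u sizeof(uint64_t)=%u y=%016llx\n",
           (unsigned)sizeof(void*), (unsigned)sizeof(uint64_t),
           (unsigned long long)y);
    return 0;
}
```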
So it would be somewhat equivalent to the x32 mode in Linux?

Interesting, that's a subtle difference. And indeed, it makes wasm much more 64-bit friendly.
By the way, it seems that even SSE2 is available to Wasm: https://emscripten.org/docs/porting/simd.html

> To enable SIMD, pass the -msimd128 flag at compile time
Oooh, turns out Xcode by default uses the -Os optimization level when compiling for iOS, and I did not notice that. The inner loop inside XXH3_accumulate_512 was not being unrolled. Compiling with -O2 makes a much better picture:

Sorry for the confusion! The MSVC and WebAssembly points still stand, since they are compiled with /O2 and -O2 respectively.
> The MSVC and WebAssembly points still stand, since they are compiled with /O2 and -O2 respectively.

Does it sound reasonable to attempt enabling the SIMD flag for WebAssembly?
> Does it sound reasonable to attempt enabling the SIMD flag for WebAssembly?

Maybe, but without some additional work it won't do magic, since it requires the code to not use SSE2 or NEON intrinsics, but rather Clang's special "vector types" extension (which looks very much like shader languages -- e.g. float4 etc.). I'm not sure what the browser support for that SIMD feature is either, since today it still looks to be "experimental". Maybe I'll play around with it someday soon, stay tuned!
Wow, good point: it's not about enabling Intel's intrinsics at all! Obvious, in retrospect...

Hence, it completely relies on the scalar code path (or on a new one written specifically for the clang vector extension).
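For reference, such a clang-vector-extension path could look roughly like the sketch below; the types and the mixing step are illustrative only, not the actual xxh3 kernel:

```c
/* Illustrative only: one 128-bit lane of an accumulation step written with
 * clang/GCC "vector extension" types instead of SSE2 or NEON intrinsics.
 * The compiler lowers these operations to whatever SIMD the target offers
 * (SSE2, NEON, or wasm SIMD once -msimd128 is usable). */
typedef unsigned long long u64x2 __attribute__((vector_size(16)));
typedef unsigned int       u32x4 __attribute__((vector_size(16)));

static void accumulate_lane(u64x2* acc, u32x4 input, u32x4 key)
{
    u32x4 const mixed = input ^ key;            /* element-wise xor        */
    u64x2 const prod  = {                       /* widening 32x32 products */
        (unsigned long long)mixed[0] * mixed[1],
        (unsigned long long)mixed[2] * mixed[3]
    };
    *acc += prod;                               /* element-wise add        */
}
```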
There is an interesting note about auto-vectorization:

> This will also turn on LLVM's autovectorization passes, so no source modifications are necessary to benefit from SIMD.

I mention it because I remember being impressed by LLVM auto-vectorization with recent versions of XXH3 (dev branch). It's not as good as the SSE2 code path, but it still gets a hefty performance boost.
| XXH_VECTOR | Apple llvm 10.0 | gcc v8.3.0 |
|---|---|---|
| 0 "scalar" | 24 GB/s | 11 GB/s |
| 1 "sse2" | 30 GB/s | 27 GB/s |
| 2 "avx2" | 55 GB/s | 45 GB/s |
Notice the huge speed difference for the scalar path, which I interpret as "llvm was able to auto-vectorize the scalar path".
Updated SSE2 detection macro in dev branch for Visual, following your suggestion.
As suggested by your graph, it might be preferable to extend the "short mode" to longer input lengths. In the midsize feature branch, I'm trying to find an acceptable threshold for this extension.

It seems there won't be a "good size for everybody". The "ideal" threshold varies a lot depending on target and compiler version. For the time being, this feature branch settles on 384 bytes. It might be a bit high, since I can measure a loss of performance in the 250-384 range on my macOS laptop when using the default compiler. The situation is different though when using gcc on the same laptop. And I guess different CPUs or compilers can get different results.

Anyway, I would appreciate any feedback on this topic, should you feel that the threshold would be better positioned at some different value.
It really does depend on the compiler. Clang excels at this code, but I found both MSVC and GCC struggling to generate decent code. IDK why, it is pretty damn obvious what code we want... :confused:

I have considered doing a manual partial unroll like we do for XXH32/64. Two iterations per branch seems to be beneficial on most platforms, and anything larger seems to be inconsistent. x86, however, seems to prefer the entire thing to be unrolled. :thinking:
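As a sketch, a 2x partial unroll could look roughly like this (the stripe size and helper name are stand-ins, not the real xxh3.h symbols):

```c
#include <stddef.h>
#include <stdint.h>

#define STRIPE_LEN 64   /* stand-in for the real stripe size */

/* stand-in for the real per-stripe accumulation routine */
void accumulate_512(uint64_t* acc, const uint8_t* input, const uint8_t* secret);

static void accumulate_block(uint64_t* acc, const uint8_t* input,
                             const uint8_t* secret, size_t nbStripes)
{
    size_t n = 0;
    /* two stripes per iteration: enough to keep the pipeline busy
     * without blowing up code size */
    for (; n + 2 <= nbStripes; n += 2) {
        accumulate_512(acc, input + (n + 0) * STRIPE_LEN, secret + (n + 0) * 8);
        accumulate_512(acc, input + (n + 1) * STRIPE_LEN, secret + (n + 1) * 8);
    }
    if (n < nbStripes)   /* odd stripe count: handle the last one separately */
        accumulate_512(acc, input + n * STRIPE_LEN, secret + n * 8);
}
```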
I'll try to get back to work on this next week or so. I still have a few classes to finish up.
As for the threshold, I think there is nothing wrong with bumping it up a bit.
OK, so I've been running a long series of performance tests over the weekend, to pin down a more precise "threshold" between the short and long modes.

With gcc v8.3.0, the tested 384-byte threshold feels relatively good: it falls right in the section where the performance of the short and long modes overlaps (it's not a straight line, there is no single point where they become equal).

For clang v8.0, 384 is too large though. Speed losses (when using the "short" mode) start before reaching 200 bytes, and only increase afterwards.

For Visual, 384 is way too small! That's because the performance cliff on reaching the "long mode" (using SSE2) is extreme: there is a 4x factor when moving from "short" to "long" mode! I must say I wasn't expecting such a huge impact, since I mostly use gcc and clang on a daily basis. Visual needs almost 1 KB of input before the short and long mode curves start to overlap.

So, based on these figures, I'm currently considering updating the threshold to something like 288 bytes, leaving only a small section where performance is detrimental for both gcc and clang, in opposite directions, and where the difference is "reasonable enough".

For Visual, compared to the current threshold (128 bytes), this will only help a little, by delaying the point at which the performance cliff happens, and making it smaller (<2x). But it will still be there.
I suspect that Visual requires a dedicated investigation, in order to tame, or solve, the performance cliff observed on reaching the SSE2 mode.
To give an idea of the performance gap between compilers: at 300-byte inputs, presuming the long mode (as now) using SSE2, Visual generates ~26M hashes/sec (throughput, fixed size), gcc-8 is at ~42M hashes/sec, and clang v8 reaches ~58M hashes/sec. So it's a pretty large difference.

Note: in contrast, performance differences for the "short" mode are much smaller: still presuming 300-byte inputs, throughput tests put gcc-8 at ~49 MH/s, while both Visual and clang are at ~46 MH/s. clang is way better for vector code, and the hope is that this is the long-term tendency: compilers will improve and, sometime in the future, eventually become as good as clang is today.
Great investigation! BTW which MSVC version did you use for the tests? My tests were on the VS2017 15.9 version; looking at release notes they supposedly improved SIMD-related optimizations quite a bit in VS2019 (but I haven't checked that myself).
And yeah 384 bytes cutoff for long mode sounds like a good balance.
I was using Visual Studio 2017, though I did not check the exact version.

I think I have another VM with Visual Studio 2019, so I could check. The problem with this VM is that it's unsuitable for precise benchmarks, due to environmental noise, but it should be decent enough to evaluate whether vector performance has improved or not.
edit: from what I can tell, Visual 2019 doesn't help much with the performance impact of switching to SSE2 mode. I find the same roughly 4x impact when switching from "short" to "long" mode, using the current cut-off threshold (128-129 bytes).
I've found the MS blog where they announce vector code optimizations for VS2019: https://devblogs.microsoft.com/cppblog/game-performance-and-compilation-time-improvements-in-visual-studio-2019/
I couldn't compare VS2017 vs VS2019 directly. It might be that 2019 is indeed better. It's just still pretty bad in terms of startup time. The "steady" performance is decent though (22 GB/s, vs 26 GB/s for gcc and 28 GB/s for clang).
Hi @aras-p, what would you recommend for testing Wasm performance?
For information, in the dev branch, the code has been updated in preparation for the upcoming streaming mode. This step was required to allow users to provide a "secret" of any kind, to alter the hash result.

A side effect of this change is that speed on Visual Studio 2019 has greatly increased when enabling AVX2 code generation. On my laptop, it now reaches ~50 GB/s, which is >2x faster than the SSE2 version, for some reason. The SSE2 path doesn't benefit as much, gaining less than 10%, maybe because it was already in decent shape.

I still have to integrate the "midsize" patch. I must first ensure this extension doesn't introduce poor-quality hashes on the affected range.
> what would you recommend for testing Wasm performance?

My little testbed is here: https://github.com/aras-p/HashFunctionsTest -- I can try the latest dev branch and see whether anything has changed on Wasm.
The scalar code has not changed much since the last test, so results are expected to be similar.

One objective would be to introduce the -msimd128 flag somewhere in the build chain, to see if the clang compiler is able to auto-vectorize the scalar code path in wasm, the same way it can auto-vectorize native x64 assembly.
Hmm interesting, on latest dev (ac8c5b2) it's mostly the same, then starts to be slightly slower past 128 bytes and up to 1 KB, and then there's a perf drop beyond 1 KB input size, compared to the previous version. I see this in Wasm both on Chrome 75 and Firefox 67.
I tried using the -msimd128 argument for Emscripten, but eventually the result has a "SIMD is used, but not supported in WASM mode yet" error in it. I'm using the latest stable toolchain, as far as I can tell, so maybe the whole SIMD ordeal is still "too early" at this point on Wasm.
The drop at 1 KB is really strange, and really strong. The one thing that comes to mind is that it may correspond to the "scrambler" stage, which happens every KB when using all default parameters. This stage, while important to ensure good mixing of the input, is supposed to be insignificant in terms of CPU usage. It shouldn't be visible in the graph. It's unclear why it nonetheless shows up. It must have appeared with a recent modification, since it wasn't the case in previous measurements. I'll investigate.
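To make the suspicion concrete, the long-mode structure being described is roughly the following; names and constants are placeholders, the point is just that a scramble step runs once per ~1 KB block:

```c
#include <stddef.h>
#include <stdint.h>

#define STRIPE_LEN        64
#define STRIPES_PER_BLOCK 16   /* 16 * 64 = 1024 bytes per block */

/* stand-ins for the real routines */
void accumulate_512(uint64_t* acc, const uint8_t* input, const uint8_t* secret);
void scramble_acc  (uint64_t* acc, const uint8_t* secret);

static void hash_long_sketch(uint64_t* acc, const uint8_t* input,
                             const uint8_t* secret, size_t nbBlocks)
{
    for (size_t b = 0; b < nbBlocks; b++) {
        for (size_t s = 0; s < STRIPES_PER_BLOCK; s++)
            accumulate_512(acc, input + b * 1024 + s * STRIPE_LEN, secret + s * 8);
        /* runs once per KB; supposed to be negligible, which is why a visible
         * drop at 1 KB in the Wasm build is surprising */
        scramble_acc(acc, secret);
    }
}
```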
So, for information, I noticed a strange behavior, explained in code comments at this line: https://github.com/Cyan4973/xxHash/blob/dev/xxh3.h#L688

It's one situation where it's difficult to have it all. clang does a great job at auto-vectorizing SSE2 (26 GB/s) if this function is static, but completely fails if it is FORCE_INLINE (13 GB/s). On the other hand, if AVX2 is available, clang will auto-vectorize very well (45 GB/s) if this function is inlined, while if it's not, it will basically provide the same result as SSE2 (26 GB/s).

On another front, Visual Studio clearly has a preference for FORCE_INLINE, featuring a big benefit for AVX2 (45 GB/s), and a more minor benefit for SSE2 (24 GB/s).

I haven't been able to test Wasm performance (both systems where I installed emscripten failed, for different environmental reasons). But I suspect the strange drop at 1 KB might be related, either to the same function, or possibly to inlining or not inlining this other one.
Life would have been too simple if there was one setting good for everyone...
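A minimal sketch of the dilemma, using made-up names; the point is only that the same function body, declared plain static or force-inlined, gets vectorized differently depending on compiler and target:

```c
#include <stdint.h>
#include <string.h>

#if defined(_MSC_VER)
#  define FORCE_INLINE static __forceinline
#else
#  define FORCE_INLINE static inline __attribute__((always_inline))
#endif

/* Toggle to experiment: per the observations above, clang+SSE2 auto-vectorizes
 * the plain-static form better, while MSVC and clang+AVX2 prefer force-inlining. */
#if defined(PREFER_FORCE_INLINE)
FORCE_INLINE
#else
static
#endif
void accumulate_sketch(uint64_t acc[8], const uint8_t* input, const uint8_t* secret)
{
    for (size_t i = 0; i < 8; i++) {
        uint64_t v, s;
        memcpy(&v, input  + 8 * i, sizeof v);   /* unaligned-safe loads */
        memcpy(&s, secret + 8 * i, sizeof s);
        acc[i] += (v ^ s) * ((uint32_t)(v ^ s) | 1);   /* placeholder mixing step */
    }
}
```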
For Wasm, inlining vs static vs non-inlining does not seem to affect things much there. What does seem to help, as in no more performance dip for >1 KB data sizes, is removing this from XXH3_accumulate, i.e. no longer forcing the unroll:
#if defined(__clang__) && !defined(__OPTIMIZE_SIZE__) && !defined(__ARM_ARCH)
# pragma clang loop unroll(enable)
#endif
That makes sense, less code to translate.

OK, I guess we could special-case Wasm, and remove the unrolling statement in this case. It's unexpected though that this is the reason, since this unrolling statement has been there for a while now, was already present in v0.7.0, and it was not producing such a big drop at 1 KB back then. Why it would now be a problem is mysterious. Maybe some combination of FORCE_INLINE and unrolling?...
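For reference, the special case could be as small as adding one more exclusion to the existing guard, placed just before the accumulate loop as before (a sketch only; either __wasm__ or __EMSCRIPTEN__ could serve as the key):

```c
#if defined(__clang__) && !defined(__OPTIMIZE_SIZE__) \
    && !defined(__ARM_ARCH) && !defined(__EMSCRIPTEN__)
#  pragma clang loop unroll(enable)
#endif
```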
Strangely enough, for wasm, I'm unable to observe either the performance drop at 129 bytes (well, there is one, but it is small) or the drop at 1 KB (almost non-existent) on my test platform. Not sure what's happening, maybe some difference in exact tooling?...

I will nonetheless proceed with extending the "short mode" to longer lengths, since the performance drop at 129 bytes is clearly visible on native code.
With the latest updates of the 128-bit variant merged, XXH3 is more or less ready for a new release. It integrates the mid-size strategy discussed here, to reduce the performance gap after 128 bytes. Streaming mode for long inputs is available, for both the 64- and 128-bit variants. Performance of both variants should follow a roughly similar shape, with the 64-bit one being generally ~15% faster.

If you wish @aras-p, you may be interested in benchmarking this latest version (in the dev branch) before it becomes a release, so that it gets a chance to incorporate potential feedback.
Ok so here's a comparison on various platforms/compilers; I've only tested the 128-bit variant since that's what I'm interested in :) In all the graphs I've included the previous xxHash dev branch variant from June 6 (0117f18), as well as the current dev branch from July 26 (f012d4e). The x64 PC tests also include Meow hash (the one that uses AES-NI instructions), and as expected it runs circles around other general-purpose hashes, going up to 35 GB/s or so. My little testbed project is at the same place as before: https://github.com/aras-p/HashFunctionsTest
MacBookPro 2018 (Core i9 2.9GHz), Xcode 10.2.1, x64. The perf dip moved to larger input sizes, and at larger sizes the new XXH3 seems to be a bit slower than before.
PC (AMD TR 1950X 3.4GHz), Visual Studio 2017 (15.9), x64. Similar to Mac, the perf dip moved to larger sizes, and the new XXH3 seems to be a bit slower than before, up until even larger sizes where it becomes faster.
Same as above, just with Visual Studio 2019 (16.2). XXH3 results are improved a bit across whole range of input sizes, I suspect due to SIMD optimizer codegen improvements in 2019. General performance profile similar.
iPhone/ARM (Apple A9), ARM64. The perf dip at midsizes is gone, but at larger input sizes hashing is slower than before.
Now for 32-bit-ish platforms. WebAssembly, on 2018 MacBookPro, Chrome 75. General perf profile similar, hashing slightly slower at larger input sizes.
PC (AMD TR 1950X 3.4GHz), Visual Studio 2017 (15.9), x86 i.e. 32-bit. Something in XXH3 has changed that makes it much faster now at larger sizes, nice!
Same as above, just with Visual Studio 2019 (16.2). Very similar results.
Thanks for this very thorough and insightful analysis @aras-p.

Some thoughts:
Speed of new version: I expect the new XXH128 to be ~10% slower than the previous one. This is the consequence of altering the design to answer a grievance about the previous version, namely, that a combination of 64-bit mixers doesn't make a 128-bit hash. While I would agree if the goal was a cryptographic-level hash, I'm not convinced it's as important for a fast hash merely offering good dispersion qualities, because it takes a very coordinated scenario to degrade the 128-bit hash into a 64-bit one. Yet, there is some polarization on the topic, and I don't want to waste energy on it. Therefore, now, every input impacts 128 bits of accumulator immediately. This translates into some additional runtime cost, which feels reasonable, but is still a cost. I'm also surprised that it can sometimes be faster than the previous version, but maybe something is better "streamlined" in the new version.
Performance dip: The 128-bit final mixer has a larger fixed cost than the 64-bit one, as expected. Therefore, it takes more time to dilute that cost, meaning the mid-size performance dip is larger. But the new mid-size mode, introduced for XXH3_64bit() and re-employed for XXH3_128bit(), still uses the same size threshold as the 64-bit variant. The goal was to make it compatible with the existing XXH3_state_t object used in streaming mode, and to share most of the streaming code. This can be discussed. If this is considered a sub-optimal choice, it could be updated.
Comparison with AES-based algorithm: SSE2 is not enough to face special-purpose hardware. XXH3 is designed to leverage AVX2, when available, in order to reach similar performance territory. For example, on my laptop, it reaches ~45 GB/s on "large" data. Yet, while SSE2 is a guaranteed capability of any x64 CPU, later SIMD versions are not. I suspect that's the reason why AVX2 wasn't benchmarked.
Wasm performance: yep, we still have the issue that, at its core, XXH3 is a vectorizable algorithm using 32-bit operations, and it's comparatively less efficient when facing a 64-bit instruction set without vectorization capability. This generally does not happen (x64 necessarily has SSE2, arm64 necessarily has NEON, etc.), but wasm is one of those cases where it does. I'm afraid I don't have a great solution for that yet...
Visual Studio x86 performance: I presume your new detection macro makes it possible for Visual to pick up the SSE2 mode; that's likely the reason for this great speed-up :) Thanks!
Yup, it all makes sense, thanks for the detailed explanation! And indeed the comparison with Meow hash is not fair, especially since I haven't compiled XXH3 with AVX2 either.
Hmm, so I was checking the wasm assembly.

For starters, let's force the manual 128-bit multiply on wasm. Clang defines __uint128_t on wasm, but silently emits a call to __multi3. This not only has the overhead of a function call, but it also results in two redundant 64->64 multiplies by zero. While this only wastes a few cycles on 64-bit targets, you are probably looking at 25+ cycles per multiply on a 32-bit browser.
Obviously, benchmarks will tell the truth. @aras-p, can you try adding !defined(__wasm__) && to the __SIZEOF_INT128__ check and run the bench again?

Also, try disabling the unroll pragma; __wasm__ should be checked there as well.
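In other words, the suggested guard amounts to roughly the following (a sketch with placeholder names, not the exact xxh3.h code; the fallback is the usual 32x32->64 schoolbook multiply):

```c
#include <stdint.h>

/* Avoid the native 128-bit path on Wasm, where __uint128_t silently
 * lowers to a __multi3 library call. */
static uint64_t mul128_fold64_sketch(uint64_t lhs, uint64_t rhs)
{
#if defined(__SIZEOF_INT128__) && !defined(__wasm__)
    __uint128_t const product = (__uint128_t)lhs * (__uint128_t)rhs;
    return (uint64_t)product ^ (uint64_t)(product >> 64);
#else
    /* portable 32x32->64 decomposition: four small multiplies, no __multi3 */
    uint64_t const lo_lo = (lhs & 0xFFFFFFFF) * (rhs & 0xFFFFFFFF);
    uint64_t const hi_lo = (lhs >> 32)        * (rhs & 0xFFFFFFFF);
    uint64_t const lo_hi = (lhs & 0xFFFFFFFF) * (rhs >> 32);
    uint64_t const hi_hi = (lhs >> 32)        * (rhs >> 32);
    uint64_t const cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi;
    uint64_t const upper = (hi_lo >> 32) + (cross >> 32)        + hi_hi;
    uint64_t const lower = (cross << 32) | (lo_lo & 0xFFFFFFFF);
    return lower ^ upper;
#endif
}
```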
> can you try adding !defined(__wasm__) && to the __SIZEOF_INT128__ check and run the bench again?
I've tested this suggestion, and it seems to work well, at least on my laptop.
Performance seems improved by ~10%.
This improvement is only valid for small inputs, which use the 128-bit multiply, from size 9 to 240.
That's still welcome. I can certainly add it to xxh3.h.

update: done
> try disabling the unroll pragma. __wasm__ should be checked as well.

It's supposed to be done already in this version: https://github.com/Cyan4973/xxHash/blob/dev/xxh3.h#L763, though the check uses the __EMSCRIPTEN__ build macro.
I guess this topic can be closed now, since the introduction of the "middle-range" algorithm for len <= 240 has improved the situation. Of course, it can be re-opened if need be.
I was playing around with XXH3 from the 0.7.0 release, and there's a curious performance drop with data sizes >128 bytes on Arm64 and on WebAssembly, but not on x64.

Here's x64 (Xcode 10.1 Clang, on a MacBook Pro), things look great:

However, on Arm64 (iPhone SE, compiled with Xcode 10.1 Clang) -- notice the big drop after 128 bytes input size:

On WebAssembly (Firefox 67, running on the same MacBook Pro), there's a drop after 128 bytes too, where XXH3 becomes much slower than previous xxHash: