Closed pegleGrot closed 5 years ago
Thanks for the notification, @pegleGrot. I've removed the problematic binaries.
In the future, I'll need a way to test produced binaries to ensure they work properly beyond my local Windows VM.
Once we get my dispatcher patch merged in, we can make it so that XXH3 automagically (or manually) switches between scalar, SSE2, and AVX2 when needed, all in one binary.
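For anyone wondering what "all in one binary" could look like, here is a rough sketch of runtime dispatch through a function pointer. It is not the code from the dispatcher patch: `hash_long`, `select_kernel`, `detect_target`, and the three `kernel_*` functions are made-up names, and the kernels are toy placeholders that all compute the same value. In a real build each kernel would live in its own translation unit compiled with its own flags. Detection assumes GCC or Clang on x86 (`__builtin_cpu_supports`); anything else falls back to scalar.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel signature, for illustration only. */
typedef uint64_t (*hash_long_fn)(const void *data, size_t len);

typedef enum { TARGET_SCALAR = 0, TARGET_SSE2 = 1, TARGET_AVX2 = 2 } cpu_target;

/* Toy stand-in for the real work so the sketch is self-contained. */
static uint64_t toy_hash(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint64_t acc = 0;
    size_t i;
    for (i = 0; i < len; i++)
        acc = acc * 31 + p[i];
    return acc;
}

/* Placeholders: real kernels would be built separately (e.g. with AVX2 enabled). */
static uint64_t kernel_scalar(const void *d, size_t n) { return toy_hash(d, n); }
static uint64_t kernel_sse2(const void *d, size_t n)   { return toy_hash(d, n); }
static uint64_t kernel_avx2(const void *d, size_t n)   { return toy_hash(d, n); }

/* Ask the CPU what it supports; scalar is the safe default. */
static cpu_target detect_target(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return TARGET_AVX2;
    if (__builtin_cpu_supports("sse2")) return TARGET_SSE2;
#endif
    return TARGET_SCALAR;
}

static hash_long_fn select_kernel(cpu_target t)
{
    switch (t) {
    case TARGET_AVX2: return kernel_avx2;
    case TARGET_SSE2: return kernel_sse2;
    default:          return kernel_scalar;
    }
}

/* Public entry point: detect once, then reuse the cached pointer.
 * Not thread-safe as written; a real dispatcher would need to handle that. */
uint64_t hash_long(const void *data, size_t len)
{
    static hash_long_fn cached = NULL;
    if (cached == NULL)
        cached = select_kernel(detect_target());
    return cached(data, len);
}
```

The point is that the AVX2 code path is only ever called after the check passes, so the same executable can run on a CPU without AVX2.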
We don't want to have two binaries for different microarchitectures in the full release because...
#UD
which seems like a glitch instead of an incompatibility. Building the MSYS2 build statically seems like the simplest solution.
As for the Clang build, maybe it does work properly when there is AVX2 available? I don't have a good theory for why it both doesn't crash and somehow exits gracefully without doing the actual work. Someone with a suitable CPU should test it, the newest I have here is Ivy Bridge.
@easyaspi314, are people in the normal world the target audience?
Yes, as of right now there is no dispatching in the dev or the 0.7.0 branch. If you try to run the AVX2 build, it expects you to have Haswell, and it is currently UB to run it on an unsupported device.
And I totally hear you about target audiences. My best devices are a 2011 MacBook Pro with a dead GPU (Sandy Bridge) and an LG G3 from 2014 (Snapdragon 801/Cortex-A15).
If you would like to test the dispatching code, try my branch:
git clone -b multitarget https://github.com/easyaspi314/xxhash xxhash-multi
cd xxhash-multi
make MULTI_TARGET=1
If you want to compile with MSVC, try these commands:
cl.exe -O2 -c -DXXH_MULTI_TARGET=1 xxhash.c
cl.exe -O2 -c -DXXH_MULTI_TARGET=1 xxhsum.c
cl.exe -O2 -c -DXXH_MULTI_TARGET=1 -Foxxh3-scalar.obj xxh3-target.c -DXXH_VECTOR=0
cl.exe -O2 -c -DXXH_MULTI_TARGET=1 -Foxxh3-sse2.obj xxh3-target.c -DXXH_VECTOR=1 -arch:SSE2
cl.exe -O2 -c -DXXH_MULTI_TARGET=1 -Foxxh3-avx2.obj xxh3-target.c -DXXH_VECTOR=2 -arch:AVX2
cl.exe -O2 xxhsum.obj xxhash.obj xxh3-scalar.obj xxh3-sse2.obj xxh3-avx2.obj
I was looking for reference Win builds because of unexpected low speed for XXH64 x86.
As suspected, it turns out it's MSVC (I only tried a few combinations with /O2; maybe solvable with more tinkering?). On Ivy Bridge, a GCC8/MinGW build from easyaspi314 (thanks) is more than twice as quick, which is in line with Cyan4973's GCC4.8 results.
With the exception of XXH64/x86, the other hash/arch combos (XXH64 x64, XXH32 x86/x64) perform practically the same on both compilers.
Curiously, Cyan4973's GCC4.8 results show XXH32 x64 as quicker than x86. I don't see that, both with VC19 and easyaspi's GCC8. Also Cyan4973's XXH64/x64 numbers, normalized, are quicker.
@easyaspi314, your multi-target changes work okay. Tried it on Ivy Bridge (SSE2), late-era Athlon 64 (SSE2), and Pentium 3 (an older build of yours used __XXH3_HASH_LONG_SCALAR, a local build from newer code showed results for both the unnamed XXH3 variant, and "Scalar").
XXH3 is still in flux so it probably doesn't matter, but a small problem: in xxh3.h, VC warns (C4244) about a narrowing conversion in the lines (642-643) that assign XXH_mult32to64() to U64.
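I can't tell from the warning alone which operand MSVC objects to, so take this as a sketch of the general pattern rather than the actual xxh3.h code. It assumes the 64-bit values are being fed into a 32x32->64 multiply helper, and makes the truncation explicit with casts, which is the usual way to quiet C4244. The function name `mult32to64` is a stand-in, not the real macro.

```c
#include <stdint.h>

/* Multiply the low 32 bits of each operand and return the full 64-bit product.
 * The (uint32_t) casts state the intended truncation explicitly, so the
 * compiler has no implicit narrowing to warn about. */
static uint64_t mult32to64(uint64_t a, uint64_t b)
{
    return (uint64_t)(uint32_t)a * (uint64_t)(uint32_t)b;
}
```

Any bits above bit 31 in either input are dropped on purpose, e.g. `mult32to64(0x100000002ULL, 3)` only sees the `2`.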
Oh sweet, the Pentium III code works! I was trying to test it but I couldn't find the cord to my ancient laptop. Thanks for testing that. Hooray for incredibly excessive backwards compatibility!
I am currently trying to figure out what is going on with the performance on MSVC. I managed to make things faster with some tweaks such as using a temp buffer; however, there is a notable slowdown with both the MSVC codegen and, oddly, clang-cl when compared to the same version of Clang for MinGW, WSL, or Cygwin.
Check out my notes in the PR and my recent commit messages.
If you didn't already, git pull my latest changes, which also add CMake support for dispatching (cmake -DMULTI_TARGET=1) and fix a lot of issues with Windows support.
However, I am not giving MSVC an ounce of respect until it gets at least 10 GB/s as it is literally intrinsics.
Depending on the CPU it can be faster or slower. For example, older Intel chips get terrible XXH32 performance because of the slow imul instruction (on Prescott, imul takes more cycles than pmuludq for no good reason).
Additionally, the target CPU changes a lot of things with Clang because Intel can't pick a fast rotate instruction and stick with it. Older chips prefer rol, Sandy and Ivy prefer shld, and Haswell prefers rorx. Also, Clang tends to mess up the XXH32 and XXH64 loops by adding extra register swaps for x86/x86_64. Especially since the normal XXH32 implementation uses the exact same instructions on x86 and x86_64:
imul ebx, dword ptr [edi], -2048144777
add eax, ebx
rol eax, 13
imul eax, eax, -1640531535
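For reference, those four instructions correspond to one XXH32 round. A minimal C version, with function names of my own choosing, looks roughly like this. The two constants are the XXH32 primes; -2048144777 and -1640531535 in the assembly are the same bit patterns printed as signed 32-bit integers.

```c
#include <stdint.h>

#define PRIME32_1 0x9E3779B1U  /* shows up as -1640531535 in the assembly */
#define PRIME32_2 0x85EBCA77U  /* shows up as -2048144777 in the assembly */

/* Rotate left; r must be in 1..31. Compilers usually turn this into a
 * single rotate instruction (rol, or shld/rorx depending on the target). */
static uint32_t rotl32(uint32_t x, unsigned r)
{
    return (x << r) | (x >> (32 - r));
}

/* One XXH32 round: multiply the input lane, accumulate, rotate, multiply. */
static uint32_t xxh32_round(uint32_t acc, uint32_t input)
{
    acc += input * PRIME32_2;   /* imul + add */
    acc  = rotl32(acc, 13);     /* rol 13     */
    acc *= PRIME32_1;           /* imul       */
    return acc;
}
```

Everything here is 32-bit arithmetic, which is why the generated code is the same on x86 and x86_64.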
Meanwhile XXH64 needs something different and each iteration requires two emulated 64-bit multiplies on 32-bit.
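To give an idea of what "emulated" means here, this is roughly what a 64x64->64 multiply expands to when only 32-bit multiplies are available. It is a sketch of the arithmetic, not what any particular compiler emits, and `mul64_via32` is a name I made up.

```c
#include <stdint.h>

/* Low 64 bits of a * b, built from 32-bit halves.
 * a * b mod 2^64 = a_lo*b_lo + ((a_lo*b_hi + a_hi*b_lo) << 32);
 * the a_hi*b_hi term lands entirely above bit 63 and is dropped. */
static uint64_t mul64_via32(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t lo    = (uint64_t)a_lo * b_lo;        /* one widening 32x32->64 multiply */
    uint32_t cross = a_lo * b_hi + a_hi * b_lo;    /* two 32-bit multiplies, low bits only */

    return lo + ((uint64_t)cross << 32);
}
```

That is three multiplies plus adds and shifts for every 64-bit multiply, compared with a single imul on x86_64.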
Depending on the chip, XXH64 can be made faster with different SSE2/SSE4/NEON routines for 32-bit (including a really fast SSE4 one for Nehalem to Ivy Bridge ONLY because of the temporary god mode pmulld before the 5 extra cycle nerf in Haswell), but...
vld2.32)
if (32-bit and SSE4 and between Nehalem and Ivy Bridge)
    XXH64_32_SSE4_Nehalem();
else if (32-bit and SSE2)
    XXH64_32_SSE2();
else if (32-bit and NEON)
    XXH64_NEON32();
else
    XXH64_Scalar();
No need to worry about ARM in the x86 builds. :)
If you need portable x86 features detector, you may look at https://github.com/Bulat-Ziganshin/FARSH/blob/master/benchmark/CpuID.h
I built win32/64 binaries with mingw-gcc from the regular v0.7.0 release, without msys*dll dependencies (except msvcrt.dll). I tested them for compliance and they work fine. Use them if you want. In my tests they are slightly faster than those from newer compilers.
v0.7.1 includes new Windows binaries, which were compiled with -static to avoid any dependency. Tested in a standard cmd shell.
Thanks Cyan and tansy.
The two released Windows xxhsum binaries don't work, in different ways.
The GCC SSE2 build depends on external DLLs and can't run by default. Adding msys-2.0.dll and other related DLLs results in "The application was unable to start correctly (0xc000007b)." Maybe a version mismatch?
The Clang AVX2 build does nominally run but shows no hash or benchmark results on a non-AVX2 CPU (no AVX2 hardware here). Running with -b shows the following and exits. It doesn't crash, so it's unclear if the behavior is due to missing AVX2.
Official binaries would be useful to have as a speed reference.