Cyan4973 / xxHash

Extremely fast non-cryptographic hash algorithm
http://www.xxhash.com/

v0.7.0 released Windows binaries don't work #191

Closed by pegleGrot 5 years ago

pegleGrot commented 5 years ago

The two released Windows xxhsum binaries don't work, in different ways.

The GCC SSE2 build depends on external DLLs and can't run by default. Adding msys-2.0.dll and other related DLLs results in "The application was unable to start correctly (0xc000007b)." Maybe a version mismatch?

The Clang AVX2 build nominally runs, but on my CPU, which lacks AVX2, it shows no hash or benchmark results. Running with -b prints only the following banner and exits. It doesn't crash, so it's unclear whether the behavior is due to the missing AVX2 support.

xxhsum 0.7.0 (64-bits x86_64 + AVX2 little endian), Clang 7.0.1 (tags/RELEASE_701/final), by Yann Collet

Official binaries would be useful to have as a speed reference.

Cyan4973 commented 5 years ago

Thanks for the notification, @pegleGrot. I've removed the problematic binaries.

In the future, I'll need a way to test produced binaries to ensure they work properly beyond my local Windows VM.

easyaspi314 commented 5 years ago

Once we get my dispatcher patch merged in, we can make it so that XXH3 automagically (or manually) switches between scalar, SSE2, and AVX2 when needed, all in one binary.
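
As a rough illustration of the idea (not the code from the dispatcher patch), a single binary can probe the CPU once at startup and route calls through a function pointer. The sketch below assumes GCC or Clang on x86, where `__builtin_cpu_supports` is available. The names `hash_fn`, `hash_impl`, `select_impl`, and the three `hash_*` variants are hypothetical, and their bodies are placeholders rather than real XXH3 kernels.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature shared by every variant; not xxHash's real API. */
typedef uint64_t (*hash_fn)(const void *data, size_t len);

/* Stand-in body shared by all variants so the sketch stays short.
 * A real build would compile each variant in its own translation unit
 * with the matching target flags (e.g. -msse2, -mavx2). */
static uint64_t placeholder_hash(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint64_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = h * 31 + p[i];
    return h;
}

static uint64_t hash_scalar(const void *d, size_t n) { return placeholder_hash(d, n); }
static uint64_t hash_sse2(const void *d, size_t n)   { return placeholder_hash(d, n); }
static uint64_t hash_avx2(const void *d, size_t n)   { return placeholder_hash(d, n); }

typedef struct {
    const char *name;  /* label for diagnostics */
    hash_fn     fn;    /* variant chosen for this CPU */
} hash_impl;

/* Pick the widest variant the running CPU reports support for. */
static hash_impl select_impl(void)
{
    hash_impl impl = { "scalar", hash_scalar };
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  /* GCC/Clang builtin; sets up the CPU feature data */
    if (__builtin_cpu_supports("avx2")) {
        impl.name = "avx2";
        impl.fn   = hash_avx2;
    } else if (__builtin_cpu_supports("sse2")) {
        impl.name = "sse2";
        impl.fn   = hash_sse2;
    }
#endif
    /* Other compilers or architectures keep the scalar default. */
    return impl;
}
```

A caller would invoke `select_impl()` once and keep the returned pointer, so an AVX2 code path is only ever reached on a CPU that reports AVX2.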

We don't want to have two binaries for different microarchitectures in the full release because...

  1. It makes things too complicated. Not only would you likely need a 32-bit and a 64-bit binary, but you would need AVX2 versions of them too. As a result, nobody will use the AVX2 version in production because the maintenance is too much.
  2. Trying to figure out whether your chip supports AVX2 or not is not as simple as it may seem. There is no user agent string, no easy way to get it from the settings, nothing. Unix is fairly easy with /proc/cpuinfo, but on Windows, AFAIK, you have to Google the processor model.
  3. Most of the people in the normal world are too inept to understand the difference between 32-bit and 64-bit, and will probably just choose one randomly if one isn't suggested via stuff like user agent sniffing.
  4. While a 64-bit binary on a 32-bit system will not run at all and will show a message, an AVX2 binary on a Sandy Bridge CPU will run until it randomly crashes with an illegal instruction (#UD), which looks like a glitch rather than an incompatibility.

pegleGrot commented 5 years ago

Building the MSYS2 build statically seems like the simplest solution.

As for the Clang build, maybe it does work properly when AVX2 is available? I don't have a good theory for why it doesn't crash and instead exits gracefully without doing the actual work. Someone with a suitable CPU should test it; the newest I have here is Ivy Bridge.

@easyaspi314, are people in the normal world the target audience?

easyaspi314 commented 5 years ago

Yes, as of right now there is no dispatching in either the dev or the 0.7.0 branch. The AVX2 build expects at least Haswell, and running it on an unsupported device is currently UB.

And I totally hear you about target audiences. My best devices are a 2011 MacBook Pro with a dead GPU (Sandy Bridge) and an LG G3 from 2014 (Snapdragon 801/Cortex-A15).

If you would like to test the dispatching code, try my branch:

git clone -b multitarget https://github.com/easyaspi314/xxhash xxhash-multi
cd xxhash-multi
make MULTI_TARGET=1

If you want to compile with MSVC, try these commands:

cl.exe -O2 -c -DXXH_MULTI_TARGET=1 xxhash.c
cl.exe -O2 -c -DXXH_MULTI_TARGET=1 xxhsum.c
cl.exe -O2 -c -DXXH_MULTI_TARGET=1 -Foxxh3-scalar.obj xxh3-target.c -DXXH_VECTOR=0
cl.exe -O2 -c -DXXH_MULTI_TARGET=1 -Foxxh3-sse2.obj xxh3-target.c -DXXH_VECTOR=1 -arch:SSE2
cl.exe -O2 -c -DXXH_MULTI_TARGET=1 -Foxxh3-avx2.obj xxh3-target.c -DXXH_VECTOR=2 -arch:AVX2
cl.exe -O2 xxhsum.obj xxhash.obj xxh3-scalar.obj xxh3-sse2.obj xxh3-avx2.obj

pegleGrot commented 5 years ago

I was looking for reference Windows builds because of unexpectedly low speed for XXH64 on x86.

As suspected, turns out it's MSVC (only tried a few combinations with /O2, maybe solvable with more tinkering?). On Ivy Bridge, a GCC8/MinGW build from easyaspi314 (thanks) is more than twice as quick, which is in line with Cyan4973's GCC4.8 results.

With the exception of XXH64/x86, the other hash/arch combos (XXH64 x64, XXH32 x86/x64) perform practically the same on both compilers.

Curiously, Cyan4973's GCC4.8 results show XXH32 x64 as quicker than x86. I don't see that, both with VC19 and easyaspi's GCC8. Also Cyan4973's XXH64/x64 numbers, normalized, are quicker.


@easyaspi314, your multi-target changes work okay. I tried them on Ivy Bridge (SSE2), a late-era Athlon 64 (SSE2), and a Pentium 3 (an older build of yours used __XXH3_HASH_LONG_SCALAR; a local build from newer code showed results for both the unnamed XXH3 variant and "Scalar").

XXH3 is still in flux so it probably doesn't matter, but a small problem: in xxh3.h VC warns (C4244) about a narrowing conversion in the lines (642-643) that assign XXH_mult32to64() to U64.

easyaspi314 commented 5 years ago

Oh sweet, the Pentium III code works! I was trying to test it but I couldn't find the cord to my ancient laptop. Thanks for testing that. Hooray for incredibly excessive backwards compatibility!

I am currently trying to figure out what is going on with the performance on MSVC. I managed to make things faster with some tweaks, such as using a temp buffer. However, there is a notable slowdown with both the MSVC codegen and, oddly, clang-cl when compared to the same version of Clang for MinGW, WSL, or Cygwin.

Check out my notes in the PR and my recent commit messages.

If you haven't already, git pull my latest changes, which also add CMake support for dispatching (cmake -DMULTI_TARGET=1) and fix a lot of issues with Windows support.

However, I am not giving MSVC an ounce of respect until it gets at least 10 GB/s as it is literally intrinsics.

Whether XXH32 is faster on x64 or on x86 depends on the CPU. For example, older Intel chips get terrible XXH32 performance because of the slow imul instruction (on Prescott, imul takes more cycles than pmuludq for no good reason).

Additionally, the target CPU changes a lot of things with Clang because Intel can't pick a fast rotate instruction and stick with it. Older chips prefer rol, Sandy and Ivy prefer shld, and Haswell prefers rorx. Also, Clang tends to mess up the XXH32 and XXH64 loops by adding extra register swaps for x86/x86_64. That is especially odd since the normal XXH32 implementation uses the exact same instructions on x86 and x86_64:

    imul    ebx, dword ptr [edi], -2048144777
    add     eax, ebx
    rol     eax, 13
    imul    eax, eax, -1640531535
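
For readers mapping the listing back to C, here is a minimal sketch of one XXH32 round. The two signed immediates in the assembly are the XXH32 primes 0x85EBCA77 and 0x9E3779B1 read as signed 32-bit values. The helper names `rotl32` and `xxh32_round` are illustrative, not necessarily the identifiers used in xxhash.c.

```c
#include <stdint.h>

/* XXH32 primes; the assembly shows the same bit patterns as signed values. */
#define PRIME32_1 0x9E3779B1U  /* (uint32_t)-1640531535 */
#define PRIME32_2 0x85EBCA77U  /* (uint32_t)-2048144777 */

/* Rotate left; r must be in 1..31. Compilers usually turn this into a
 * single rotate instruction (rol, shld, or rorx depending on the target). */
static uint32_t rotl32(uint32_t x, unsigned r)
{
    return (x << r) | (x >> (32 - r));
}

/* One accumulator update, matching the four instructions in the listing. */
static uint32_t xxh32_round(uint32_t acc, uint32_t input)
{
    acc += input * PRIME32_2;  /* imul ebx, [edi], PRIME32_2 ; add eax, ebx */
    acc  = rotl32(acc, 13);    /* rol  eax, 13 */
    acc *= PRIME32_1;          /* imul eax, eax, PRIME32_1 */
    return acc;
}
```

Everything here is 32-bit arithmetic, which is why the generated code can be identical on x86 and x86_64.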

Meanwhile, XXH64 needs something different: on 32-bit, each iteration requires two emulated 64-bit multiplies.
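
To make the cost concrete, here is a sketch of how the low 64 bits of a 64-bit multiply can be assembled from 32-bit pieces, followed by an XXH64 round that uses it twice. It only approximates what a compiler emits for a 32-bit target. The names `mul64_via_32`, `rotl64`, and `xxh64_round_emulated` are hypothetical; the constants are the XXH64 primes.

```c
#include <stdint.h>

#define PRIME64_1 0x9E3779B185EBCA87ULL
#define PRIME64_2 0xC2B2AE3D27D4EB4FULL

/* Low 64 bits of a * b using only 32-bit operands:
 * one widening 32x32->64 multiply plus two truncating 32-bit multiplies. */
static uint64_t mul64_via_32(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t lo    = (uint64_t)a_lo * b_lo;      /* full 64-bit product */
    uint32_t cross = a_hi * b_lo + a_lo * b_hi;  /* only the low 32 bits matter */
    return lo + ((uint64_t)cross << 32);         /* a_hi * b_hi overflows away */
}

/* Rotate left; r must be in 1..63. */
static uint64_t rotl64(uint64_t x, unsigned r)
{
    return (x << r) | (x >> (64 - r));
}

/* One XXH64 accumulator update with both multiplies emulated. */
static uint64_t xxh64_round_emulated(uint64_t acc, uint64_t input)
{
    acc += mul64_via_32(input, PRIME64_2);
    acc  = rotl64(acc, 31);
    acc  = mul64_via_32(acc, PRIME64_1);
    return acc;
}
```

Each round therefore costs roughly six 32-bit multiplies instead of the two that XXH32 needs, before counting the 64-bit add and rotate that must also be split across register pairs.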

Depending on the chip, XXH64 can be made faster with different SSE2/SSE4/NEON routines for 32-bit (including a really fast SSE4 one for Nehalem to Ivy Bridge ONLY because of the temporary god mode pmulld before the 5 extra cycle nerf in Haswell), but...

  1. It only benefits 32-bit.
  2. The three routines are entirely different.
  3. The routines are very long as they not only require two multiplies, but the optimal NEON path requires two different multiply routines (because of the free vld2.32).
  4. The dispatch logic is ridiculous:

    if (32-bit and SSE4 and between Nehalem and Ivy Bridge)
        XXH64_32_SSE4_Nehalem();
    else if (32-bit and SSE2)
        XXH64_32_SSE2();
    else if (32-bit and NEON)
        XXH64_NEON32();
    else
        XXH64_Scalar();

pegleGrot commented 5 years ago

No need to worry about ARM in the x86 builds. :)

Bulat-Ziganshin commented 5 years ago

If you need a portable x86 feature detector, you may look at https://github.com/Bulat-Ziganshin/FARSH/blob/master/benchmark/CpuID.h

tansy commented 5 years ago

I built win32/64 binaries with mingw-gcc from the regular v0.7.0 release, without msys*dll dependencies (except msvcrt.dll). I tested them for compliance and they work fine. Use them if you want. In my tests they are slightly faster than builds from newer compilers.

Cyan4973 commented 5 years ago

v0.7.1 includes new Windows binaries, which were compiled with -static to avoid any dependency. Tested in a standard cmd shell.

pegleGrot commented 5 years ago

Thanks Cyan and tansy.