Cyan4973 / xxHash

Extremely fast non-cryptographic hash algorithm
http://www.xxhash.com/
Other
8.96k stars 768 forks source link

Update README.md and benchmarks #291

Closed easyaspi314 closed 4 years ago

easyaspi314 commented 4 years ago

Currently, the benchmarks and SMHasher scores shown in README.md and xxhash.h are out of date. They should be updated, as XXH32 fails SMHasher and they don't include the newer hashes.

easyaspi314 commented 4 years ago

@Cyan4973 Do you think you can update the tables before the next release?

Cyan4973 commented 4 years ago

Likely, yes. But I'm not sure what's best, to update the README now, or is it better to wait for v0.8.0.

In the meantime, I've also setup https://github.com/Cyan4973/xxHash/wiki/Performance-comparison .

The main idea is that the README will only be able to present a few numbers, a sort of very summarized view, while more involved comparisons with more figures will be performed in the wiki.

easyaspi314 commented 4 years ago

I'm fine with waiting until v0.8.0.

Also, are these the latest graphs or just for v0.7.0?

Cyan4973 commented 4 years ago

Not sure if it's v0.7.0, but it's definitely not latest version. Yep, to update too.

easyaspi314 commented 4 years ago

When testing, can you make sure to use the -mno-avx256-split-unaligned-load flag on GCC+AVX2? That will give a more fair benchmark to GCC.

Cyan4973 commented 4 years ago

Sure !

Cyan4973 commented 4 years ago

I tried it, in mingw64, and something very strange happened : the seeded variant of xxh3 ended up being much faster than the non-seeded one ....

gcc + nosplit + avx2, unseeded : 34 GB/s gcc + nosplit + avx2, seeded : 45 GB/s clang + avx2, both : 46 GB/s

That's unexpected. The unseeded variant is supposed to be the faster one ...

If anything, the seeded variant checks if the seed is 0, and in this case, defers to the unseeded variant, so it should be equivalent. However, the benchmark actually changes the seed at each iteration, so the seeded variant cannot take advantage of this shortcut.

Still, it doesn't explain why it's faster. The seeded variant is supposed to build a new secret at runtime, and then pass the newly created secret as argument to the hot loop. If anything, this is more work compared to the unseeded variant, which just pass the pre-computed secret without any runtime effort.

easyaspi314 commented 4 years ago

I guess that GCC tried to constant fold kSecret and it failed miserably. I'll have to check the assembly.

easyaspi314 commented 4 years ago

I'm getting it slightly slower (58.9/60) but not the amount that you are experiencing. This was done on mingw64 with GCC 9.2.0.

xxhsum.exe 0.7.3 (64-bit x86_64 + AVX2 little endian), GCC 9.2.0, by Yann Collet
Sample of 100 KB...
XXH32                    :     102400 ->    82306 it/s ( 8037.7 MB/s)
XXH32 unaligned          :     102400 ->    81725 it/s ( 7981.0 MB/s)
XXH64                    :     102400 ->   164036 it/s (16019.2 MB/s)
XXH64 unaligned          :     102400 ->   163824 it/s (15998.4 MB/s)
XXH3_64b                 :     102400 ->   603802 it/s (58965.1 MB/s)
XXH3_64b unaligned       :     102400 ->   491521 it/s (48000.1 MB/s)
XXH3_64b seeded          :     102400 ->   614401 it/s (60000.1 MB/s)
XXH3_64b seeded unaligne :     102400 ->   493891 it/s (48231.5 MB/s)
XXH128                   :     102400 ->   574206 it/s (56074.8 MB/s)
XXH128 unaligned         :     102400 ->   489953 it/s (47847.0 MB/s)
XXH128 seeded            :     102400 ->   579336 it/s (56575.8 MB/s)
XXH128 seeded unaligned  :     102400 ->   496284 it/s (48465.3 MB/s)

What GCC version do you have? And can you run gcc -S -masm=intel (flags) xxhash.c -o xxhash.s and then give me xxhash.s?

Cyan4973 commented 4 years ago

I'm using the same version of gcc, v9.2.0, on mingw64 : gcc version 9.2.0 (Rev2, Built by MSYS2 project)

Also note that I'm compiling with -mno-avx256-split-unaligned-load as you suggested. Without it, both variants are down to 30 GB/s.

Note sure if that makes sense, but I found a direct link with the XXH_align() statement in front the stack area reserved for the custom secret.

In current code, this area is aligned on 8 bytes (not sure why). If it's aligned on 8 or 16 bytes, then the speed of seeded variant gets a boost. If it's aligned on 32 or 64 bytes, then the speed goes down, to the level of the unseeded variant.

This goes contrary to expectation. It's as if avx2 unaligned load is faster than aligned one.

Note that kSecret is aligned on 64 bytes, but even when I change that statement, performance doesn't move.

easyaspi314 commented 4 years ago

I am seeing slight variation on the alignment, but not much.

Does removing the optimize pragmas and using -O3 help?

Cyan4973 commented 4 years ago

can you run gcc -S -masm=intel (flags) xxhash.c -o xxhash.s and then give me xxhash.s?

https://gist.github.com/Cyan4973/6266189692549d2fa9deab7cc086ccfb

more verbose version :

https://gist.github.com/Cyan4973/bf74a910f1906e16703495cc9f064728

Cyan4973 commented 4 years ago

I am seeing slight variation on the alignment, but not much.

I'm wondering if this could be a cpu specific effect...

easyaspi314 commented 4 years ago

I'm getting the same speeds, so it might be cpu-specific.

As for the verbose version, ow my eyes :joy:

sergeevabc commented 4 years ago

Err… @Cyan4973, why 0.7.3 SSE2 executes silently on Windows 7 x64? Created #343.

easyaspi314 commented 4 years ago

Damn, this better not be another Unicode wrapper bug.......

stati64 commented 4 years ago

Visual Studio implies SSE2 for x64 compiles

easyaspi314 commented 4 years ago

Well, of course it would. SSE2 is required for x86_64. That was one of the reasons we chose SSE2 (and NEON), no compat checks.

It also assumes SSE2 on x86 because Windows 7 and later require it.

Cyan4973 commented 4 years ago

For the time being, performance comparison is updated in : https://github.com/Cyan4973/xxHash/wiki/Performance-comparison

On reaching v0.8.0, parts of the wiki will be used to update the performance section within README.md.

Cyan4973 commented 4 years ago

README.md has been updated with recent benchmark measurements, extracted from the wiki