Closed easyaspi314 closed 4 years ago
@Cyan4973 Do you think you can update the tables before the next release?
Likely, yes.
But I'm not sure what's best, to update the README now, or is it better to wait for v0.8.0
.
In the meantime, I've also setup https://github.com/Cyan4973/xxHash/wiki/Performance-comparison .
The main idea is that the README will only be able to present a few numbers, a sort of very summarized view, while more involved comparisons with more figures will be performed in the wiki.
I'm fine with waiting until v0.8.0
.
Also, are these the latest graphs or just for v0.7.0
?
Not sure if it's v0.7.0
, but it's definitely not latest version.
Yep, to update too.
When testing, can you make sure to use the -mno-avx256-split-unaligned-load
flag on GCC+AVX2? That will give a more fair benchmark to GCC.
Sure !
I tried it, in mingw64
, and something very strange happened :
the seeded
variant of xxh3
ended up being much faster than the non-seeded one ....
gcc
+ nosplit + avx2
, unseeded : 34 GB/s
gcc
+ nosplit + avx2
, seeded : 45 GB/s
clang
+ avx2
, both : 46 GB/s
That's unexpected. The unseeded
variant is supposed to be the faster one ...
If anything, the seeded
variant checks if the seed is 0
, and in this case, defers to the unseeded
variant, so it should be equivalent.
However, the benchmark actually changes the seed at each iteration, so the seeded
variant cannot take advantage of this shortcut.
Still, it doesn't explain why it's faster.
The seeded
variant is supposed to build a new secret
at runtime, and then pass the newly created secret as argument to the hot loop. If anything, this is more work compared to the unseeded
variant, which just pass the pre-computed secret
without any runtime effort.
I guess that GCC tried to constant fold kSecret
and it failed miserably. I'll have to check the assembly.
I'm getting it slightly slower (58.9/60) but not the amount that you are experiencing. This was done on mingw64 with GCC 9.2.0.
xxhsum.exe 0.7.3 (64-bit x86_64 + AVX2 little endian), GCC 9.2.0, by Yann Collet
Sample of 100 KB...
XXH32 : 102400 -> 82306 it/s ( 8037.7 MB/s)
XXH32 unaligned : 102400 -> 81725 it/s ( 7981.0 MB/s)
XXH64 : 102400 -> 164036 it/s (16019.2 MB/s)
XXH64 unaligned : 102400 -> 163824 it/s (15998.4 MB/s)
XXH3_64b : 102400 -> 603802 it/s (58965.1 MB/s)
XXH3_64b unaligned : 102400 -> 491521 it/s (48000.1 MB/s)
XXH3_64b seeded : 102400 -> 614401 it/s (60000.1 MB/s)
XXH3_64b seeded unaligne : 102400 -> 493891 it/s (48231.5 MB/s)
XXH128 : 102400 -> 574206 it/s (56074.8 MB/s)
XXH128 unaligned : 102400 -> 489953 it/s (47847.0 MB/s)
XXH128 seeded : 102400 -> 579336 it/s (56575.8 MB/s)
XXH128 seeded unaligned : 102400 -> 496284 it/s (48465.3 MB/s)
What GCC version do you have? And can you run gcc -S -masm=intel (flags) xxhash.c -o xxhash.s
and then give me xxhash.s?
I'm using the same version of gcc, v9.2.0
, on mingw64
:
gcc version 9.2.0 (Rev2, Built by MSYS2 project)
Also note that I'm compiling with -mno-avx256-split-unaligned-load
as you suggested. Without it, both variants are down to 30 GB/s.
Note sure if that makes sense, but I found a direct link with the XXH_align()
statement in front the stack area reserved for the custom secret
.
In current code, this area is aligned on 8 bytes (not sure why). If it's aligned on 8 or 16 bytes, then the speed of seeded
variant gets a boost. If it's aligned on 32
or 64
bytes, then the speed goes down, to the level of the unseeded variant.
This goes contrary to expectation. It's as if avx2
unaligned load is faster than aligned one.
Note that kSecret
is aligned on 64
bytes, but even when I change that statement, performance doesn't move.
I am seeing slight variation on the alignment, but not much.
Does removing the optimize pragmas and using -O3
help?
can you run gcc -S -masm=intel (flags) xxhash.c -o xxhash.s and then give me xxhash.s?
https://gist.github.com/Cyan4973/6266189692549d2fa9deab7cc086ccfb
more verbose version :
https://gist.github.com/Cyan4973/bf74a910f1906e16703495cc9f064728
I am seeing slight variation on the alignment, but not much.
I'm wondering if this could be a cpu specific effect...
I'm getting the same speeds, so it might be cpu-specific.
As for the verbose version, ow my eyes :joy:
Err… @Cyan4973, why 0.7.3 SSE2 executes silently on Windows 7 x64? Created #343.
Damn, this better not be another Unicode wrapper bug.......
Visual Studio implies SSE2 for x64 compiles
Well, of course it would. SSE2 is required for x86_64. That was one of the reasons we chose SSE2 (and NEON), no compat checks.
It also assumes SSE2 on x86 because Windows 7 and later require it.
For the time being, performance comparison is updated in : https://github.com/Cyan4973/xxHash/wiki/Performance-comparison
On reaching v0.8.0
, parts of the wiki will be used to update the performance section within README.md
.
Currently, the benchmarks and SMHasher scores shown in README.md and xxhash.h are out of date. They should be updated, as XXH32 fails SMHasher and they don't include the newer hashes.