JIABI / xxhash

Automatically exported from code.google.com/p/xxhash

Performance problem on ARM #28

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Hello,

My benchmark tool:
https://github.com/alphaonex86/CatchChallenger/tree/version-2/tools/xxhashbenchmark
My result:
http://catchchallenger.first-world.info/wiki/Benchmark_for_conception#Hash_and_control_sum
As you can see it's very fast on x86, but on an ARM Cortex A9 it's very slow, as slow as sha224!
Any idea what causes this performance problem?

Cheers,

Original issue reported on code.google.com by brule.he...@gmail.com on 11 Aug 2015 at 3:42

GoogleCodeExporter commented 8 years ago
Thanks for your feedback.

There are a number of possibilities.

First, it seems you are using a 32-bit CPU/OS.
In that case, 64-bit arithmetic is going to be slow.
xxHash32 is meant to cover that situation : it's much faster on 32-bit CPUs.
Obviously, its spreading capability is limited (32 bits vs 64 bits), so depending on your use case, that may or may not be acceptable.

As a workaround, if the difference between 32-bit and 64-bit is very large (>2x), you can still consider using xxHash32 twice with 2 different seeds.
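For illustration, a minimal sketch of that workaround (illustrative name, assuming the XXH32() prototype from the benchmark's xxhash.h); combining two 32-bit hashes is not equivalent to a true 64-bit hash, it merely widens the fingerprint :

#include <stddef.h>
#include <stdint.h>
#include "xxhash.h"   /* assumes the XXH32() prototype shipped with the benchmark */

/* Workaround sketch : build a 64-bit fingerprint from two XXH32() passes
 * with different seeds, staying on cheap 32-bit arithmetic throughout. */
static uint64_t fingerprint64(const void* data, size_t len)
{
    uint64_t lo = XXH32(data, len, 0);            /* seed #1 */
    uint64_t hi = XXH32(data, len, 0x9E3779B9U);  /* seed #2 : any distinct constant */
    return (hi << 32) | lo;
}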

Next potential performance impact : aligned vs unaligned loads, especially on ARM, and especially for 64-bit loads.
But let's take it one step at a time.

Original comment by yann.col...@gmail.com on 11 Aug 2015 at 3:58

GoogleCodeExporter commented 8 years ago
Hello,

I use XXH32() on the ARM Cortex A9, so a 32-bit CPU (and OS). Is XXH32() indeed the 32-bit version? I can't find xxHash32 in this header:
https://github.com/alphaonex86/CatchChallenger/blob/version-2/tools/xxhashbenchmark/xxhash.h
My current use is the first 4 bytes of sha224; I need the hash to be the same on all CPUs/OSes so I can compare files remotely. So xxHash32 is fine, isn't it?

My benchmark on aligned/unaligned access:
http://catchchallenger.first-world.info/wiki/Benchmark_for_conception#memcpy_vs_raw

My xxhash.* files are extracted from the latest lz4 source.

Cheers,

Original comment by brule.he...@gmail.com on 11 Aug 2015 at 4:22

GoogleCodeExporter commented 8 years ago
I detect changed files to keep a folder of files in sync. A 1/2^32 collision probability is acceptable for me.

Original comment by brule.he...@gmail.com on 11 Aug 2015 at 4:25

GoogleCodeExporter commented 8 years ago
> I detect changed files to keep a folder of files in sync. A 1/2^32 collision probability is acceptable for me.
> My current use is the first 4 bytes of sha224; I need the hash to be the same on all CPUs/OSes so I can compare files remotely. So xxHash32 is fine, isn't it?

Yes, sounds totally fine. xxHash is portable and endian-independent, so it will give the same answer on any platform.

> Is XXH32() indeed the 32-bit version?

Yes, it's the right function.
If that's the one you have already benchmarked, then we need to find another 
performance bottleneck.

Possibility #2 : raw memory performance.
In some cases (embedded systems), accessing raw memory is simply very slow. Any 
access outside of the cache takes a huge performance toll.

Difficult to say from your results. You mention "35kH/s". I suspect it means 35K hashes per second. But then what is hashed ? And especially, what is its size ?

For example, maybe your tested sample nicely fits into an Intel L2 cache, but not within the ARM cache ? In that case, the difference could be explained just by the architecture.

Original comment by yann.col...@gmail.com on 11 Aug 2015 at 4:33

GoogleCodeExporter commented 8 years ago
> xxHash is portable and endian-independent
Where can I find that? I already have a hardware -> little-endian conversion function coded, if needed.

Yes, 35K hashes per second; sha224 is measured via the command "cryptsetup benchmark", and xxhash on 4096-byte data blocks (fits in L1). See:
https://github.com/alphaonex86/CatchChallenger/blob/version-2/tools/xxhashbenchmark/main.cpp#L8
hdparm -T /dev/zero gives 19471 MB/sec on my x86 and 727 MB/sec on my ARM.

Grsec (mostly PaX) slows down memory access, but then why is it at the same level as sha224? The memory allocator can change everything; if memory allocation/deallocation is happening, maybe that's the cause.

Original comment by brule.he...@gmail.com on 11 Aug 2015 at 5:04

GoogleCodeExporter commented 8 years ago
Another difference: on my x86, Linux is configured in desktop mode; on ARM, it's configured in server mode (scheduler configured in the kernel).
The ARM is slower overall, but only with xxhash do I see such a big slowdown.

Original comment by brule.he...@gmail.com on 11 Aug 2015 at 5:11

GoogleCodeExporter commented 8 years ago
OK, so the next candidate is aligned vs unaligned access.

This is only meaningful if you feel that, in your application, the start of the data is going to be aligned on 4-byte boundaries in most circumstances.
(This will be the case if you malloc() a buffer and just load your file into it.)

To ensure this property, prefer :
unsigned int inputdata[1024];
in your test program.

This should help, but there is no guarantee.

The critical point is to ensure that the correct branch is taken by the code (which requires some debugging capability), especially within XXH32() :

    if ((((size_t)input) & 3) == 0)  // <===== Ensure this branch exists and its answer is "yes"
    {
        if ((endian_detected==XXH_littleEndian) || XXH_FORCE_NATIVE_FORMAT)
            return XXH32_endian_align(input, len, seed, XXH_littleEndian, XXH_aligned);
        else
            return XXH32_endian_align(input, len, seed, XXH_bigEndian, XXH_aligned);
    }

Original comment by yann.col...@gmail.com on 11 Aug 2015 at 5:23

GoogleCodeExporter commented 8 years ago
9090H/s with unsigned int inputdata[1024];

XXH_USE_UNALIGNED_ACCESS is defined, so the code never goes into this section.
The input will always be aligned in my case (the benchmark matches the real case).

Original comment by brule.he...@gmail.com on 11 Aug 2015 at 5:44

GoogleCodeExporter commented 8 years ago
On the Rpi 1 (no grsec/PaX): 13000H/s for xxHash, 33000H/s for sha256.
So it's worse than my previous ARM. XXH_USE_UNALIGNED_ACCESS is set there too.

Original comment by brule.he...@gmail.com on 11 Aug 2015 at 5:55

GoogleCodeExporter commented 8 years ago
> 9090H/s with unsigned int inputdata[1024];

That looks like a slowdown ?

> XXH_USE_UNALIGNED_ACCESS is defined, so the code never goes into this section.

the objective is to *not use unaligned access*, so make sure it is *not* 
defined !
The code should reach this variant :
return XXH32_endian_align(input, len, seed, XXH_littleEndian, XXH_aligned);

In theory, the code should _automatically_ detect that its input is aligned, and use this opportunity to request direct 32-bit memory accesses, without all the precautionary code specific to CPUs with poor unaligned access support.

Original comment by yann.col...@gmail.com on 11 Aug 2015 at 5:59

GoogleCodeExporter commented 8 years ago
Yes, it seems like a slowdown.

With XXH_USE_UNALIGNED_ACCESS undefined:
35000H/s
With #define XXH_FORCE_NATIVE_FORMAT 1:
200000H/s
Why is it done dynamically? Why isn't it done statically?
Look at:
https://github.com/alphaonex86/CatchChallenger/blob/version-2/general/base/PortableEndian.h

Original comment by brule.he...@gmail.com on 11 Aug 2015 at 6:12

GoogleCodeExporter commented 8 years ago
if ((((size_t)input) & 3) == 0) is true, so the input is aligned.
And endian_detected==XXH_littleEndian is true.

Original comment by brule.he...@gmail.com on 11 Aug 2015 at 6:20

GoogleCodeExporter commented 8 years ago
That was my next candidate ...

The endianness detection code is supposed to reduce to a single test at the beginning of the function :

XXH_endianess endian_detected = (XXH_endianess)XXH_CPU_LITTLE_ENDIAN;

which then branches to the correct version of the algorithm. In many cases, the test itself is "hard-coded". But even if it is not, it should be done only once, and then propagated.
This only holds if the compiler does a good enough job at inlining. Hence the FORCE_INLINE within :

FORCE_INLINE U32 XXH32_endian_align()

something which works well on all compilers I could test on, but is difficult 
to guarantee on every compiler and every platform.

I also used to use a static define for such detection, but I kept receiving reports of unforeseen platforms (CPU + OS) which were either undetected or, worse, wrongly detected. I therefore switched to dynamic detection, with the assumption that the compiler should do a good enough job at simplifying this test to a clean 0/1 result, and then removing branches wherever needed.
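For illustration, here is a minimal sketch of the kind of dynamic test this relies on (simplified names, not the exact xxhash.c macros) :

/* Runtime endianness test that optimizers can constant-fold : the union is
 * initialized from a compile-time constant, so with -O2/-O3 the function
 * typically reduces to "return 1" (or 0), and the endian branch in the
 * inlined caller is then removed as dead code. */
static int is_little_endian(void)
{
    const union { unsigned u; unsigned char c[4]; } one = { 1 };
    return one.c[0];   /* 1 on little-endian targets, 0 on big-endian ones */
}

/* e.g. inside XXH32() :
 *   XXH_endianess endian_detected = is_little_endian() ? XXH_littleEndian
 *                                                      : XXH_bigEndian;
 */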

Original comment by yann.col...@gmail.com on 11 Aug 2015 at 6:24

GoogleCodeExporter commented 8 years ago
Here it's gcc 4.8, so a very common and up-to-date compiler. As a developer you sometimes need to help the compiler a bit ;)

Not detecting is not a problem. Wrongly detecting is a big problem, so being stricter with the filter can help: force static detection only for known configurations: gcc 4.7+ with x86, ARM, ...

With dynamic detection, performance will depend heavily on the instruction pipeline, more than on the compiler; the compiler mostly handles the static part.
As you have seen on the rpi it's a big regression, and I think it would be even worse on other single-issue cores. Look:
Rpi1: single-issue, 2.5x slower
Cortex A9: partial dual-issue, in-order microarchitecture with an 8-stage pipeline, 1x, same speed
Cortex A7: partial dual-issue, in-order microarchitecture with an 8-stage pipeline, xxHash: 13849H/s, Sha256: 59578H/s
x86 Intel Haswell: multiple-issue, 14- to 19-stage instruction pipeline: 2.6x FASTER than hardware-accelerated sha2 (which is already fast).
And I'm sure it's worse on:
Cortex A5: single-issue, in-order microarchitecture with an 8-stage pipeline
or AMD Geode. I don't have time to test those.
The inlined dynamic detection is 14x slower in debug mode than with a normal compilation (-O2) on gcc.
The main point of xxHash is to be faster than the others, mostly sha256, which is a true cryptographic hash.

I can provide you private access to an ARM test machine if you want.

Original comment by brule.he...@gmail.com on 11 Aug 2015 at 7:47

GoogleCodeExporter commented 8 years ago
FORCE_INLINE U32 XXH32_endian_align(const void* input, size_t len, U32 seed, XXH_endianess endian, XXH_alignment align) -> another code problem:
while(){if} will always be slower than if{while1}else{while2}.
Many of the conditions inside the while loop should be hoisted out of it to improve performance where the branch-prediction hardware is poor/slow.
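To make the suggestion concrete, here is a generic sketch of that transformation (illustrative code, not the actual xxhash.c loop); compilers call it loop unswitching :

#include <stddef.h>

/* Before : the loop-invariant flag is re-tested on every iteration. */
static unsigned sum_slow(const unsigned char* p, size_t len, int invert)
{
    unsigned acc = 0;
    size_t i;
    for (i = 0; i < len; i++) {
        if (invert)                       /* invariant condition inside the loop */
            acc += (unsigned)(p[i] ^ 0xFFu);
        else
            acc += p[i];
    }
    return acc;
}

/* After (unswitched) : test once, then run a specialized branch-free loop. */
static unsigned sum_fast(const unsigned char* p, size_t len, int invert)
{
    unsigned acc = 0;
    size_t i;
    if (invert) {
        for (i = 0; i < len; i++) acc += (unsigned)(p[i] ^ 0xFFu);
    } else {
        for (i = 0; i < len; i++) acc += p[i];
    }
    return acc;
}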

Original comment by brule.he...@gmail.com on 12 Aug 2015 at 8:27

GoogleCodeExporter commented 8 years ago
That last tip provides a large boost: 8000 -> 85800 H/s, roughly 10.7x faster on the rpi1.
Dropping the XXH_endianess endian and XXH_alignment align parameters when XXH_USE_UNALIGNED_ACCESS and/or XXH_FORCE_NATIVE_FORMAT are defined would help the compiler minimize stack/register pressure in a generic way, for better performance on all architectures.

Original comment by brule.he...@gmail.com on 12 Aug 2015 at 8:50

GoogleCodeExporter commented 8 years ago
These are very good hints.
Indeed, having direct access to real hardware for tests
is key to observing such effects.

> The inlined dynamic detection is 14x slower in debug mode than with a normal compilation (-O2) on gcc.

Performance in debug mode does not matter. Debug mode is only there to observe 
code correctness. It's not indicative of final performance.

What matters is that these "dynamic" tests get simplified to static compile-time 0/1, and then the relevant branches shall be removed during compilation (dead code elimination), producing clean, straightforward code.

As stated, it depends a lot on the compiler's optimizations. gcc 4.8 should be good enough for this, although its final effectiveness may vary depending on the target hardware.

Minor point : note that I'm compiling with -O3 rather than -O2. I'm not expecting a large difference, but maybe in this case ...

> I can provide you private access to an ARM test machine if you want.

Yes, an SSH terminal to such a system would be welcome.

Original comment by yann.col...@gmail.com on 12 Aug 2015 at 10:17

GoogleCodeExporter commented 8 years ago
I can open you access to real hardware; I have an rpi1, cubieboard 2, odroid-u3, cubox-i, ... See more at
http://catchchallenger.first-world.info/wiki/Portable_datacenter

In some ways you are right, but if there is a way to improve performance in all cases, including without strong optimization like -O2 (loop interchange, ...), I think it's better. Even more so for profiling in debug mode.

The world has so many compilers, OSes and hardware. This simple thread shows that the theory that the compiler will do the job is wrong. But in this case, for me, it's due to the code, not the compiler. The compiler also has to respect the code as written.
My previous comment shows something: the code is badly written; we shouldn't rely on the compiler, but fix the code.

I will send you the password privately, or you can send me your ssh public key:
http://ultracopier.first-world.info/contact.html

Original comment by brule.he...@gmail.com on 12 Aug 2015 at 11:13

GoogleCodeExporter commented 8 years ago
A useful link:
https://en.wikipedia.org/wiki/Loop_optimization
;)

Original comment by brule.he...@gmail.com on 12 Aug 2015 at 11:35

GoogleCodeExporter commented 8 years ago
Interesting note:
* gzip -1 vs lz4 -1 on x86: lz4 is 6.2x faster
* gzip -1 vs lz4 -1 on ARM: lz4 is 3.6x faster

Original comment by brule.he...@gmail.com on 12 Aug 2015 at 2:16

GoogleCodeExporter commented 8 years ago
That's why I'm interested in looking into such issues ;)

Original comment by yann.col...@gmail.com on 12 Aug 2015 at 2:21

GoogleCodeExporter commented 8 years ago
The ssh connection works.

I've downloaded xxHash from github, as a comparison exercise on the target system.

cat /proc/cpuinfo
processor   : 0
model name  : ARMv6-compatible processor rev 7 (v6l)
BogoMIPS    : 2.00
Features    : half thumb fastmult vfp edsp java tls 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part    : 0xb76
CPU revision    : 7

Hardware    : BCM2708
Revision    : 0002

Right out of the box, using r40 and the xxhsum internal memory benchmark, I got the following figures :

XXH32            :       7520 ->   267.0 MB/s   0x621B687C                     
XXH32 (unaligned :       7519 ->   242.6 MB/s                                  
XXH64            :       7520 ->   113.0 MB/s   0xC256B0CD7B86609B 

Keeping in mind that your own test hashes 4 KB memory segments, this translates into roughly 66 kH/s (267 MB/s / 4096 bytes per hash).
This is less than the 86 kH/s reported by your benchmark tool, but not vastly below it either.

As a test, I compiled xxhsum using -O2 instead (default Makefile uses -O3). 
Here are the results :

XXH32            :       7520 ->    73.7 MB/s   0x621B687C                     
XXH32 (unaligned :       7519 ->    71.8 MB/s                                  
XXH64            :       7520 ->    85.0 MB/s   0xC256B0CD7B86609B         

which is significantly worse. The difference on x86/x64 platforms is much less pronounced, so this impact is related to this platform/compiler combination (as a side note, I see the installed GCC version is v4.7.4).

This translates into ~18 kH/s, which might be closer to what you were 
experiencing at the beginning.

I also re-used the xxhash.c version stored at the root directory, and compiled a new xxhsum with it. I expected it would feature the modifications you were talking about, and the corresponding performance improvements.

But I'm not sure it does, since performance is identical to vanilla r40, on 
both -O3 and -O2 settings.
This is in contrast with your test program, xxhashbenchmark, which displays :

Result: 200000 in 2324ms
Mean:86058H/s
XXH_USE_UNALIGNED_ACCESS

I was expecting it to use your own xxHash variant to reach such a figure.

Original comment by yann.col...@gmail.com on 12 Aug 2015 at 3:12

GoogleCodeExporter commented 8 years ago
New test :

Using vanilla xxhash r40,
but making sure that XXH_USE_UNALIGNED_ACCESS is disabled.
(just comment out the line :
//#  define XXH_USE_UNALIGNED_ACCESS 1)
xxhsum internal benchmark gives :

XXH32            :       7520 ->   **419.6 MB/s**   0x621B687C
XXH32 (unaligned :       7519 ->   242.5 MB/s
XXH64            :       7520 ->   121.7 MB/s   0xC256B0CD7B86609B

This is a great improvement, although only for 32-bit aligned access.
**420 MB/s** roughly translates into 105 kH/s, so that's even better than the previous result from xxhashbenchmark.

It really shows that precautionary code is being used when accessing non-aligned data, and that this precautionary code costs some performance.

As a side exercise, I also tried to compile xxhsum using -O2; here are the results :

XXH32            :       7520 ->   **419.6 MB/s**   0x621B687C
XXH32 (unaligned :       7519 ->    72.2 MB/s
XXH64            :       7520 ->   121.6 MB/s   0xC256B0CD7B86609B

Interestingly, performance on aligned data is about the same as -O3.

So basically, it seems to be the single point of improvement : make sure this 
define is not set. 

I might as well remove it in a future revision...

Original comment by yann.col...@gmail.com on 12 Aug 2015 at 3:14

GoogleCodeExporter commented 8 years ago
Another test :

I forced the use of direct 32-bit accesses even on non-aligned data (this requires pretending the data is aligned even when it's not). Here are the results :

XXH32            :       7520 ->   404.6 MB/s   0x621B687C                     
XXH32 (unaligned :       7519 ->   __387.5 MB/s__
XXH64            :       7520 ->   121.6 MB/s   0xC256B0CD7B86609B

Interestingly enough, -O2 compilation settings produce the same speed.

This is a lot better for 32-bit unaligned access !
That means this CPU is able to efficiently access unaligned memory.

This is typical of the problems I've had with ARM :
Some CPUs are unable to access unaligned memory, and crash on such attempts.
Others can access it, but with poor performance.
Sometimes, performance is good for 32-bit loads but poor for 64-bit ones.
And finally, some are simply very good at handling unaligned memory accesses. Almost Intel-like (which is best in class).

Handling all these use-cases is difficult, and a gigantic set of #defines would be required. Even then, there are too many sub-families and specific behaviors, so it would never be generic enough.

What is supposed to be generic is to trust the compiler : tell it to access 
memory, and let it choose how to do it best for the target cpu.

The way to do it is typically :

memcpy(memPtr, &value, sizeof(value));

On x86/x64, it works, and this function call is translated into a single mov instruction, since Intel CPUs correctly and efficiently handle unaligned memory accesses.

But on ARM, no such outcome. It turns out the compiler is in fact very cautious. I guess that's because it doesn't know the specifics of each CPU that well, and therefore tends to err on the safe side. The compiled binary works, it's just vastly sub-optimal.

The only way I found to counter-balance this problem is to add a manual switch 
(#define), to force the compiler to issue direct instructions.
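As a sketch of both approaches (illustrative code; XXH_SKETCH_DIRECT_ACCESS is a hypothetical switch name for this example, not the actual xxhash.c macro) :

#include <stdint.h>
#include <string.h>

static uint32_t read32(const void* ptr)
{
#if defined(XXH_SKETCH_DIRECT_ACCESS)
    /* Direct access : a single 32-bit load. Only safe when the target CPU
     * (or the actual alignment of ptr) tolerates unaligned accesses. */
    return *(const uint32_t*)ptr;
#else
    /* Portable path : let the compiler decide how to perform the copy.
     * On x86/x64 this compiles down to a single mov ; a cautious ARM
     * compiler may emit byte-by-byte loads instead. */
    uint32_t val;
    memcpy(&val, ptr, sizeof(val));
    return val;
#endif
}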

This method has 2 drawbacks :

- Most people don't realize that the switch is present. They just compile the code "as is". So any such switch tends to be overlooked.

- Forcing the use of direct access can have nasty side-effects on some advanced optimizations. For example, on x64, newer compilers (GCC 5+) try to auto-vectorize some portions of code. Direct access is then assumed to guarantee alignment. If it doesn't, the result is buggy binaries.

So basically, this switch must be manual, so that the user takes full responsibility for this choice. If it crashes, it's enough to revert it. But I can't ship "default" switches which may produce buggy binaries.

Anyway, one big issue observed here is that the XXH_USE_UNALIGNED_ACCESS switch does the reverse of what it's supposed to do => it slows down memory accesses on CPUs which can handle them efficiently ! This is the result of several successive changes, with no way to check them on real hardware. That's reason enough to look for a correction.

Original comment by yann.col...@gmail.com on 12 Aug 2015 at 3:19

GoogleCodeExporter commented 8 years ago
Do -O2 and -O3 make a big difference? If so, the C code needs to be changed to improve performance for the most common programs packaging lz4. Analysing the generated code can be useful.

Original comment by brule.he...@gmail.com on 12 Aug 2015 at 3:39

GoogleCodeExporter commented 8 years ago
-O2 and -O3 have identical performance with direct memory accesses.

Differences exist when the compiler generates "cautious" code (for unaligned 
memory accesses).

Original comment by yann.col...@gmail.com on 12 Aug 2015 at 3:52

GoogleCodeExporter commented 8 years ago
The latest version of xxHash in the "dev" branch seems to improve this situation by a fairly large margin :

https://github.com/Cyan4973/xxHash/tree/dev

To test it, you can either sync your git directory and check out the dev branch, or download the automated github archive :
https://github.com/Cyan4973/xxHash/archive/dev.zip

This version offers the same performance at -O2 and -O3 on the ARM rpi target.
It seems to automatically enable direct unaligned memory accesses, meaning performance is good in both cases (aligned & unaligned) :
XXH32            :     102400 ->   406.7 MB/s   0x955F2650   (for both -O2 and -O3)
XXH32 (unaligned :     102399 ->   392.0 MB/s                (for both -O2 and -O3)

Otherwise, I noticed that when this option is manually turned off (by commenting out XXH_FORCE_DIRECT_MEMORY_ACCESS), aligned performance is slightly better, while unaligned performance is vastly worse :
XXH32            :     102400 ->   422.6 MB/s   0x955F2650   (for both -O2 and -O3)
XXH32 (unaligned :     102399 ->   244.3 MB/s   ( -O3 )
XXH32 (unaligned :     102399 ->    72.4 MB/s   ( -O2 )

Not sure why, and not sure if the tiny gain for the aligned memory case is worth the large drop for the unaligned one.

test system specs :
lscpu
Architecture:          armv6l
Byte Order:            Little Endian
CPU(s):                1
On-line CPU(s) list:   0
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             1
Model name:            ARMv6-compatible processor rev 7 (v6l)
CPU max MHz:           700.0000
CPU min MHz:           700.0000

Original comment by yann.col...@gmail.com on 12 Aug 2015 at 5:33

GoogleCodeExporter commented 8 years ago
For integration into a program, good performance is needed in all cases; if it's sometimes much better than sha256 and other times much worse, I'd rather use sha256 in my program, because more and more CPUs will be helped by cryptographic hardware.

The difference between -O2 and -O3 can be seen with: gcc -Q -O2 --help=optimizers and gcc -Q -O3 --help=optimizers
On this gcc on the rpi, -O3 additionally enables: -fgcse-after-reload -finline-functions -fipa-cp-clone -fpredictive-commoning -ftree-loop-distribute-patterns -ftree-vectorize -funswitch-loops.
That means it's possible to force the same effect by changing the code. We also need to determine which flag has what influence.

Original comment by brule.he...@gmail.com on 12 Aug 2015 at 5:57

GoogleCodeExporter commented 8 years ago
Have you been able to test the newer xxHash version ? (dev branch)

I'm considering pushing it to master and making it a new release (r41),
as the results on ARM systems look pretty good so far.

Original comment by yann.col...@gmail.com on 13 Aug 2015 at 9:03

GoogleCodeExporter commented 8 years ago
It's better, but on the rpi test system:
23187H/s for xxHash
32125H/s for sha256
So it remains slower than sha256.

Original comment by brule.he...@gmail.com on 13 Aug 2015 at 10:21

GoogleCodeExporter commented 8 years ago
Strange, the rpi results should be > 100 kH/s
(when using 4 KB segments to hash for the tests).

Original comment by yann.col...@gmail.com on 13 Aug 2015 at 10:40

GoogleCodeExporter commented 8 years ago

Original comment by yann.col...@gmail.com on 13 Aug 2015 at 10:43

GoogleCodeExporter commented 8 years ago
-funswitch-loops is what gives the 48543H/s for xxHash under the -O3 switch. So now some code modification should achieve the same effect.

Original comment by brule.he...@gmail.com on 13 Aug 2015 at 10:43

GoogleCodeExporter commented 8 years ago
I confirm the effect of -funswitch-loops when using your test program :

-O2 :
Result: 200000 in 2412ms
Mean:82918H/s

-O2 -funswitch-loops (or -O3) :
Result: 200000 in 2037ms
Mean:98183H/s

What's strange is that I don't need it on my own test program :

-O2 :
./xxhsum -b
XXH32               :     102400 ->   406.8 MB/s   0x955F2650                  
XXH32 (unaligned)   :     102399 ->   392.0 MB/s                   

-O2 -funswitch-loops (or -O3) :
./xxhsum -b
XXH32               :     102400 ->   406.3 MB/s   0x955F2650                  
XXH32 (unaligned)   :     102399 ->   391.5 MB/s                

basically, same result.

I don't really understand why.
I just note that xxhsum is "pure C", while your test program is a mix of C 
(xxhash) and C++ (main), therefore requiring g++ for the linking stage. That's 
the only major difference I can see.

Original comment by yann.col...@gmail.com on 13 Aug 2015 at 11:50

GoogleCodeExporter commented 8 years ago
I use C++11 or C11. I don't have time to analyse the generated assembly, but it could be interesting.
That's the problem: the actual performance depends on the compiler's optimization passes.

Original comment by brule.he...@gmail.com on 13 Aug 2015 at 12:35

GoogleCodeExporter commented 8 years ago
Performance fix published in r42
(now hosted on github : https://github.com/Cyan4973/xxHash/releases/tag/r42 )

Original comment by yann.col...@gmail.com on 21 Aug 2015 at 10:57

GoogleCodeExporter commented 8 years ago

Original comment by yann.col...@gmail.com on 21 Aug 2015 at 10:58