Thanks for your feedback.
There are a number of possibilities.
First, it seems you are using a 32-bit CPU/OS. In that case, 64-bit arithmetic is going to be slow. xxHash32 is meant to cover exactly this situation: it's much faster on 32-bit CPUs. Obviously, its spreading capability is more limited (32 bits vs 64 bits), so depending on your use case, that may or may not be acceptable.
As a workaround, if the difference between 32-bit and 64-bit performance is very large (>2x), you can still consider using xxHash32 twice with 2 different seeds.
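A minimal sketch of that workaround, assuming xxhash.h is available (the helper name and the second seed value are illustrative, not part of xxHash):

    #include <stdint.h>
    #include <stddef.h>
    #include "xxhash.h"

    /* Hash the same buffer twice with two different seeds and pack the
     * results. This widens the spread to 64 bits of output, but it is
     * NOT equivalent to XXH64 -- just two independent 32-bit hashes. */
    static uint64_t XXH32_twice(const void* data, size_t len)
    {
        uint64_t h1 = XXH32(data, len, 0);          /* seed #1 */
        uint64_t h2 = XXH32(data, len, 0x9E3779B9); /* seed #2, arbitrary */
        return (h1 << 32) | h2;
    }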
Next potential performance impact: aligned vs unaligned loads, especially on ARM, and especially for 64-bit loads.
But let's take it one step at a time.
Original comment by yann.col...@gmail.com
on 11 Aug 2015 at 3:58
Hello,
I use XXH32() on an ARM Cortex A9, so a 32-bit CPU (and OS). Is XXH32() really the 32-bit version? I can't find xxHash32 in the header:
https://github.com/alphaonex86/CatchChallenger/blob/version-2/tools/xxhashbenchmark/xxhash.h
My current approach is the first 4 bytes of SHA-224; I need the hash to be identical on every CPU/OS so I can compare some files remotely. So xxHash32 is a good fit, isn't it?
My benchmark on aligned/unaligned access:
http://catchchallenger.first-world.info/wiki/Benchmark_for_conception#memcpy_vs_raw
My xxhash.* is extracted from the latest lz4 source.
Cheers,
Original comment by brule.he...@gmail.com
on 11 Aug 2015 at 4:22
I detect changed files to keep a folder of files in sync. A 1/2^32 collision probability is acceptable for me.
Original comment by brule.he...@gmail.com
on 11 Aug 2015 at 4:25
> I detect changed files to keep a folder of files in sync. A 1/2^32 collision probability is acceptable for me.
> My current approach is the first 4 bytes of SHA-224; I need the hash to be identical on every CPU/OS so I can compare some files remotely. So xxHash32 is a good fit, isn't it?
Yes, that sounds totally fine. xxHash is portable and endian-independent, so it shall give the same answer on any platform.
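For illustration, a minimal sketch of a whole-file checksum that is reproducible across platforms (a sketch only: error handling is omitted, and it assumes the file fits in memory):

    #include <stdio.h>
    #include <stdlib.h>
    #include "xxhash.h"

    /* returns the XXH32 checksum of a file; same value on any CPU/OS */
    static unsigned file_checksum(const char* path)
    {
        FILE* f = fopen(path, "rb");
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        fseek(f, 0, SEEK_SET);
        void* buf = malloc((size_t)size);
        fread(buf, 1, (size_t)size, f);
        unsigned h = XXH32(buf, (size_t)size, 0);  /* fixed seed */
        free(buf);
        fclose(f);
        return h;
    }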
> Is XXH32() really the 32-bit version?
Yes, it's the right function.
If that's the one you have already benchmarked, then we need to find another performance bottleneck.
Possibility #2: raw memory performance.
In some cases (embedded systems), accessing raw memory is simply very slow. Any access outside of the cache takes a huge performance toll.
Difficult to say from your results. You mention "35kH/s". I suspect it means 35K hashes per second. But then, what is hashed? And especially, what is its size?
For example, maybe your test sample fits nicely into an Intel L2 cache, but not within the ARM cache? In that case, the difference could be explained by the architecture alone.
Original comment by yann.col...@gmail.com
on 11 Aug 2015 at 4:33
> xxHash is portable and endian-independent
Where is that documented? I already have hardware -> little-endian conversion functions coded, if needed.
Yes, 35K hashes per second. SHA-224 is measured via the command "cryptsetup benchmark", and xxHash on 4096-byte data blocks (fits in L1). See:
https://github.com/alphaonex86/CatchChallenger/blob/version-2/tools/xxhashbenchmark/main.cpp#L8
hdparm -T /dev/zero gives 19471 MB/sec on my x86, 727 MB/sec on my ARM.
Grsec (mostly PaX) slows down memory access, fine, but then why is xxHash only at the same level as SHA-224? The memory allocator can change everything; if memory allocation/deallocation is happening, that may be the cause.
Original comment by brule.he...@gmail.com
on 11 Aug 2015 at 5:04
Another difference: on my x86, Linux is configured in desktop mode; on my ARM it's configured in server mode (scheduler configured in the kernel).
The ARM is slower overall, but only with xxHash do I see such a big slowdown.
Original comment by brule.he...@gmail.com
on 11 Aug 2015 at 5:11
OK, so the next candidate is aligned vs unaligned access.
This is only meaningful if, in your application, the data start is going to be aligned on 4-byte boundaries in most circumstances.
(This will be the case if you malloc() a buffer and just load your file into it.)
To ensure this property, prefer:
    unsigned int inputdata[1024];
in your test program.
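For instance, a minimal sketch (the fileblock source buffer is hypothetical):

    #include <string.h>
    #include "xxhash.h"

    /* an unsigned int array is guaranteed at least 4-byte alignment,
     * so XXH32() can take its aligned fast path */
    static unsigned hash_aligned_copy(const void* fileblock /* >= 4096 bytes */)
    {
        unsigned int inputdata[1024];                     /* 4096 bytes, aligned */
        memcpy(inputdata, fileblock, sizeof(inputdata));  /* data to hash */
        return XXH32(inputdata, sizeof(inputdata), 0);
    }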
This should help, but there is no guarantee.
The critical point is to ensure that the correct branch is taken by the code (which requires some debugging capability), especially within XXH32():

    if ((((size_t)input) & 3) == 0)   // <===== ensure this branch exists and its answer is "yes"
    {
        if ((endian_detected==XXH_littleEndian) || XXH_FORCE_NATIVE_FORMAT)
            return XXH32_endian_align(input, len, seed, XXH_littleEndian, XXH_aligned);
        else
            return XXH32_endian_align(input, len, seed, XXH_bigEndian, XXH_aligned);
    }
Original comment by yann.col...@gmail.com
on 11 Aug 2015 at 5:23
9090 H/s with unsigned int inputdata[1024];
With XXH_USE_UNALIGNED_ACCESS defined, it never goes into this section.
The input will always be aligned in my case (the benchmark matches the real use case).
Original comment by brule.he...@gmail.com
on 11 Aug 2015 at 5:44
On an RPi 1, without grsec/PaX: 13000 H/s for xxHash, 33000 H/s for SHA-256.
So it's worse than my previous ARM. XXH_USE_UNALIGNED_ACCESS is set there too.
Original comment by brule.he...@gmail.com
on 11 Aug 2015 at 5:55
> 9090 H/s with unsigned int inputdata[1024];
That looks like a slowdown?
> With XXH_USE_UNALIGNED_ACCESS defined, it never goes into this section.
The objective is to *not use unaligned access*, so make sure it is *not* defined!
The code should reach this variant:
    return XXH32_endian_align(input, len, seed, XXH_littleEndian, XXH_aligned);
In theory, the code should _automatically_ detect that its input is aligned, and use this opportunity to issue direct 32-bit memory accesses, without all the precautionary code meant for CPUs with poor unaligned access.
Original comment by yann.col...@gmail.com
on 11 Aug 2015 at 5:59
Yes, it seems to be a slowdown.
With XXH_USE_UNALIGNED_ACCESS undefined: 35000 H/s.
With #define XXH_FORCE_NATIVE_FORMAT 1: 200000 H/s.
Why is this done dynamically? Why not statically? Look at:
https://github.com/alphaonex86/CatchChallenger/blob/version-2/general/base/PortableEndian.h
Original comment by brule.he...@gmail.com
on 11 Aug 2015 at 6:12
if ((((size_t)input) & 3) == 0) is true here, so the input is aligned.
And endian_detected==XXH_littleEndian is true.
Original comment by brule.he...@gmail.com
on 11 Aug 2015 at 6:20
That was my next candidate...
The endian detection code is supposed to be summarized into a single test, at the beginning of the function:
    XXH_endianess endian_detected = (XXH_endianess)XXH_CPU_LITTLE_ENDIAN;
which then branches to the correct version of the algorithm. In many cases, the test itself is "hard-coded". But even if it is not, it should be done only once, and then propagated.
This is only true if the compiler does a good enough job at inlining. Hence the FORCE_INLINE within:
    FORCE_INLINE U32 XXH32_endian_align()
something which works well on all compilers I could test, but is difficult to guarantee on every compiler and every platform.
I also used to use a static #define for such detection, but I kept receiving reports from unforeseen platforms (CPU + OS) which were either undetected or, worse, wrongly detected. I therefore switched to dynamic detection, with the assumption that the compiler would do a good enough job of simplifying this test to a clean 0/1 result, and then removing branches wherever needed.
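For reference, a sketch of what such a dynamic test can look like (simplified; the actual xxHash source differs in the details):

    /* nominally a run-time test, but any decent optimizer constant-folds
     * it to 0 or 1 and then removes the dead branch entirely */
    static int XXH_isLittleEndian(void)
    {
        const union { unsigned u; unsigned char c[4]; } one = { 1 };
        return one.c[0];   /* 1 on little-endian, 0 on big-endian */
    }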
Original comment by yann.col...@gmail.com
on 11 Aug 2015 at 6:24
Here it's gcc 4.8, so a very common and up-to-date compiler. As a developer, you sometimes need to help the compiler a little bit ;)
Not detected is not a problem. Wrongly detected is a big problem, so being stricter with the filter can help: only known configurations get the static path: gcc 4.7+ on x86, on ARM, ...
Going dynamic means the result will depend heavily on the instruction pipeline, more than on the compiler; the compiler is mostly there for the static part.
As you have seen, on the RPi it's a big regression, and I think it should be even worse on other single-issue cores. Look:
RPi 1: single-issue: 2.5x slower
Cortex A9: partial dual-issue, in-order microarchitecture with an 8-stage pipeline: 1x, same speed
Cortex A7: partial dual-issue, in-order microarchitecture with an 8-stage pipeline: xxHash: 13849 H/s, SHA-256: 59578 H/s
x86 Intel Haswell: multiple-issue, 14- to 19-stage instruction pipeline: 2.6x FASTER than hardware-accelerated SHA-2 (which is already fast).
And I'm sure it's worse on:
Cortex A5: single-issue, in-order microarchitecture with an 8-stage pipeline
or AMD Geode. I don't have time to test those.
The inlined dynamic test is 14x slower in debug mode than in a normal build (-O2) on gcc.
The main goal of xxHash is to be faster than the others, mostly SHA-256, which is a true cryptographic hash.
I can provide you private access to an ARM test machine if you want.
Original comment by brule.he...@gmail.com
on 11 Aug 2015 at 7:47
FORCE_INLINE U32 XXH32_endian_align(const void* input, size_t len, U32 seed, XXH_endianess endian, XXH_alignment align) -> another code problem: while(){if} will always be slower than if{while1}else{while2}.
Conditions inside the while loop should be hoisted out of it, to improve performance where the branch-prediction hardware is poor/slow.
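A self-contained sketch of that if{while1}else{while2} transformation (loop unswitching); the reader helpers are hypothetical, not the xxHash source:

    #include <stdint.h>
    #include <stddef.h>

    /* hypothetical 32-bit readers, for illustration only */
    static uint32_t read32_le(const unsigned char* p)
    { return (uint32_t)p[0] | ((uint32_t)p[1]<<8) | ((uint32_t)p[2]<<16) | ((uint32_t)p[3]<<24); }
    static uint32_t read32_be(const unsigned char* p)
    { return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16) | ((uint32_t)p[2]<<8) | (uint32_t)p[3]; }

    /* before: the invariant endianness test sits inside the hot loop */
    static uint32_t sum_while_if(const unsigned char* p, size_t len, int littleEndian)
    {
        uint32_t acc = 0;
        const unsigned char* const limit = p + (len & ~(size_t)3);
        while (p < limit) {
            acc += littleEndian ? read32_le(p) : read32_be(p);  /* re-tested every pass */
            p += 4;
        }
        return acc;
    }

    /* after (unswitched): one test, two specialized loops */
    static uint32_t sum_if_while(const unsigned char* p, size_t len, int littleEndian)
    {
        uint32_t acc = 0;
        const unsigned char* const limit = p + (len & ~(size_t)3);
        if (littleEndian) { while (p < limit) { acc += read32_le(p); p += 4; } }
        else              { while (p < limit) { acc += read32_be(p); p += 4; } }
        return acc;
    }

This is the same transformation gcc's -funswitch-loops pass performs automatically at -O3, as confirmed further down the thread.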
Original comment by brule.he...@gmail.com
on 12 Aug 2015 at 8:27
That last tip provides a large boost: 8000 -> 85800 H/s, about 10.7x, on the RPi 1.
Dropping the XXH_endianess endian and XXH_alignment align parameters when XXH_USE_UNALIGNED_ACCESS and/or XXH_FORCE_NATIVE_FORMAT are defined would also help the compiler minimize stack/register pressure in a generic way, giving more performance on all architectures.
Original comment by brule.he...@gmail.com
on 12 Aug 2015 at 8:50
These are very good hints.
Indeed, having direct access to real hardware for tests is key to observing such effects.
> The inlined dynamic test is 14x slower in debug mode than in a normal build (-O2) on gcc.
Performance in debug mode does not matter. Debug mode is only there to check code correctness. It's not indicative of final performance.
What matters is that these "dynamic" tests get simplified to a static compile-time 0/1, and that the irrelevant branches are then removed during compilation (dead code elimination), producing clean, straightforward code.
As stated, this depends a lot on the compiler's optimizations. gcc 4.8 should be good enough for this, although its final effectiveness may vary depending on the target hardware.
Minor point: note that I'm compiling with -O3 rather than -O2. I'm not expecting a large difference, but maybe in this case...
> I can provide you private access to an ARM test machine if you want.
Yes, an SSH terminal to such a system would be welcome.
Original comment by yann.col...@gmail.com
on 12 Aug 2015 at 10:17
I can open you access to real hardware; I have an RPi 1, Cubieboard 2, Odroid-U3, Cubox-i, ... see more at
http://catchchallenger.first-world.info/wiki/Portable_datacenter
In part you are right, but if there is a way to improve performance in all cases, including without strong optimizations (-O2, loop interchange, ...), I think that's better. Even more so for profiling in debug mode.
The world has so many compilers, OSes and hardware targets. This simple thread shows that the theory "the compiler will do the job" is wrong. Though in this case, in my view, it's due to the code, not the compiler. The compiler also has to respect the code as written.
My previous comment shows something: when the code is badly written, we shouldn't put our hopes in the compiler, but fix the code.
I will send you the password privately, or send me your SSH public key:
http://ultracopier.first-world.info/contact.html
Original comment by brule.he...@gmail.com
on 12 Aug 2015 at 11:13
A useful link:
https://en.wikipedia.org/wiki/Loop_optimization
;)
Original comment by brule.he...@gmail.com
on 12 Aug 2015 at 11:35
Interesting note:
* gzip -1 vs lz4 -1 on x86: lz4 is 6.2x faster
* gzip -1 vs lz4 -1 on ARM: lz4 is 3.6x faster
Original comment by brule.he...@gmail.com
on 12 Aug 2015 at 2:16
That's why I'm interested in looking into such issues ;)
Original comment by yann.col...@gmail.com
on 12 Aug 2015 at 2:21
The SSH connection works.
I've downloaded xxHash from GitHub, as a comparison exercise on the target system.
cat /proc/cpuinfo
processor : 0
model name : ARMv6-compatible processor rev 7 (v6l)
BogoMIPS : 2.00
Features : half thumb fastmult vfp edsp java tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xb76
CPU revision : 7
Hardware : BCM2708
Revision : 0002
Right out of the box, using r40 and xxhsum's internal memory benchmark, I got the following figures:
    XXH32             :     7520 ->   267.0 MB/s   0x621B687C
    XXH32 (unaligned) :     7519 ->   242.6 MB/s
    XXH64             :     7520 ->   113.0 MB/s   0xC256B0CD7B86609B
Keeping in mind your own test hashes 4 KB segments, this translates into roughly 66 kH/s (267 MB/s / 4096 bytes).
This is less than the 86 kH/s reported by your benchmark tool, but not vastly below it either.
As a test, I compiled xxhsum using -O2 instead (the default Makefile uses -O3). Here are the results:
    XXH32             :     7520 ->    73.7 MB/s   0x621B687C
    XXH32 (unaligned) :     7519 ->    71.8 MB/s
    XXH64             :     7520 ->    85.0 MB/s   0xC256B0CD7B86609B
which is significantly worse. The difference on x86/x64 platforms is much less pronounced, so this impact is specific to this platform/compiler combination (as a side note, I see the installed GCC version is 4.7.4).
This translates into ~18 kH/s, which might be closer to what you were experiencing at the beginning.
I also re-used the xxhash.c version stored at the root directory, and compiled a new xxhsum with it. I expected it to feature the modifications you were talking about, and the corresponding performance improvements.
But I'm not sure it does, since performance is identical to vanilla r40, at both -O3 and -O2 settings.
This is in contrast with your test program, xxhashbenchmark, which displays:
    Result: 200000 in 2324ms
    Mean:86058H/s
    XXH_USE_UNALIGNED_ACCESS
I was expecting it to use your own xxHash variant to reach such a figure.
Original comment by yann.col...@gmail.com
on 12 Aug 2015 at 3:12
New test:
Using vanilla xxHash r40, but making sure that XXH_USE_UNALIGNED_ACCESS is disabled, i.e. the line is commented out:
    //# define XXH_USE_UNALIGNED_ACCESS 1
The xxhsum internal benchmark then gives:
    XXH32             :     7520 -> **419.6 MB/s**  0x621B687C
    XXH32 (unaligned) :     7519 ->   242.5 MB/s
    XXH64             :     7520 ->   121.7 MB/s    0xC256B0CD7B86609B
This is a great improvement, although only for 32-bit aligned access.
**420 MB/s** roughly translates into 105 kH/s, so that's even better than the previous result from xxhashbenchmark.
It really shows that precautionary code is being used when accessing non-aligned data, and that this precautionary code costs some performance.
As a side exercise, I also compiled xxhsum using -O2; here are the results:
    XXH32             :     7520 -> **419.6 MB/s**  0x621B687C
    XXH32 (unaligned) :     7519 ->    72.2 MB/s
    XXH64             :     7520 ->   121.6 MB/s    0xC256B0CD7B86609B
Interestingly, performance on aligned data is about the same as with -O3.
So basically, this seems to be the single point of improvement: make sure this define is not set.
I might as well remove it in a future revision...
Original comment by yann.col...@gmail.com
on 12 Aug 2015 at 3:14
Another test:
I forced the use of direct 32-bit accesses even on non-aligned data (it requires pretending that the data is aligned even when it's not). Here are the results:
    XXH32             :     7520 ->   404.6 MB/s   0x621B687C
    XXH32 (unaligned) :     7519 -> __387.5 MB/s__
    XXH64             :     7520 ->   121.6 MB/s   0xC256B0CD7B86609B
Interestingly enough, -O2 compilation settings produce the same speed.
This is a lot better for 32-bit unaligned access!
That means this CPU is able to access unaligned memory efficiently.
This is typical of the problems I've had with ARM:
Some CPUs are unable to access unaligned memory at all, and crash on such attempts.
Others are able to access it, but with poor performance.
Sometimes, performance is good for 32-bit accesses, but poor for 64-bit ones.
And finally, some are simply very good at handling unaligned memory accesses, almost Intel-like (which is best in class).
Handling all these use cases is difficult, and a gigantic set of #defines would be required. Even then, there are too many sub-families and specific behaviors, so it would never be generic enough.
The supposedly generic way is to trust the compiler: tell it to access memory, and let it choose how to do that best for the target CPU. The way to do it is typically:
    memcpy(memPtr, &value, sizeof(value));
On x86/x64 it works, and this function call is translated into a single mov instruction, since Intel CPUs handle unaligned memory accesses correctly and efficiently.
But on ARM, no such outcome. It turns out the compiler is in fact very cautious. I guess that's because it doesn't know the specificities of each CPU that well, and therefore tends to lean on the safe side. The compiled binary works; it's just vastly sub-optimal.
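As a sketch, the two read strategies under discussion side by side (illustrative names, not the actual xxHash identifiers):

    #include <stdint.h>
    #include <string.h>

    /* portable: let the compiler pick the best load for the target CPU;
     * compiles to a single mov on x86/x64, cautious code on some ARMs */
    static uint32_t read32_portable(const void* memPtr)
    {
        uint32_t value;
        memcpy(&value, memPtr, sizeof(value));
        return value;
    }

    /* direct: force a plain 32-bit load; fast, but the pointer is then
     * assumed aligned, which can crash strict-alignment CPUs and, under
     * auto-vectorization (GCC 5+), produce buggy binaries when unaligned */
    static uint32_t read32_direct(const void* memPtr)
    {
        return *(const uint32_t*)memPtr;
    }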
The only way I've found to counterbalance this problem is to add a manual switch (#define) to force the compiler to issue direct instructions.
This method has 2 drawbacks:
- Most people don't realize that the switch is present. They just compile the code "as is". So any such switch tends to be overlooked.
- Forcing the use of direct accesses can have nasty side effects on some advanced optimizations. For example, on x64, newer compilers (GCC 5+) try to auto-vectorize some portions of code. A direct access is then assumed to be a guarantee of alignment. If it's not, it results in buggy binaries.
So basically, this switch must stay manual, so that the user takes full responsibility for the choice. If it crashes, it's enough to revert it. But I can't have "default" switches which may produce buggy binaries.
Anyway, one big issue observed here is that the XXH_USE_UNALIGNED_ACCESS switch does the reverse of what it's supposed to do => it slows down memory accesses on CPUs which can handle them efficiently! This is the result of several changes made with no way to check them on real hardware. That's conclusion enough to look for a correction.
Original comment by yann.col...@gmail.com
on 12 Aug 2015 at 3:19
Do -O2 and -O3 make a big difference? If yes, the C code needs changing to improve performance with the most common packaging of lz4-based programs. Analyzing the generated code can be useful.
Original comment by brule.he...@gmail.com
on 12 Aug 2015 at 3:39
-O2 and -O3 have identical performance with direct memory accesses.
Differences appear when the compiler generates "cautious" code (for unaligned memory accesses).
Original comment by yann.col...@gmail.com
on 12 Aug 2015 at 3:52
The latest version of xxHash in the "dev" branch seems to improve this situation by a fairly large margin:
https://github.com/Cyan4973/xxHash/tree/dev
To test it, you can either sync your git directory and check out the dev branch, or download the automated GitHub archive:
https://github.com/Cyan4973/xxHash/archive/dev.zip
This version offers the same performance at -O2 and -O3 levels on the ARM RPi target.
It seems to automatically enable direct unaligned memory accesses, meaning performance is good in both cases (aligned & unaligned):
    XXH32             :   102400 ->   406.7 MB/s   0x955F2650   (for both -O2 and -O3)
    XXH32 (unaligned) :   102399 ->   392.0 MB/s                (for both -O2 and -O3)
Otherwise, I noticed that when this option is manually turned off (by commenting out XXH_FORCE_DIRECT_MEMORY_ACCESS), aligned performance is slightly better, while unaligned performance is vastly worse:
    XXH32             :   102400 ->   422.6 MB/s   0x955F2650   (for both -O2 and -O3)
    XXH32 (unaligned) :   102399 ->   244.3 MB/s                (-O3)
    XXH32 (unaligned) :   102399 ->    72.4 MB/s                (-O2)
Not sure why, and not sure whether the tiny gain in the aligned case is worth the large drop in the unaligned one.
Test system specs:
lscpu
Architecture: armv6l
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
Model name: ARMv6-compatible processor rev 7 (v6l)
CPU max MHz: 700.0000
CPU min MHz: 700.0000
Original comment by yann.col...@gmail.com
on 12 Aug 2015 at 5:33
For integration into a program, good performance is needed in all cases; if it's sometimes much better than SHA-256 and sometimes much worse, I'd rather use SHA-256 in my program, because CPUs will increasingly be helped by cryptographic hardware.
The difference between -O2 and -O3 can be seen with: gcc -Q -O2 --help=optimizers and gcc -Q -O3 --help=optimizers
On this gcc on the RPi, -O3 additionally enables: -fgcse-after-reload -finline-functions -fipa-cp-clone -fpredictive-commoning -ftree-loop-distribute-patterns -ftree-vectorize -funswitch-loops
That means it's possible to force the same effect by changing the code. We also need to determine which flags have which influence.
Original comment by brule.he...@gmail.com
on 12 Aug 2015 at 5:57
Have you been able to test the newer xxHash version? (dev branch)
I'm considering pushing it to master and making it a new release (r41), as the results on ARM systems look pretty good so far.
Original comment by yann.col...@gmail.com
on 13 Aug 2015 at 9:03
It's better, but on the RPi test system:
23187 H/s for xxHash
32125 H/s for SHA-256
So it remains slower than SHA-256.
Original comment by brule.he...@gmail.com
on 13 Aug 2015 at 10:21
Strange, RPi results should be > 100 kH/s (when hashing 4 KB segments for the test).
Original comment by yann.col...@gmail.com
on 13 Aug 2015 at 10:40
Within the -O3 flag set, -funswitch-loops is the one that delivers the 48543 H/s for xxHash. So now some code modification should achieve the same effect.
Original comment by brule.he...@gmail.com
on 13 Aug 2015 at 10:43
I confirm the effect of -funswitch-loops when using your test program:
-O2:
    Result: 200000 in 2412ms
    Mean:82918H/s
-O2 -funswitch-loops (or -O3):
    Result: 200000 in 2037ms
    Mean:98183H/s
What's strange is that I don't need it for my own test program:
-O2:
    ./xxhsum -b
    XXH32             :   102400 ->   406.8 MB/s   0x955F2650
    XXH32 (unaligned) :   102399 ->   392.0 MB/s
-O2 -funswitch-loops (or -O3):
    ./xxhsum -b
    XXH32             :   102400 ->   406.3 MB/s   0x955F2650
    XXH32 (unaligned) :   102399 ->   391.5 MB/s
Basically, the same result. I don't really understand why.
I just note that xxhsum is "pure C", while your test program is a mix of C (xxhash) and C++ (main), therefore requiring g++ for the linking stage. That's the only major difference I can see.
Original comment by yann.col...@gmail.com
on 13 Aug 2015 at 11:50
I use C++11 or C11. I don't have time to analyze the generated assembly, but it could be interesting.
That's the problem: the actual performance depends on the compiler's optimization passes.
Original comment by brule.he...@gmail.com
on 13 Aug 2015 at 12:35
Performance fix published in r42
(now hosted on GitHub: https://github.com/Cyan4973/xxHash/releases/tag/r42 )
Original comment by yann.col...@gmail.com
on 21 Aug 2015 at 10:57
Original issue reported on code.google.com by
brule.he...@gmail.com
on 11 Aug 2015 at 3:42