Fix typo:
" ... processing in interleaved blocks of 32-bytes ..."
change to:
" ... processing in blocks of 32 bytes (two 16 byte blocks interleaved) ..."
Original comment by moinakg
on 23 Jan 2013 at 3:50
Thanks for the proposed optimisation.
I've not yet investigated your submitted source code, but I have a simple
question to begin with:
Is your proposed modification compatible with the current version of xxHash, or
does it produce a different hash value?
Original comment by yann.col...@gmail.com
on 31 Jan 2013 at 1:40
Yes, unfortunately it does change the hash values from the original. However,
both the SIMD and non-SIMD code are updated to process data in 32-byte blocks
so that they produce identical outputs.
Original comment by moinakg
on 31 Jan 2013 at 2:19
OK, thanks for the clarification.
A few months ago, this would have been a non-issue, since there was no notable
real-world use of xxHash.
But that has changed lately. The Java version is currently in use within Lucene
and Solr, and the Ruby version seems to have reached an impressive popularity
among Rubygems developers.
Therefore, breaking changes should be considered with more caution.
Original comment by yann.col...@gmail.com
on 31 Jan 2013 at 4:13
Yes, this is a consideration. Maybe there could be a new variant, like xxhash-ng
or some such. BTW, I found out that this approach is actually an implementation
of j-lanes hashing with j=2, which is described in the following paper:
http://eprint.iacr.org/2012/476.pdf
Original comment by moinakg
on 1 Feb 2013 at 2:34
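For illustration, here is a minimal sketch (in C) of the j=2 lane idea described above: two independent sets of xxHash-style accumulators consume alternating 16-byte blocks of each 32-byte super-block and are combined at the end. The primes are the real xxHash constants, but the initialization and final combination below are purely illustrative and do not match Pcompress or xxHash.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define PRIME32_1 2654435761U
#define PRIME32_2 2246822519U

static uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

/* One xxHash-style round on a single 32-bit accumulator. */
static uint32_t round32(uint32_t acc, uint32_t input)
{
    acc += input * PRIME32_2;
    acc  = rotl32(acc, 13);
    acc *= PRIME32_1;
    return acc;
}

/* j = 2 lanes: lane 0 hashes even 16-byte blocks, lane 1 hashes odd ones.
 * Each lane keeps 4 accumulators, exactly like the scalar xxHash loop. */
static uint32_t hash_j2(const uint8_t *p, size_t len, uint32_t seed)
{
    uint32_t lane[2][4];
    for (int l = 0; l < 2; l++)
        for (int i = 0; i < 4; i++)
            lane[l][i] = seed + PRIME32_1 + (uint32_t)i;   /* illustrative init */

    const uint8_t *end = p + (len & ~(size_t)31);          /* whole 32-byte super-blocks only */
    while (p < end) {
        for (int l = 0; l < 2; l++) {                      /* two interleaved 16-byte blocks */
            for (int i = 0; i < 4; i++) {
                uint32_t w;
                memcpy(&w, p, 4);                          /* unaligned-safe 32-bit read */
                lane[l][i] = round32(lane[l][i], w);
                p += 4;
            }
        }
    }
    /* Illustrative combination of the two lanes (not the real xxHash finalization). */
    uint32_t h = (uint32_t)len;
    for (int l = 0; l < 2; l++)
        for (int i = 0; i < 4; i++)
            h = rotl32(h, 1) + lane[l][i];
    return h;
}

Because the two lanes never depend on each other, they can be processed as independent instruction streams or packed side by side into SIMD registers.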
[deleted comment]
Thanks for the link, it's a good read.
Interestingly, the paper was published after xxHash's initial open-source
release (April vs August).
Yet another demonstration that the same ideas can appear in several places.
Original comment by yann.col...@gmail.com
on 1 Feb 2013 at 3:53
>Without SIMD but with 32-byte blocks I am seeing 36XX MB/s compared to 35XX
MB/s in the original (on my Core i5).
Is that x86 or x64 performance? I bet that 256 bits is too much for the IA32
architecture, which has only 6 or 7 general-purpose registers. So the original
128-bit-wide code is optimal for x86, yours for x64, and for SSE/AVX something
like a 1024/2048-bit beast would be the optimum. I also have a GPU on the
board... :)
Original comment by bulat.zi...@gmail.com
on 10 Feb 2013 at 12:20
I am using only x64. Yes, with pure 32-bit code there will be lots of memory
moves, and 256-bit width is counterproductive. However, there is the new X32 ABI
(circa Aug 2011) that allows all the x64 registers to be used with 32-bit
addressing:
http://lwn.net/Articles/456731/
https://sites.google.com/site/x32abi/
With GCC 4.6+ the -mx32 flag can be used. A disassembly snippet using -m32:
c9: 69 d2 b1 79 37 9e imul $0x9e3779b1,%edx,%edx
cf: 89 54 24 0c mov %edx,0xc(%esp)
d3: 69 50 0c 77 ca eb 85 imul $0x85ebca77,0xc(%eax),%edx
da: 03 54 24 08 add 0x8(%esp),%edx
de: c1 c2 0d rol $0xd,%edx
e1: 69 d2 b1 79 37 9e imul $0x9e3779b1,%edx,%edx
e7: 89 54 24 08 mov %edx,0x8(%esp)
eb: 69 d6 b1 79 37 9e imul $0x9e3779b1,%esi,%edx
f1: 89 54 24 10 mov %edx,0x10(%esp)
f5: 69 d3 b1 79 37 9e imul $0x9e3779b1,%ebx,%edx
fb: 89 54 24 14 mov %edx,0x14(%esp)
ff: 69 50 1c 77 ca eb 85 imul $0x85ebca77,0x1c(%eax),%edx
106: 83 c0 20 add $0x20,%eax
109: 01 fa add %edi,%edx
149: c1 c6 0d rol $0xd,%esi
Memory moves are evident. Now with -mx32:
f7: 45 69 e8 b1 79 37 9e imul $0x9e3779b1,%r8d,%r13d
fe: 44 69 f7 b1 79 37 9e imul $0x9e3779b1,%edi,%r14d
105: 69 d6 b1 79 37 9e imul $0x9e3779b1,%esi,%edx
10b: 44 69 f9 b1 79 37 9e imul $0x9e3779b1,%ecx,%r15d
112: 45 39 d1 cmp %r10d,%r9d
115: 0f 83 3d ff ff ff jae 58 <XXH32_SSE2+0x58>
11b: 45 69 c0 47 3b 4a fc imul $0xfc4a3b47,%r8d,%r8d
122: 4c 8b 4c 24 f8 mov -0x8(%rsp),%r9
127: 69 ff 47 3b 4a fc imul $0xfc4a3b47,%edi,%edi
12d: 44 01 c5 add %r8d,%ebp
130: 69 f6 47 3b 4a fc imul $0xfc4a3b47,%esi,%esi
136: 01 fb add %edi,%ebx
138: c1 c5 0d rol $0xd,%ebp
Original comment by moinakg
on 10 Feb 2013 at 1:43
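For reference, disassembly like the snippets above can be produced along these lines (the file name is illustrative, and -mx32 additionally requires a toolchain and kernel built with x32 support):

gcc -O3 -m32  -c xxhash.c -o xxhash_m32.o  && objdump -d xxhash_m32.o
gcc -O3 -mx32 -c xxhash.c -o xxhash_x32.o  && objdump -d xxhash_x32.o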
-mx32 needs a 64-bit computer, Linux with the latest kernel, and the use of only
C++ and only open-source libraries. But there is a better solution to the
problem: just unroll more and process several even 16-byte chunks before
swapping reg-mem and processing the corresponding odd chunks.
BTW, I would rather write to
http://encode.ru/threads/1538-Hash-Checksum-algorithm-considerations since
Google requires me to solve a captcha if more than 1 msg/day is written here.
Original comment by bulat.zi...@gmail.com
on 11 Feb 2013 at 5:31
> Maybe there can be a new variant like xxhash-ng or some such.
This is probably the better way to proceed.
I'll keep you informed.
Original comment by yann.col...@gmail.com
on 12 Feb 2013 at 1:26
Just for fun, I decided to try making an alternate patch that is compatible
with the existing xxHash implementation. I've only tested this using VS2012, so
I don't know how well it works on other compilers. But it should compile just
fine (the CPUID detection will return zero on unknown compilers--meaning that
SSE won't get used).
To keep my patch simple (it was done fairly quickly), I have only optimized the
XXH32 function. Though I don't think it would be terribly difficult to apply it
to the streaming version.
I've only tested this with a few random inputs, but they all match the original
implementation.
I haven't tested the performance at all, but I suspect it's faster with larger
inputs.
Original comment by thomaska...@gmail.com
on 26 Apr 2013 at 2:51
Attachments:
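For illustration, a minimal sketch of this kind of runtime dispatch (not the attached patch itself): __builtin_cpu_supports needs GCC 4.8+, and __cpuid comes from <intrin.h> on MSVC. XXH32 would then select the SSE or scalar loop based on the result.

#if defined(_MSC_VER)
#  include <intrin.h>
#endif

/* Returns non-zero if the CPU reports SSE4.1 support; returns 0 on unknown
 * compilers, so the scalar code path is used there. */
static int has_sse41(void)
{
#if defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, 1);
    return (regs[2] >> 19) & 1;   /* CPUID.01H:ECX bit 19 = SSE4.1 */
#elif defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8))
    return __builtin_cpu_supports("sse4.1");
#else
    return 0;                     /* unknown compiler: stay on the scalar path */
#endif
}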
Nice patch, Thomas!
It will be interesting to benchmark it.
Moinakg's tests seem to point out that an improvement through SSE2+ is not
guaranteed.
Original comment by yann.col...@gmail.com
on 26 Apr 2013 at 12:16
Nice patch, especially the CPUID part, which I missed. However, if you read
about my experiences on my blog, doing a single hash round in the loop using SSE
may not provide a speed boost, since the processor's multiple execution units
are not fully utilized.
Original comment by moinakg
on 26 Apr 2013 at 2:40
I've placed my speed analysis here:
http://encode.ru/threads/1538-Hash-Checksum-algorithm-considerations?p=32955&viewfull=1#post32955
although it may easily be wrong since I just started to learn SSE :)
Original comment by bulat.zi...@gmail.com
on 26 Apr 2013 at 8:38
I finally went through and tested my SSE 4.1 patch with smhasher. Sadly, it
runs a lot slower on my Ivy Bridge processor. Using xxHash without any SSE
support, I was getting approximately 2.204 bytes/cycle - 6306.59 MiB/sec @ 3
GHz (alignment 0). I determined that the SSE check it does actually slows
down performance considerably (1.191 bytes/cycle - 3407.67 MiB/sec @ 3 GHz). If
I remove the check (always using SSE 4.1 instructions, not recommended for
production use), it is faster, but still slower than the original code (1.978
bytes/cycle - 5658.57 MiB/sec @ 3 GHz).
I also tested a 64-bit build and the results were nearly identical.
So it probably suffers from the issues that moinakg is talking about, and thus,
for now, I don't think there is any reason to merge the SSE patch (at least for
modern processors). I'm thinking it's just a proof-of-concept patch to show
that it is indeed possible to vectorize the first stage.
Original comment by thomaska...@gmail.com
on 26 Apr 2013 at 10:15
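(For reference, the paired figures above are consistent with MiB/s ≈ bytes/cycle × clock: 2.204 B/cycle × 3×10^9 cycles/s ≈ 6.6×10^9 B/s ≈ 6300 MiB/s, and 1.978 B/cycle × 3×10^9 ≈ 5660 MiB/s.)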
Yes, results seem comparable.
The only credible lead I've read so far is the possibility of loading the 2nd
SSE register while making calculations on the first. I'm not even sure whether
that's possible or whether it has already been done, btw.
Original comment by yann.col...@gmail.com
on 28 Apr 2013 at 8:41
Yann, since modern CPUs execute instructions speculatively, the CPU by itself
loads the next data while waiting for the previous computation.
Original comment by bulat.zi...@gmail.com
on 10 Jul 2013 at 4:34
Hi Bulat,
From a cache perspective, yes, the CPU starts to fill the next cache line before
it gets requested if it detects a linear access pattern, which is quite common.
My comment was more related to the register level. If SSE registers are too
scarce, then it's not possible to "load the next data segment while
calculating". This is in contrast with the 32-bit version, which can achieve
just that.
For more details, I would invite you to read the excellent analysis by Moinakg at:
http://moinakg.wordpress.com/2013/01/19/vectorizing-xxhash-for-fun-and-profit/
Original comment by yann.col...@gmail.com
on 11 Jul 2013 at 2:31
Yes, I also meant loading from L1 cache. xxHash needs just 1 SSE register for
the accumulator plus 2 SSE registers for constants, so you have a good reserve
of them. You can even use a load-and-add instruction, and the CPU will start the
load before it is ready to perform the addition, since the load and the add are
two separate micro-ops.
BTW, I have added a new thread, and posted to your old thread, about fast
hashing. Please look at encode.ru.
Original comment by bulat.zi...@gmail.com
on 11 Jul 2013 at 3:06
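As a rough sketch of the register budget and load-plus-add pattern described above (assuming SSE4.1 intrinsics; this is not code from any patch in this thread): the accumulator and the broadcast prime constants occupy only a few XMM registers, and the unaligned load decodes into its own micro-op, so it can be issued while earlier arithmetic is still in flight.

#include <smmintrin.h>   /* SSE4.1, for _mm_mullo_epi32 */

/* acc holds 4 x 32-bit accumulators; prime2 is a broadcast constant,
 * e.g. _mm_set1_epi32(2246822519U).  The load and the add are separate
 * micro-ops, so the next block can start loading before the add executes. */
static __m128i accumulate_sse(__m128i acc, const void *p, __m128i prime2)
{
    __m128i data = _mm_loadu_si128((const __m128i *)p);
    return _mm_add_epi32(acc, _mm_mullo_epi32(data, prime2));
}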
I've been testing this idea many times this year, with subtle variations, and
always with the same result: the vectorized version is slower.
I believe the key points are:
- Absence of a vectorized rotl instruction. It can be emulated (a sketch follows
below), but that costs several operations and registers, nullifying the
advantage of vectorization.
- xxHash is very favorable to ILP (Instruction-Level Parallelism), boosting
performance on modern out-of-order CPUs when using the traditional instruction
set, and reducing the gain to be expected from a vector unit.
As a consequence, I believe this issue will be closed, at least as long as we
don't have a decent rotl instruction available in the vector unit.
Original comment by yann.col...@gmail.com
on 4 Dec 2013 at 1:46
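For reference, the emulation mentioned in the first point above looks roughly like this with SSE2 intrinsics (xxHash rotates by 13, so the right-shift count is 32 - 13 = 19); the single scalar rol becomes two shifts plus an or:

#include <emmintrin.h>   /* SSE2 */

/* Rotate each 32-bit lane left by 13 bits.  SSE and AVX2 have no vector
 * rotate instruction, so each rotation costs two shifts and an or. */
static __m128i rotl13_epi32(__m128i v)
{
    return _mm_or_si128(_mm_slli_epi32(v, 13), _mm_srli_epi32(v, 19));
}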
That is odd, since I always see a benefit, albeit a small one. Which CPU did you
test on?
Original comment by moinakg
on 4 Dec 2013 at 5:54
Previously it was a Core 2 Duo,
but today, it is a Core i5-3340M (Ivy Bridge)
Original comment by yann.col...@gmail.com
on 4 Dec 2013 at 6:17
Note also:
I'm talking about a version which is fully compatible with the current code,
not a modified one which would be more "tuned" for SIMD.
Regards
Original comment by yann.col...@gmail.com
on 4 Dec 2013 at 6:18
The current approach interleaves blocks of data that are too small to show a
SIMD benefit. A vector rotl might give just a small benefit.
I have experimented with this as well, and I believe there is no way to benefit
from SIMD while retaining compatibility with the current hash values. This is
why I have a new version (in Pcompress) for SIMD, with the non-SIMD variant
updated as well to be compatible with the SIMD variant.
I have seen other parallel hashing approaches, like tree hashing, that are
likewise not compatible with the regular versions.
Original comment by moinakg
on 8 Dec 2013 at 3:15
Yes, I know, your modified version is faster with vectorized instructions.
Unfortunately, with the algorithm already deployed, it is mandatory for XXH32()
to keep full compatibility with the existing version.
I was wondering lately whether the newest AVX instructions could introduce any
new possibility to consider, which might change the conclusion compared to SSE.
Original comment by yann.col...@gmail.com
on 8 Dec 2013 at 10:17
Original issue reported on code.google.com by
moinakg
on 23 Jan 2013 at 3:47
Attachments: