dmolik / xxhash

Automatically exported from code.google.com/p/xxhash

Vectorized XXHash #4

Closed (GoogleCodeExporter closed this issue 9 years ago)

GoogleCodeExporter commented 9 years ago
Attaching my diffs for SIMD vectorization of xxHash on x86 platforms. This 
changes the hashing approach in general to process data in 32-byte blocks, with 
or without SIMD, so that both paths produce uniform hash values.

Even without SIMD, processing in interleaved 32-byte blocks is about 3% faster 
than the original: without SIMD but with 32-byte blocks I am seeing 36XX MB/s 
compared to 35XX MB/s for the original (on my Core i5). With SIMD it is 
anywhere from 12% to 15% faster.

These diffs also fix the earlier issue of slightly degraded small-key 
performance.

Unfortunately, the SIMD version requires at least SSE 4.1. Performance with 
SSE2 is not good: SSE2 lacks _mm_mullo_epi32(), and emulating its behaviour 
requires a bunch of instructions. The combined overhead of those instructions 
is large enough to take away the vectorization benefit altogether.
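
For context, a minimal sketch of the usual SSE2 workaround, assuming nothing 
about the attached diffs: the low 32 bits of each lane's product have to be 
reassembled from two widening multiplies.

    #include <emmintrin.h>  /* SSE2 */

    /* One common SSE2 emulation of _mm_mullo_epi32(): two 32x32->64 widening
     * multiplies (even and odd lanes), then keep the low 32 bits of each
     * product and re-interleave them. */
    static __m128i mullo_epi32_sse2(__m128i a, __m128i b)
    {
        __m128i even = _mm_mul_epu32(a, b);                                       /* lanes 0 and 2 */
        __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4)); /* lanes 1 and 3 */
        return _mm_unpacklo_epi32(_mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)),
                                  _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0)));
    }

Half a dozen instructions per vector multiply, instead of one, is the overhead 
referred to above.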

Original issue reported on code.google.com by moinakg on 23 Jan 2013 at 3:47

Attachments:

GoogleCodeExporter commented 9 years ago
Fix typo:

" ... processing in interleaved blocks of 32-bytes ..."

change to:

" ... processing in blocks of 32 bytes (two 16 byte blocks interleaved) ..." 

Original comment by moinakg on 23 Jan 2013 at 3:50

GoogleCodeExporter commented 9 years ago
Thanks for the proposed optimisation.

I've not yet investigated your submitted source code, but I have a simple 
question to begin with: is your proposed modification compatible with the 
current version of xxHash, or does it produce different hash values?

Original comment by yann.col...@gmail.com on 31 Jan 2013 at 1:40

GoogleCodeExporter commented 9 years ago
Yes, unfortunately it does change the hash values from the original. However, 
both the SIMD and non-SIMD code paths are updated to process data in 32-byte 
blocks, so they produce identical outputs.

Original comment by moinakg on 31 Jan 2013 at 2:19

GoogleCodeExporter commented 9 years ago
OK, thanks for the clarification.

A few months ago, this would have been a non-issue, since there was no notable 
real-world use of xxHash.

But that has changed lately. The Java version is currently in use within Lucene 
and Solr, and the Ruby version seems to have reached an impressive popularity 
among Rubygems developers.
Therefore, breaking changes should be considered with more caution.

Original comment by yann.col...@gmail.com on 31 Jan 2013 at 4:13

GoogleCodeExporter commented 9 years ago
Yes, this is a consideration. Maybe there can be a new variant like xxhash-ng 
or some such. BTW, I found out that this approach is actually an implementation 
of j-lanes hashing with j=2, as described in the following paper:
http://eprint.iacr.org/2012/476.pdf
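
For illustration only, a scalar sketch of the j=2 idea using the xxHash32 round 
and primes; the single accumulator per lane and the final combine are 
simplifications of mine, not the attached diffs and not the paper's exact 
construction.

    #include <stddef.h>
    #include <stdint.h>

    #define PRIME32_1 2654435761u
    #define PRIME32_2 2246822519u

    static uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

    /* The standard xxHash32 per-word round. */
    static uint32_t xxh32_round(uint32_t acc, uint32_t input)
    {
        acc += input * PRIME32_2;
        acc  = rotl32(acc, 13);
        return acc * PRIME32_1;
    }

    /* j-lanes with j=2: even words feed lane A, odd words feed lane B,
     * and the two lane results are merged at the end (placeholder merge). */
    static uint32_t jlanes2_sketch(const uint32_t *words, size_t nwords, uint32_t seed)
    {
        uint32_t laneA = seed + PRIME32_1 + PRIME32_2;
        uint32_t laneB = seed + PRIME32_2;
        size_t   i;

        for (i = 0; i + 2 <= nwords; i += 2) {
            laneA = xxh32_round(laneA, words[i]);
            laneB = xxh32_round(laneB, words[i + 1]);
        }
        return rotl32(laneA, 1) + rotl32(laneB, 7);  /* placeholder combine */
    }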

Original comment by moinakg on 1 Feb 2013 at 2:34

GoogleCodeExporter commented 9 years ago
Thanks for the link, it's a good read.

Interestingly, the paper was published after xxHash's initial open-source 
release (April vs August).
Yet another demonstration that the same ideas can appear in several places.

Original comment by yann.col...@gmail.com on 1 Feb 2013 at 3:53

GoogleCodeExporter commented 9 years ago
> Without SIMD but with 32-byte blocks I am seeing 36XX MB/s compared to 35XX 
MB/s in the original (on my Core i5).

Is that x86 or x64 performance? I bet that 256 bits is too much for the IA32 
architecture, which has only 6 or 7 general-purpose registers. So the original 
128-bit-wide code is optimal for x86, yours for x64, and for SSE/AVX something 
like a 1024/2048-bit beast would be the optimum. I also have a GPU on the 
board... :)

Original comment by bulat.zi...@gmail.com on 10 Feb 2013 at 12:20

GoogleCodeExporter commented 9 years ago
I am using only x64. Yes, with pure 32-bit there will be lots of memory moves, 
and a 256-bit width is counterproductive. However, there is the new X32 ABI 
(circa Aug 2011) that allows all the x64 registers to be used with 32-bit 
addressing:

http://lwn.net/Articles/456731/
https://sites.google.com/site/x32abi/

With GCC 4.6+ the -mx32 flag can be used. A disassembly snippet using -m32:

  c9:   69 d2 b1 79 37 9e       imul   $0x9e3779b1,%edx,%edx
  cf:   89 54 24 0c             mov    %edx,0xc(%esp)
  d3:   69 50 0c 77 ca eb 85    imul   $0x85ebca77,0xc(%eax),%edx
  da:   03 54 24 08             add    0x8(%esp),%edx
  de:   c1 c2 0d                rol    $0xd,%edx
  e1:   69 d2 b1 79 37 9e       imul   $0x9e3779b1,%edx,%edx
  e7:   89 54 24 08             mov    %edx,0x8(%esp)
  eb:   69 d6 b1 79 37 9e       imul   $0x9e3779b1,%esi,%edx
  f1:   89 54 24 10             mov    %edx,0x10(%esp)
  f5:   69 d3 b1 79 37 9e       imul   $0x9e3779b1,%ebx,%edx
  fb:   89 54 24 14             mov    %edx,0x14(%esp)
  ff:   69 50 1c 77 ca eb 85    imul   $0x85ebca77,0x1c(%eax),%edx
 106:   83 c0 20                add    $0x20,%eax
 109:   01 fa                   add    %edi,%edx
 149:   c1 c6 0d                rol    $0xd,%esi

Memory moves are evident. Now with -mx32:

  f7:   45 69 e8 b1 79 37 9e    imul   $0x9e3779b1,%r8d,%r13d
  fe:   44 69 f7 b1 79 37 9e    imul   $0x9e3779b1,%edi,%r14d
 105:   69 d6 b1 79 37 9e       imul   $0x9e3779b1,%esi,%edx
 10b:   44 69 f9 b1 79 37 9e    imul   $0x9e3779b1,%ecx,%r15d
 112:   45 39 d1                cmp    %r10d,%r9d
 115:   0f 83 3d ff ff ff       jae    58 <XXH32_SSE2+0x58>
 11b:   45 69 c0 47 3b 4a fc    imul   $0xfc4a3b47,%r8d,%r8d
 122:   4c 8b 4c 24 f8          mov    -0x8(%rsp),%r9
 127:   69 ff 47 3b 4a fc       imul   $0xfc4a3b47,%edi,%edi
 12d:   44 01 c5                add    %r8d,%ebp
 130:   69 f6 47 3b 4a fc       imul   $0xfc4a3b47,%esi,%esi
 136:   01 fb                   add    %edi,%ebx
 138:   c1 c5 0d                rol    $0xd,%ebp

Original comment by moinakg on 10 Feb 2013 at 1:43

GoogleCodeExporter commented 9 years ago
-mx32 needs a 64-bit computer, Linux with a recent kernel, and limits you to 
C++ and open-source libraries. But there is a better solution to the problem: 
just unroll more and process several even 16-byte chunks before swapping 
reg-mem and processing the corresponding odd chunks.

BTW, I would rather write to 
http://encode.ru/threads/1538-Hash-Checksum-algorithm-considerations since 
Google requires me to solve their captcha if I write more than one message per 
day here.

Original comment by bulat.zi...@gmail.com on 11 Feb 2013 at 5:31

GoogleCodeExporter commented 9 years ago
> Maybe there can be a new variant like xxhash-ng or some such.

This is probably the better way to proceed.
I'll keep you informed.

Original comment by yann.col...@gmail.com on 12 Feb 2013 at 1:26

GoogleCodeExporter commented 9 years ago
Just for fun, I decided to try making an alternate patch that is compatible 
with the existing xxHash implementation. I've only tested this using VS2012, so 
I don't know how well it works on other compilers, but it should compile just 
fine (the CPUID detection will return zero on unknown compilers, meaning that 
SSE won't get used).

To keep my patch simple (it was done fairly quickly), I have only optimized the 
XXH32 function. Though I don't think it would be terribly difficult to apply it 
to the streaming version.

I've only tested this with a few random inputs, but they all match the original 
implementation.

I haven't tested the performance at all, but I suspect it's faster with larger 
inputs.
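
For reference, a rough sketch of the kind of guarded CPUID check described in 
that patch; the helper name and structure here are mine and may not match the 
attachment.

    #if defined(_MSC_VER)
    #  include <intrin.h>
    #elif defined(__GNUC__)
    #  include <cpuid.h>
    #endif

    /* Returns non-zero if the CPU reports SSE 4.1, and zero on compilers
     * where no CPUID intrinsic is known (falling back to scalar code). */
    static int cpu_has_sse41(void)
    {
    #if defined(_MSC_VER)
        int regs[4];
        __cpuid(regs, 1);
        return (regs[2] >> 19) & 1;            /* CPUID.1:ECX bit 19 = SSE4.1 */
    #elif defined(__GNUC__)
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 0;
        return (ecx >> 19) & 1;                /* CPUID.1:ECX bit 19 = SSE4.1 */
    #else
        return 0;                              /* unknown compiler: no SSE path */
    #endif
    }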

Original comment by thomaska...@gmail.com on 26 Apr 2013 at 2:51

Attachments:

GoogleCodeExporter commented 9 years ago
Nice patch, Thomas!

It will be interesting to benchmark it.
Moinakg's tests seem to point out that an improvement through SSE2+ is not 
guaranteed.

Original comment by yann.col...@gmail.com on 26 Apr 2013 at 12:16

GoogleCodeExporter commented 9 years ago
Nice patch, especially the CPUID part, which I missed. However, as I describe 
on my blog, doing a single hash round per loop iteration using SSE may not 
provide a speed boost, since the processor's multiple execution units are not 
fully utilized.
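
To make that concrete, here is a sketch of my own (not the attached patch, and 
not compatible with the original XXH32 output) of processing two independent 
16-byte stripes per iteration, so the two dependency chains can overlap in the 
execution units. The constants are the xxHash32 primes and the missing vector 
rotate is emulated with shifts.

    #include <stddef.h>
    #include <smmintrin.h>  /* SSE4.1, for _mm_mullo_epi32 */

    /* Two independent accumulator vectors, one per 16-byte stripe of each
     * 32-byte block. The a-chain and b-chain have no data dependency on each
     * other, so an out-of-order core can execute them in parallel. */
    static void two_stripe_loop(const unsigned char *p, size_t nblocks32,
                                __m128i *accA, __m128i *accB)
    {
        const __m128i prime1 = _mm_set1_epi32((int)0x9E3779B1u);
        const __m128i prime2 = _mm_set1_epi32((int)0x85EBCA77u);
        __m128i a = *accA, b = *accB;
        size_t  i;

        for (i = 0; i < nblocks32; i++, p += 32) {
            __m128i dA = _mm_loadu_si128((const __m128i *)p);        /* first 16 bytes  */
            __m128i dB = _mm_loadu_si128((const __m128i *)(p + 16)); /* second 16 bytes */

            a = _mm_add_epi32(a, _mm_mullo_epi32(dA, prime2));
            b = _mm_add_epi32(b, _mm_mullo_epi32(dB, prime2));
            a = _mm_or_si128(_mm_slli_epi32(a, 13), _mm_srli_epi32(a, 19));  /* rotl 13 */
            b = _mm_or_si128(_mm_slli_epi32(b, 13), _mm_srli_epi32(b, 19));
            a = _mm_mullo_epi32(a, prime1);
            b = _mm_mullo_epi32(b, prime1);
        }
        *accA = a;
        *accB = b;
    }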

Original comment by moinakg on 26 Apr 2013 at 2:40

GoogleCodeExporter commented 9 years ago
I've placed my speed analysis here: 
http://encode.ru/threads/1538-Hash-Checksum-algorithm-considerations?p=32955&viewfull=1#post32955

although it may easily be wrong since I just started to learn SSE :)

Original comment by bulat.zi...@gmail.com on 26 Apr 2013 at 8:38

GoogleCodeExporter commented 9 years ago
I finally went through and tested my SSE 4.1 patch with smhasher. Sadly, it 
runs a lot slower on my Ivy Bridge processor. Using xxHash without any SSE 
support, I was getting approximately 2.204 bytes/cycle (6306.59 MiB/sec @ 3 
GHz, alignment 0). I determined that the runtime SSE check actually slows down 
performance considerably (1.191 bytes/cycle, 3407.67 MiB/sec @ 3 GHz). If I 
remove the check (always using SSE 4.1 instructions, not recommended for 
production use), it is faster, but still slower than the original code (1.978 
bytes/cycle, 5658.57 MiB/sec @ 3 GHz).

I also tested a 64-bit build and the results were nearly identical.

So it probably suffers from the issues that moinakg is talking about. Thus, for 
now, I don't think there is any reason to merge the SSE patch (at least for 
modern processors). I'm treating it as a proof-of-concept patch to show that it 
is indeed possible to vectorize the first stage.

Original comment by thomaska...@gmail.com on 26 Apr 2013 at 10:15

GoogleCodeExporter commented 9 years ago
Yes, the results seem comparable.

The only credible lead I've read so far is the possibility of loading the 2nd 
SSE register while making calculations on the first. I'm not even sure whether 
that's possible, or whether it has already been done, btw.

Original comment by yann.col...@gmail.com on 28 Apr 2013 at 8:41

GoogleCodeExporter commented 9 years ago
Yann, since modern CPUs execute instructions speculatively and out of order, 
the CPU by itself loads the next data while waiting for the previous 
computation.

Original comment by bulat.zi...@gmail.com on 10 Jul 2013 at 4:34

GoogleCodeExporter commented 9 years ago
Hi Bulat,

From a cache perspective, yes, the CPU starts to fill the next cache line 
before it gets requested if it detects a linear access pattern, which is quite 
common.

My comment was more related to the register level. If SSE registers are too 
scarce, then it's not possible to "load the next data segment while 
calculating", in contrast with the 32-bit version, which can achieve just that.

For more details, I would invite you to read the excellent analysis by Moinakg at:
http://moinakg.wordpress.com/2013/01/19/vectorizing-xxhash-for-fun-and-profit/

Original comment by yann.col...@gmail.com on 11 Jul 2013 at 2:31

GoogleCodeExporter commented 9 years ago
Yes, I also meant loading from the L1 cache. xxHash needs just 1 SSE register 
for the accumulators plus 2 SSE registers for constants, so you have a good 
reserve of them. You can even use a load-and-add instruction, and the CPU will 
start the load before it is ready to perform the addition, since loading and 
adding are two separate micro-ops.
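
A tiny illustration of that last point (the naming is mine, not from any 
patch): with an aligned pointer the compiler may fold the load into a 
memory-operand paddd, and the hardware still executes the load and the add as 
separate micro-ops, so the load can overlap earlier arithmetic.

    #include <emmintrin.h>  /* SSE2 */

    /* With a 16-byte aligned block the load below can be folded into the add
     * (paddd xmm, m128); the load and the add still execute as separate
     * micro-ops on the load and ALU ports. */
    static __m128i add_block(__m128i acc, const __m128i *block /* 16-byte aligned */)
    {
        return _mm_add_epi32(acc, _mm_load_si128(block));
    }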

BTW, I have added a new thread and a post to your old thread about fast 
hashing. Please take a look at encode.ru.

Original comment by bulat.zi...@gmail.com on 11 Jul 2013 at 3:06

GoogleCodeExporter commented 9 years ago
I've been testing this idea many times this year, with subtle variations, and 
always with the same result: the vectorized version is slower.

I believe the key points are:
- The absence of a vectorized rotl instruction. This can be emulated (see the 
sketch at the end of this comment), but it costs several operations and 
registers, nullifying the vectorization advantage.
- xxHash is very favorable to ILP (Instruction-Level Parallelism), boosting 
performance on modern OoO CPUs when using the traditional instruction set and 
reducing the gain to expect from a vector unit.

As a consequence, I believe this issue will be closed, at least as long as we 
don't have a decent rotl instruction available in the vector unit.
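
For reference, the kind of emulation meant in the first point, as a minimal 
sketch: a packed rotate-left by 13 is built from two shifts and an OR (plus a 
temporary register), whereas the scalar code needs only a single rol, as 
visible in the disassembly earlier in this thread.

    #include <emmintrin.h>  /* SSE2 */

    /* Emulated per-lane rotate-left by 13: SSE has no packed rotate,
     * so it takes two shifts and an OR. */
    static __m128i rotl32x4_by13(__m128i v)
    {
        return _mm_or_si128(_mm_slli_epi32(v, 13), _mm_srli_epi32(v, 19));
    }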

Original comment by yann.col...@gmail.com on 4 Dec 2013 at 1:46

GoogleCodeExporter commented 9 years ago
That is odd, since I always see a benefit, albeit a small one. Which CPU did 
you test on?

Original comment by moinakg on 4 Dec 2013 at 5:54

GoogleCodeExporter commented 9 years ago
Previously it was a Core 2 Duo,
but today, it is a Core i5-3340M (Ivy Bridge)

Original comment by yann.col...@gmail.com on 4 Dec 2013 at 6:17

GoogleCodeExporter commented 9 years ago
Note also:
I'm talking about a version that is fully compatible with the current code,
not a modified one that would be more "tuned" for SIMD.

Regards

Original comment by yann.col...@gmail.com on 4 Dec 2013 at 6:18

GoogleCodeExporter commented 9 years ago
The current approach interleaves blocks of data that are too small to show a 
SIMD benefit. A vector rotl might give just a small benefit.
I have experimented with this as well, and I believe there is no way to benefit 
from SIMD while retaining compatibility with the current hash values. This is 
why I have a new version (in Pcompress) for SIMD, with the non-SIMD variant 
updated as well to be compatible with the SIMD variant.
I have seen other parallel hashing approaches, like tree hashing, that are 
likewise not compatible with their regular versions.

Original comment by moinakg on 8 Dec 2013 at 3:15

GoogleCodeExporter commented 9 years ago
Yes, I know, your modified version is faster with vectorized instructions.

Unfortunately, since the algorithm is already deployed, it is mandatory for 
XXH32() to keep full compatibility with the existing version.

I was lately wondering whether the newest AVX instructions could introduce any 
new possibility to consider, which might change the conclusion compared to SSE.

Original comment by yann.col...@gmail.com on 8 Dec 2013 at 10:17

GoogleCodeExporter commented 9 years ago

Original comment by yann.col...@gmail.com on 3 Mar 2014 at 3:21