Using SSE intrinsics in LZ4HC

GoogleCodeExporter commented 8 years ago

While integrating LZ4 into my Pcompress utility 
(http://moinakg.github.com/pcompress/), I did some small experimentation using 
SSE intrinsics for the high compression version and got a slight speedup. I 
have attached the patch I was using to test via the bench and fuzzer utilities. 

I was doing all tests on my 2yr old Core i5 laptop. Results on 2 archives below.

SunStudio compiler archive - text (headers, english text, HTML docs) and ELF 
binaries.
Non SSE:
*** Compression CLI using LZ4 algorithm , by Yann Collet (Dec 11 2012) ***
sunstudio12u1-pa : 963685888 -> 297457076 (30.87%),   21.7 MB/s , 1107.3 MB/s

SSE:
*** Compression CLI using LZ4 algorithm , by Yann Collet (Dec 11 2012) ***
sunstudio12u1-pa : 963685888 -> 297457076 (30.87%),   22.8 MB/s , 1117.1 MB/s

Tarball of the entire Silesia corpus.
Non SSE:
*** Compression CLI using LZ4 algorithm , by Yann Collet (Dec 11 2012) ***
../cmp_test/sile : 206612480 ->  77219127 (37.37%),   14.7 MB/s ,  888.0 MB/s

SSE:
*** Compression CLI using LZ4 algorithm , by Yann Collet (Dec 11 2012) ***
../cmp_test/sile : 206612480 ->  77219127 (37.37%),   15.4 MB/s ,  898.7 MB/s

The patch uses unaligned SSE loads to keep things simple but still shows 
improvement. However the patch is missing platform detection which I do in my 
utility via a configure script.

Original issue reported on code.google.com by moinakg on 11 Dec 2012 at 5:47

Attachments:

lz4hc.c.diff

GoogleCodeExporter commented 8 years ago

Thanks, it's an interesting improvement. I'll look into it.

Original comment by yann.col...@gmail.com on 11 Dec 2012 at 10:14

GoogleCodeExporter commented 8 years ago

Enhancement request.

Original comment by yann.col...@gmail.com on 11 Dec 2012 at 2:44

Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 8 years ago

Hi i've been working on your proposed speed improvement lately.

I'm getting into a few difficulties, so let's try to describe them here :

- __builtin_ctz() is a GCC-only function (and even then, only if version is >= 
3.4).

- The following line is a bit cryptic to me : 
_mm_movemask_epi8(_mm_cmpeq_epi8(span1, span2))

Here is what i could find about it :
- _mm_movemask_epi8 : Creates a 16-bit mask from the most significant bits of 
the 16 signed or unsigned 8-bit integers in a and zero extends the upper bits 
- _mm_cmpeq_epi8 : Compares the 16 signed or unsigned 8-bit integers in a and 
the 16 signed or unsigned 8-bit integers in b for equality

So the first operation seems to compare the 128-bits fields byte-by-byte, and 
return either a 0xFF if equal, or a 0x00 if not.

MS documentation Pseudo code : 
r0 := (a0 == b0) ? 0xff : 0x0
r1 := (a1 == b1) ? 0xff : 0x0
(...)

Then the second operation grabs the upper bit of each byte comparison, which is 
1 if equal, or 0 is not equal.

Consequently, the number of common bytes is the number of 1 starting from the 
lowest bit.

But this should not work. There seems to be an inversion between 0 and 1, at 
least by judging from MS pseudo code. We should get a 0 if equal, and a 1 if 
not equal.

Original comment by yann.col...@gmail.com on 3 Jan 2013 at 3:55

GoogleCodeExporter commented 8 years ago

OK, i've made a few more tests.

I've created the function LZ4_NbCommonBits(), which should be the equivalent of 
__builtin_ctz(), but more portable. It is based on existing function 
LZ4_NbCommonBytes().

With this sole modification, i've run some tests with your proposed SSE 
modifications.

It doesn't work well. The compression ratio is negatively affected, and the 
speed is now 3 times slower. (it is easily possible to compare both versions by 
turning on/off the #define).

Original comment by yann.col...@gmail.com on 3 Jan 2013 at 4:23

GoogleCodeExporter commented 8 years ago

First off:
mask = _mm_movemask_epi8(_mm_cmpeq_epi8(span1, span2)) ^ 0xffff;

Generates a 16-bit mask with one bit set to 0 for each equal byte in the span1 
and span2 128bit XMM registers. This is done so that I could use 
__builtin_ctz() (bsf instruction).

These are the results of my tests using the silesia corpus tarball:
======== Non SSE
time ./pcompress -c lz4 -l3 -s200m silesia.tar 
Scaling to 1 thread
Checksum computed at 236.894 MB/s
Chunk compression speed 5.435 MB/s

real    0m14.569s
user    0m13.737s
sys 0m0.292s
ls -l silesia.tar.pz 
-rw-rw-r--. 1 moinakg moinakg 77149484 Jan  3 22:32 silesia.tar.pz

========= SSE
time ./pcompress -c lz4 -l3 -s200m silesia.tar 
Scaling to 1 thread
Checksum computed at 242.630 MB/s
Chunk compression speed 5.706 MB/s

real    0m13.895s
user    0m13.316s
sys 0m0.197s
ls -l silesia.tar.pz 
-rw-rw-r--. 1 moinakg moinakg 77149484 Jan  3 22:37 silesia.tar.pz

These are quick-n-dirty time measures printed based on milliseconds spent in 
the single iteration.
It is slightly faster and compression ratio is same. I am using a Core i5 M430, 
2.27 GHz. The results can be faster if aligned loads were possible. That is 
using _mm_load_si128() instead of _mm_loadu_si128(). However alignment cannot 
be guaranteed there. Good thing is unaligned SSE read/write penalty virtually 
disappears in Westmere2 and Sandy Bridge so it should be much faster there.

On older CPUs results will be mixed.

Which is the dataset you are testing with ? What CPU are you using ?
Also the SSE code is in addition to the while (ipt<matchlimit-(STEPSIZE-1)) {} 
loop.
It does not replace it.

Original comment by moinakg on 3 Jan 2013 at 5:32

GoogleCodeExporter commented 8 years ago

OK, after a few more tries and correction, i'm getting results consistent with 
your statements.

Gains are small, but definately there.

For example, using a Core i5 560 on silesia.tar, and compiling LZ4HC with 
VS2010,
speed increases from 16.5 MB/s to 16.7 MB/s

The improvement is more substantial when compiling with GCC v4.6.2, since speed 
increases from 17.1MB/s to 18.5 MB/s

I also checked enwik8 with GCC, and got an improvement from 18.2 MB/s to 18.7 
MB/s. 

So nothing earth-shattering, but definately an improvement.

Now, when it comes to integrating it to the main branches, i'm more concerned 
by the potential complexities it introduces.

Should the #define be automatic, or left to be triggered manually ? The second 
option looks less risky at this stage. To make the change automatic, it would 
be necessary to ensure that it works fine under all circumstances, which might 
be difficult to test.

There is also the issue with the Makefile.
It is necessary to enable SSE2 instructions to compile the intrinsic. 
The only way i could find is to add -msse2 to compiler options. 
Maybe there is a better method, since i'm not so sure if it is safe to let it 
there by default : what would be the behavior on a non-SSE2 CPU ?

Original comment by yann.col...@gmail.com on 4 Jan 2013 at 5:30

GoogleCodeExporter commented 8 years ago

Yes gains are small because your original code already parallelizes the check 
to 8 bytes. This one extends it to 16 bytes but incurs unaligned load penalty 
(which is there on Core i5). It will be interesting to test on Sandy Bridge and 
AMD Bulldozer. I happen to have a Bulldozer laptop at this time and will check 
the gains on that tomorrow.

If you use -msse2 on a non-sse cpu I think gcc will still generate sse2 code 
and the program will crash at runtime. I think it is best to make this manual 
at the moment. For the Makefile you can add a cmake rule.

Original comment by moinakg on 4 Jan 2013 at 6:38

GoogleCodeExporter commented 8 years ago

The results on the bulldozer laptop are similar:
Normal:
~/lz4-read-only $ ./lz4demo -b1 silesia.tar
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jan  5 2013) ***
silesia.tar      : 206612480 ->  77219127 (37.37%),   12.0 MB/s ,  948.2 MB/s

SSE:
~/lz4-read-only $ ./lz4demo -b1 silesia.tar               
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jan  5 2013) ***
silesia.tar      : 206612480 ->  77219127 (37.37%),   12.3 MB/s ,  947.8 MB/s  

Gains are small but there.

Original comment by moinakg on 5 Jan 2013 at 3:02

GoogleCodeExporter commented 8 years ago

Thanks for the confirmation test Moinak.

Original comment by yann.col...@gmail.com on 5 Jan 2013 at 5:27

GoogleCodeExporter commented 8 years ago

As far as I know the penalty for loading from unaligned adresses is really 
high. IMHO this should be integrated into lz4 but only if the format of the 
stream is changed. For instance for a special RLE compresser I wrote, I always 
write up to 8 2Byte-Headers, but only until the next 16Byte-alignment, before 
appending up to 8 blocks of data. In this case the data are RGB pixels which 
are merged with a reference image and expanded to ARGB (which about *4* times 
faster than using normal integers and also about 3 times faster than using 
unaligned SSE functions).
Oh and fyi: Thanks for developing the awesome lz4 lib :)

Original comment by leonard....@gmail.com on 4 Mar 2013 at 3:54

GoogleCodeExporter commented 8 years ago

Hi Leonard

Thanks for the tip.
The alignment performance impact greatly differ depending on CPU and 
instruction set.
For most "modern" CPU, such as Intel Core, loading unaligned data is mostly 
painless.
I however don't know if this statement remain true with regards to SSEx 
instructions. Considering your comment, it seems SSEx prefers aligned input by 
a wide margin.

Rgds

Original comment by yann.col...@gmail.com on 4 Mar 2013 at 4:03

GoogleCodeExporter commented 8 years ago

I know that the access to unaligned data has only a very little penalty, but 
that doesn't hold true for SSE instructions. Modern i5/i7 CPUs are a little 
faster in that regard, but in general unaligned SSE loads/writes are about 40% 
slower, if you access data within the cache boundary and up to 5 times slower 
if you don't. You can read more about it for instance here (I can personally 
confirm those numbers):
http://software.intel.com/en-us/articles/reducing-the-impact-of-misaligned-memor
y-accesses

P.S.: I forgot to say, that I'm using the "2Byte-Headers" as a cheap "padding" 
of the data-blocks, which are (of course) not always a multiple of 16 Bytes 
long.

Original comment by leonard....@gmail.com on 4 Mar 2013 at 8:32

GoogleCodeExporter commented 8 years ago

Thanks, that's very clear.

Now that you mention it, i wonder if it could make a difference regarding the 
results of Moinak during his xxHash vectorizing exploration :
http://moinakg.wordpress.com/2013/01/19/vectorizing-xxhash-for-fun-and-profit/

In substance, the SSE version of xxHash was found slower than the generic (ALU) 
one.

I would like to believe that loads were aligned, but just in case...

Original comment by yann.col...@gmail.com on 4 Mar 2013 at 11:07

GoogleCodeExporter commented 8 years ago

No those instructions are not aligned. You can easily recognize that, because 
instructions for unaligned access are usually in the form 
_mm_{cmd}u_{type}{width} (note the "u" after the command).
One doesn't need to run a benchmark to know that his first code sample will be 
much slower than the generic one. He is only using 4 of the 8 available 
registers (16! on a x86_64) and furthermore the loop is much to short for being 
effective on a modern CPU (what he calls "superscalar" or "Instruction Level 
Parallelism").
If you really never used SSE before, then this could be a little bit 
frustrating at the beginning, but in the end the result is often very rewarding.

Original comment by leonard....@gmail.com on 5 Mar 2013 at 12:17

GoogleCodeExporter commented 8 years ago

Interesting.
I'm interested in digging up the issue, since xxHash was created with SIMD 
instructions in mind. I would be glad to know if it does work well with such an 
instruction set.

Any place recommended around to learn how to use SSE instructions ?

Original comment by yann.col...@gmail.com on 5 Mar 2013 at 9:32

GoogleCodeExporter commented 8 years ago

Using SSE is mostly very straightforward: load from/write to aligned adresses 
if possible and use as many of the available xmm0-N registers as possible (I 
usually use the disassably output of my compiler to find how efficient the 
produced code is). The main problem you will have is that you now have to align 
your data and find the matching intrinsic/command.

http://download.intel.com/products/processor/manual/325462.pdf (14,7MB)
This is "Intel 64 and IA-32 Architectures Software Developer’s Manual 
Combined Volumes:1, 2A, 2B, 2C, 3A, 3B, and 3C".

In this pdf you will find *everything* you will ever need to know about SSE. On 
page 42 (2.2.7) you can get a general idea of the different types of SIMD 
instructions. Chapter 9 (MMX), 10 (SSE), 11 (SSE2) and 13 (SSE3, SSSE3 and 
SSE4) contain sub-chapters about "Data Types" and a detailed list of 
"Instructions". Furthermore starting with chapter 5.4 (page 114) you will also 
find a shorter list of all SIMD instructions.

Additionally you will find already a solution (and example) to most of your 
problems with SSE on http://stackoverflow.com/ (IMHO it's also the best place 
to learn about something as specific as SIMD instructions)

Original comment by leonard....@gmail.com on 5 Mar 2013 at 11:12

GoogleCodeExporter commented 8 years ago

Thanks !

Original comment by yann.col...@gmail.com on 5 Mar 2013 at 12:24

GoogleCodeExporter commented 8 years ago

This abstraction layer created and used for x264 encoder can do wonders in 
respect to SSE registers.

http://x264dev.multimedia.cx/archives/191

More asm material at http://www.agner.org/optimize/

Original comment by bruno...@gmail.com on 6 Mar 2013 at 7:07

GoogleCodeExporter commented 8 years ago

Many thanks Bruno, this looks an excellent find !

Original comment by yann.col...@gmail.com on 6 Mar 2013 at 7:54

GoogleCodeExporter commented 8 years ago

Although SSE instructions can provide some small benefits, 
those benefits seems not large enough for the time being to be worth 
maintaining into the "main" branch.

Conclusion may change in the future if someone finds a way to make SSE/AVX a 
better performer. In the meantime, I encourage the creation of forks to build, 
test and improve target-specific optimizations.

Original comment by yann.col...@gmail.com on 7 Apr 2014 at 10:33

Changed state: WontFix

Amrnasr / lz4

Using SSE intrinsics in LZ4HC #56