Closed GoogleCodeExporter closed 8 years ago
Thanks, it's an interesting improvement. I'll look into it.
Original comment by yann.col...@gmail.com
on 11 Dec 2012 at 10:14
Enhancement request.
Original comment by yann.col...@gmail.com
on 11 Dec 2012 at 2:44
Hi i've been working on your proposed speed improvement lately.
I'm getting into a few difficulties, so let's try to describe them here :
- __builtin_ctz() is a GCC-only function (and even then, only if version is >=
3.4).
- The following line is a bit cryptic to me :
_mm_movemask_epi8(_mm_cmpeq_epi8(span1, span2))
Here is what i could find about it :
- _mm_movemask_epi8 : Creates a 16-bit mask from the most significant bits of
the 16 signed or unsigned 8-bit integers in a and zero extends the upper bits
- _mm_cmpeq_epi8 : Compares the 16 signed or unsigned 8-bit integers in a and
the 16 signed or unsigned 8-bit integers in b for equality
So the first operation seems to compare the 128-bits fields byte-by-byte, and
return either a 0xFF if equal, or a 0x00 if not.
MS documentation Pseudo code :
r0 := (a0 == b0) ? 0xff : 0x0
r1 := (a1 == b1) ? 0xff : 0x0
(...)
Then the second operation grabs the upper bit of each byte comparison, which is
1 if equal, or 0 is not equal.
Consequently, the number of common bytes is the number of 1 starting from the
lowest bit.
But this should not work. There seems to be an inversion between 0 and 1, at
least by judging from MS pseudo code. We should get a 0 if equal, and a 1 if
not equal.
Original comment by yann.col...@gmail.com
on 3 Jan 2013 at 3:55
OK, i've made a few more tests.
I've created the function LZ4_NbCommonBits(), which should be the equivalent of
__builtin_ctz(), but more portable. It is based on existing function
LZ4_NbCommonBytes().
With this sole modification, i've run some tests with your proposed SSE
modifications.
It doesn't work well. The compression ratio is negatively affected, and the
speed is now 3 times slower. (it is easily possible to compare both versions by
turning on/off the #define).
Original comment by yann.col...@gmail.com
on 3 Jan 2013 at 4:23
First off:
mask = _mm_movemask_epi8(_mm_cmpeq_epi8(span1, span2)) ^ 0xffff;
Generates a 16-bit mask with one bit set to 0 for each equal byte in the span1
and span2 128bit XMM registers. This is done so that I could use
__builtin_ctz() (bsf instruction).
These are the results of my tests using the silesia corpus tarball:
======== Non SSE
time ./pcompress -c lz4 -l3 -s200m silesia.tar
Scaling to 1 thread
Checksum computed at 236.894 MB/s
Chunk compression speed 5.435 MB/s
real 0m14.569s
user 0m13.737s
sys 0m0.292s
ls -l silesia.tar.pz
-rw-rw-r--. 1 moinakg moinakg 77149484 Jan 3 22:32 silesia.tar.pz
========= SSE
time ./pcompress -c lz4 -l3 -s200m silesia.tar
Scaling to 1 thread
Checksum computed at 242.630 MB/s
Chunk compression speed 5.706 MB/s
real 0m13.895s
user 0m13.316s
sys 0m0.197s
ls -l silesia.tar.pz
-rw-rw-r--. 1 moinakg moinakg 77149484 Jan 3 22:37 silesia.tar.pz
These are quick-n-dirty time measures printed based on milliseconds spent in
the single iteration.
It is slightly faster and compression ratio is same. I am using a Core i5 M430,
2.27 GHz. The results can be faster if aligned loads were possible. That is
using _mm_load_si128() instead of _mm_loadu_si128(). However alignment cannot
be guaranteed there. Good thing is unaligned SSE read/write penalty virtually
disappears in Westmere2 and Sandy Bridge so it should be much faster there.
On older CPUs results will be mixed.
Which is the dataset you are testing with ? What CPU are you using ?
Also the SSE code is in addition to the while (ipt<matchlimit-(STEPSIZE-1)) {}
loop.
It does not replace it.
Original comment by moinakg
on 3 Jan 2013 at 5:32
OK, after a few more tries and correction, i'm getting results consistent with
your statements.
Gains are small, but definately there.
For example, using a Core i5 560 on silesia.tar, and compiling LZ4HC with
VS2010,
speed increases from 16.5 MB/s to 16.7 MB/s
The improvement is more substantial when compiling with GCC v4.6.2, since speed
increases from 17.1MB/s to 18.5 MB/s
I also checked enwik8 with GCC, and got an improvement from 18.2 MB/s to 18.7
MB/s.
So nothing earth-shattering, but definately an improvement.
Now, when it comes to integrating it to the main branches, i'm more concerned
by the potential complexities it introduces.
Should the #define be automatic, or left to be triggered manually ? The second
option looks less risky at this stage. To make the change automatic, it would
be necessary to ensure that it works fine under all circumstances, which might
be difficult to test.
There is also the issue with the Makefile.
It is necessary to enable SSE2 instructions to compile the intrinsic.
The only way i could find is to add -msse2 to compiler options.
Maybe there is a better method, since i'm not so sure if it is safe to let it
there by default : what would be the behavior on a non-SSE2 CPU ?
Original comment by yann.col...@gmail.com
on 4 Jan 2013 at 5:30
Yes gains are small because your original code already parallelizes the check
to 8 bytes. This one extends it to 16 bytes but incurs unaligned load penalty
(which is there on Core i5). It will be interesting to test on Sandy Bridge and
AMD Bulldozer. I happen to have a Bulldozer laptop at this time and will check
the gains on that tomorrow.
If you use -msse2 on a non-sse cpu I think gcc will still generate sse2 code
and the program will crash at runtime. I think it is best to make this manual
at the moment. For the Makefile you can add a cmake rule.
Original comment by moinakg
on 4 Jan 2013 at 6:38
The results on the bulldozer laptop are similar:
Normal:
~/lz4-read-only $ ./lz4demo -b1 silesia.tar
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jan 5 2013) ***
silesia.tar : 206612480 -> 77219127 (37.37%), 12.0 MB/s , 948.2 MB/s
SSE:
~/lz4-read-only $ ./lz4demo -b1 silesia.tar
*** Compression CLI using LZ4 algorithm , by Yann Collet (Jan 5 2013) ***
silesia.tar : 206612480 -> 77219127 (37.37%), 12.3 MB/s , 947.8 MB/s
Gains are small but there.
Original comment by moinakg
on 5 Jan 2013 at 3:02
Thanks for the confirmation test Moinak.
Original comment by yann.col...@gmail.com
on 5 Jan 2013 at 5:27
As far as I know the penalty for loading from unaligned adresses is really
high. IMHO this should be integrated into lz4 but only if the format of the
stream is changed. For instance for a special RLE compresser I wrote, I always
write up to 8 2Byte-Headers, but only until the next 16Byte-alignment, before
appending up to 8 blocks of data. In this case the data are RGB pixels which
are merged with a reference image and expanded to ARGB (which about *4* times
faster than using normal integers and also about 3 times faster than using
unaligned SSE functions).
Oh and fyi: Thanks for developing the awesome lz4 lib :)
Original comment by leonard....@gmail.com
on 4 Mar 2013 at 3:54
Hi Leonard
Thanks for the tip.
The alignment performance impact greatly differ depending on CPU and
instruction set.
For most "modern" CPU, such as Intel Core, loading unaligned data is mostly
painless.
I however don't know if this statement remain true with regards to SSEx
instructions. Considering your comment, it seems SSEx prefers aligned input by
a wide margin.
Rgds
Original comment by yann.col...@gmail.com
on 4 Mar 2013 at 4:03
I know that the access to unaligned data has only a very little penalty, but
that doesn't hold true for SSE instructions. Modern i5/i7 CPUs are a little
faster in that regard, but in general unaligned SSE loads/writes are about 40%
slower, if you access data within the cache boundary and up to 5 times slower
if you don't. You can read more about it for instance here (I can personally
confirm those numbers):
http://software.intel.com/en-us/articles/reducing-the-impact-of-misaligned-memor
y-accesses
P.S.: I forgot to say, that I'm using the "2Byte-Headers" as a cheap "padding"
of the data-blocks, which are (of course) not always a multiple of 16 Bytes
long.
Original comment by leonard....@gmail.com
on 4 Mar 2013 at 8:32
Thanks, that's very clear.
Now that you mention it, i wonder if it could make a difference regarding the
results of Moinak during his xxHash vectorizing exploration :
http://moinakg.wordpress.com/2013/01/19/vectorizing-xxhash-for-fun-and-profit/
In substance, the SSE version of xxHash was found slower than the generic (ALU)
one.
I would like to believe that loads were aligned, but just in case...
Original comment by yann.col...@gmail.com
on 4 Mar 2013 at 11:07
No those instructions are not aligned. You can easily recognize that, because
instructions for unaligned access are usually in the form
_mm_{cmd}u_{type}{width} (note the "u" after the command).
One doesn't need to run a benchmark to know that his first code sample will be
much slower than the generic one. He is only using 4 of the 8 available
registers (16! on a x86_64) and furthermore the loop is much to short for being
effective on a modern CPU (what he calls "superscalar" or "Instruction Level
Parallelism").
If you really never used SSE before, then this could be a little bit
frustrating at the beginning, but in the end the result is often very rewarding.
Original comment by leonard....@gmail.com
on 5 Mar 2013 at 12:17
Interesting.
I'm interested in digging up the issue, since xxHash was created with SIMD
instructions in mind. I would be glad to know if it does work well with such an
instruction set.
Any place recommended around to learn how to use SSE instructions ?
Original comment by yann.col...@gmail.com
on 5 Mar 2013 at 9:32
Using SSE is mostly very straightforward: load from/write to aligned adresses
if possible and use as many of the available xmm0-N registers as possible (I
usually use the disassably output of my compiler to find how efficient the
produced code is). The main problem you will have is that you now have to align
your data and find the matching intrinsic/command.
http://download.intel.com/products/processor/manual/325462.pdf (14,7MB)
This is "Intel 64 and IA-32 Architectures Software Developer’s Manual
Combined Volumes:1, 2A, 2B, 2C, 3A, 3B, and 3C".
In this pdf you will find *everything* you will ever need to know about SSE. On
page 42 (2.2.7) you can get a general idea of the different types of SIMD
instructions. Chapter 9 (MMX), 10 (SSE), 11 (SSE2) and 13 (SSE3, SSSE3 and
SSE4) contain sub-chapters about "Data Types" and a detailed list of
"Instructions". Furthermore starting with chapter 5.4 (page 114) you will also
find a shorter list of all SIMD instructions.
Additionally you will find already a solution (and example) to most of your
problems with SSE on http://stackoverflow.com/ (IMHO it's also the best place
to learn about something as specific as SIMD instructions)
Original comment by leonard....@gmail.com
on 5 Mar 2013 at 11:12
Thanks !
Original comment by yann.col...@gmail.com
on 5 Mar 2013 at 12:24
This abstraction layer created and used for x264 encoder can do wonders in
respect to SSE registers.
http://x264dev.multimedia.cx/archives/191
More asm material at http://www.agner.org/optimize/
Original comment by bruno...@gmail.com
on 6 Mar 2013 at 7:07
Many thanks Bruno, this looks an excellent find !
Original comment by yann.col...@gmail.com
on 6 Mar 2013 at 7:54
Although SSE instructions can provide some small benefits,
those benefits seems not large enough for the time being to be worth
maintaining into the "main" branch.
Conclusion may change in the future if someone finds a way to make SSE/AVX a
better performer. In the meantime, I encourage the creation of forks to build,
test and improve target-specific optimizations.
Original comment by yann.col...@gmail.com
on 7 Apr 2014 at 10:33
Original issue reported on code.google.com by
moinakg
on 11 Dec 2012 at 5:47Attachments: