Optimize LZ4_NbCommonBytes usage

GoogleCodeExporter commented 8 years ago

Line 467 of lz4.c is:
while ((ip>anchor) && (ref > lowLimit) && unlikely(ip[-1]==ref[-1])) { ip--; 
ref--; }

After finding a 4-byte match, LZ4 checks whether the match actually starts 
earlier, moving the pointers back when it happens.
Later, before calling LZ4_NbCommonBytes on these pointers, it first skips the 
first 4 bytes, which are already checked.
This is a good optimization, but it could go further. When pointers really did 
go back 1 time at line 467, we have already checked 5 bytes instead of 4, and 
should skip them all.

So line 467 could be:
int back = 0;
while ((ip>anchor) && (ref > lowLimit) && unlikely(ip[-1]==ref[-1])) { ip--; 
ref--; back++; }

And later before calling LZ4_NbCommonBytes, we can skip some more bytes:
{ ip += back; ref += back; }

LZV uses this optimization with success. I failed to apply it to LZ4 (always 
get a broken stream), but with a better knowledge of the code base, it should 
be possible to use the same trick.

Original issue reported on code.google.com by valery.c...@gmail.com on 21 Jan 2014 at 8:52

GoogleCodeExporter commented 8 years ago

Hi Valery

I tried this trick in the past, but it resulted in lowered performance.
My assumption was that just "remembering" where the check ended costs at least 
one more register, which possibly could cause some concerns for the compiler 
optimizations.

Now, since you apparently where successful in applying this trick into your own 
algorithm, I shall have a look at it again ;)

Original comment by yann.col...@gmail.com on 21 Jan 2014 at 9:54

GoogleCodeExporter commented 8 years ago

I made a quick test, following your suggestion, and got an instant 5% 
compression speed boost. Definately a nice trick of yours to add for the next 
version.

I don't remember exactly what I did in my earlier failed attempts, I guess I've 
been tracking the pointer instead of the delta. Anyway, this time I've 
following more closely your suggestion, and it works.

Original comment by yann.col...@gmail.com on 21 Jan 2014 at 10:04

GoogleCodeExporter commented 8 years ago

Original comment by yann.col...@gmail.com on 21 Jan 2014 at 10:05

Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 8 years ago

Original comment by yann.col...@gmail.com on 21 Jan 2014 at 10:05

Changed state: Accepted
Added labels: Priority-Medium
Removed labels: Priority-minor

GoogleCodeExporter commented 8 years ago

I too see lack of registers as the only possible drawback, so this may work
with a simple algorithm but not with a more powerful one, and work in
64-bit but not in 32-bit mode.

Original comment by valery.c...@gmail.com on 21 Jan 2014 at 10:12

GoogleCodeExporter commented 8 years ago

Indeed, checked again, 
the boost is only valid in 64-bits mode.
32-bits speed remains surprisingly stable, no gain, no pain.

Original comment by yann.col...@gmail.com on 21 Jan 2014 at 10:15

GoogleCodeExporter commented 8 years ago

I made some new tests using a different platform (Core i5, instead of Core2Duo).
The results are different.
Using the "back" variable results in a measurable loss in 64-bits mode.
And, surprisingly, the 32-bits version gets a small improvement.
It's kinda reverse situation from the Core 2 Duo.

I also tested Visual, and although I did not made thorough tests, let's say 
preliminary results are not good : it's a small loss for both 32 & 64 bits.

Therefore the situation for this improvement is not so clear...

Original comment by yann.col...@gmail.com on 23 Jan 2014 at 7:01

GoogleCodeExporter commented 8 years ago

As a light note,
I should note that I noticed an reproducible decompression speed improvement 
while implementing the "back" trick, on both visual and gcc, most visibly on 
64-bits.
Which is obviously nonsense, since the modification only impact the source code 
of the compression function.

Well, on the mystery of compiler optimizations...

Original comment by yann.col...@gmail.com on 23 Jan 2014 at 7:24

GoogleCodeExporter commented 8 years ago

To avoid the segfault, valery, a third step is to correct the "anchor" so that 
the later "length" value becomes correct again.  One way to do it is in the 
attached diff.  I assume your second step wanted to bump up by (back+MINMATCH), 
eh? 

For my lz4-r116 variant, which implements 8 faster-than-default modes, some 
modes saw a speed decrease, while many modes saw 2-3% speed increase with the 
"back trick"
  For me in a compression mode similar to lz4-r116 (probably an important one!)
     lz4-r116-mine             330 MB/s    (BUT a few bytes larger compressed file)
     lz4-r116-mine+back trick  324 MB/s (1.8% worse, after trying many ways)
I was unable to remove the speed decrease for the highest-compression mode of 
lz4-mine, which approximates the compression of lz4-vanilla.

Attached "back-trick" patch on lz4 (algorithm should match svn version)
     lz4-r116-vanilla            328 MB/s
     lz4-r116-vanilla+back trick 292 MB/s (attached diff, 10% worse --- OUCH!)
Shoving length, forwardIp and findMatchAttempts into small local scopes got no 
improvement on lz4-vanilla.  I didn't try hard to make lz4-vanilla+back trick
get faster.

My tests were on an Intel core i7 (cpuid says Version 000106e5, model 14, 
stepping 5)

i) I wonder if the compression speed decrease Yann got for core i5 was 
similarly horrific, O(10%) ?
ii) Maybe there's an alternate back trick that works better for core i5/i7?
iii) Is there a nice way to introduce cpu-specific lz4 optimizations?  Maybe 
something like Agner Fog's init functions for a dispatch table to specific
optimized versions?  I can only think of sorta' ugly answers myself.  But if 
possible, could also be used for some of the SSE proposals for lz4hc.

I'm keeping my diffs around, just in case I need to use it projects where the 
CPU is well-known, but I'm not putting it into my production code for now.

Original comment by ejkr...@gmail.com on 7 Apr 2014 at 10:14

Attachments:

lz4-fwdMatch.diff

GoogleCodeExporter commented 8 years ago

Yes, back trick proved, on average, to be more detrimental than advantageous, 
although your mileage can vary depending on compiler and CPU type.

Anyway, since it's not a clear win, it's not exactly a good candidate.

> Is there a nice way to introduce cpu-specific lz4 optimizations?

I would say yes, but not for that kind of benefits.
I will certainly accept some system-specific trick which provides a 50% 
performance boost to a reasonably accessible target, but 2% ? no. Not even 5%. 
It's not worth the added cost of maintenance and complexity. It really needs to 
pay off.

Which by the way is exactly the kind of improvement Valery provide with his 
first patch, targeting GCC x86 32 bits specifically, and providing a massive 
performance boost for this target.

Original comment by yann.col...@gmail.com on 7 Apr 2014 at 10:27

GoogleCodeExporter commented 8 years ago

Original comment by yann.col...@gmail.com on 7 Apr 2014 at 10:28

Changed state: WontFix

Amrnasr / lz4

Optimize LZ4_NbCommonBytes usage #110