Closed GoogleCodeExporter closed 8 years ago
Hi Valery
I tried this trick in the past, but it resulted in lowered performance.
My assumption was that just "remembering" where the check ended costs at least
one more register, which possibly could cause some concerns for the compiler
optimizations.
Now, since you apparently where successful in applying this trick into your own
algorithm, I shall have a look at it again ;)
Original comment by yann.col...@gmail.com
on 21 Jan 2014 at 9:54
I made a quick test, following your suggestion, and got an instant 5%
compression speed boost. Definately a nice trick of yours to add for the next
version.
I don't remember exactly what I did in my earlier failed attempts, I guess I've
been tracking the pointer instead of the delta. Anyway, this time I've
following more closely your suggestion, and it works.
Original comment by yann.col...@gmail.com
on 21 Jan 2014 at 10:04
Original comment by yann.col...@gmail.com
on 21 Jan 2014 at 10:05
Original comment by yann.col...@gmail.com
on 21 Jan 2014 at 10:05
I too see lack of registers as the only possible drawback, so this may work
with a simple algorithm but not with a more powerful one, and work in
64-bit but not in 32-bit mode.
Original comment by valery.c...@gmail.com
on 21 Jan 2014 at 10:12
Indeed, checked again,
the boost is only valid in 64-bits mode.
32-bits speed remains surprisingly stable, no gain, no pain.
Original comment by yann.col...@gmail.com
on 21 Jan 2014 at 10:15
I made some new tests using a different platform (Core i5, instead of Core2Duo).
The results are different.
Using the "back" variable results in a measurable loss in 64-bits mode.
And, surprisingly, the 32-bits version gets a small improvement.
It's kinda reverse situation from the Core 2 Duo.
I also tested Visual, and although I did not made thorough tests, let's say
preliminary results are not good : it's a small loss for both 32 & 64 bits.
Therefore the situation for this improvement is not so clear...
Original comment by yann.col...@gmail.com
on 23 Jan 2014 at 7:01
As a light note,
I should note that I noticed an reproducible decompression speed improvement
while implementing the "back" trick, on both visual and gcc, most visibly on
64-bits.
Which is obviously nonsense, since the modification only impact the source code
of the compression function.
Well, on the mystery of compiler optimizations...
Original comment by yann.col...@gmail.com
on 23 Jan 2014 at 7:24
To avoid the segfault, valery, a third step is to correct the "anchor" so that
the later "length" value becomes correct again. One way to do it is in the
attached diff. I assume your second step wanted to bump up by (back+MINMATCH),
eh?
For my lz4-r116 variant, which implements 8 faster-than-default modes, some
modes saw a speed decrease, while many modes saw 2-3% speed increase with the
"back trick"
For me in a compression mode similar to lz4-r116 (probably an important one!)
lz4-r116-mine 330 MB/s (BUT a few bytes larger compressed file)
lz4-r116-mine+back trick 324 MB/s (1.8% worse, after trying many ways)
I was unable to remove the speed decrease for the highest-compression mode of
lz4-mine, which approximates the compression of lz4-vanilla.
Attached "back-trick" patch on lz4 (algorithm should match svn version)
lz4-r116-vanilla 328 MB/s
lz4-r116-vanilla+back trick 292 MB/s (attached diff, 10% worse --- OUCH!)
Shoving length, forwardIp and findMatchAttempts into small local scopes got no
improvement on lz4-vanilla. I didn't try hard to make lz4-vanilla+back trick
get faster.
My tests were on an Intel core i7 (cpuid says Version 000106e5, model 14,
stepping 5)
i) I wonder if the compression speed decrease Yann got for core i5 was
similarly horrific, O(10%) ?
ii) Maybe there's an alternate back trick that works better for core i5/i7?
iii) Is there a nice way to introduce cpu-specific lz4 optimizations? Maybe
something like Agner Fog's init functions for a dispatch table to specific
optimized versions? I can only think of sorta' ugly answers myself. But if
possible, could also be used for some of the SSE proposals for lz4hc.
I'm keeping my diffs around, just in case I need to use it projects where the
CPU is well-known, but I'm not putting it into my production code for now.
Original comment by ejkr...@gmail.com
on 7 Apr 2014 at 10:14
Attachments:
Yes, back trick proved, on average, to be more detrimental than advantageous,
although your mileage can vary depending on compiler and CPU type.
Anyway, since it's not a clear win, it's not exactly a good candidate.
> Is there a nice way to introduce cpu-specific lz4 optimizations?
I would say yes, but not for that kind of benefits.
I will certainly accept some system-specific trick which provides a 50%
performance boost to a reasonably accessible target, but 2% ? no. Not even 5%.
It's not worth the added cost of maintenance and complexity. It really needs to
pay off.
Which by the way is exactly the kind of improvement Valery provide with his
first patch, targeting GCC x86 32 bits specifically, and providing a massive
performance boost for this target.
Original comment by yann.col...@gmail.com
on 7 Apr 2014 at 10:27
Original comment by yann.col...@gmail.com
on 7 Apr 2014 at 10:28
Original issue reported on code.google.com by
valery.c...@gmail.com
on 21 Jan 2014 at 8:52