bkaradzic / go-lz4

Port of LZ4 lossless compression algorithm to Go

optimized cp(), 2x performance increase for large files #5

Closed: zhenjl closed this 10 years ago

zhenjl commented 10 years ago

AFAICT, cp() needs to append byte by byte because initially there may not be enough bytes in dst to cover the requested length. But once there are enough bytes in dst, you can just append a whole slice.
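
A sketch of what this looks like inside cp() (matching the context lines of the diff further down in this thread; d.dst, d.ref, and length are the fields/arguments from reader.go):

// LZ4 match copies may overlap the output being produced: the match
// starts at d.ref and can extend past the current end of d.dst, so a
// byte appended now can be the source of a later iteration.
if int(d.ref)+int(length) > len(d.dst) {
	// overlapping match: materialize the bytes one at a time
	for ii := uint32(0); ii < length; ii++ {
		d.dst = append(d.dst, d.dst[d.ref+ii])
	}
} else {
	// non-overlapping match: every source byte already exists
	d.dst = append(d.dst, d.dst[d.ref:d.ref+length]...)
}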

dgryski commented 10 years ago

This is basically the optimization I had hinted at in my comment at the top. The other one was breaking the append into two pieces: the part we already have room for and the part we don't. We can do the part we have room for in a single append first, and then go byte-by-byte for the rest.
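
A rough sketch of that split (illustrative only, not code from the repo; it assumes a valid match offset, i.e. d.ref < len(d.dst)):

// Copy the part of the match that already exists with one append,
// then fall back to byte-by-byte for the overlapping tail.
avail := uint32(len(d.dst)) - d.ref // match bytes that already exist
if avail > length {
	avail = length
}
d.dst = append(d.dst, d.dst[d.ref:d.ref+avail]...)
for ii := avail; ii < length; ii++ {
	// each append makes the next source byte available
	d.dst = append(d.dst, d.dst[d.ref+ii])
}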

zhenjl commented 10 years ago

@dgryski yep that would work too. I took the lazy route. :)

dgryski commented 10 years ago

Hrm, I actually get a performance loss with this commit:

BenchmarkLZ4Decode       500       6857347 ns/op
BenchmarkLZ4Decode       500       5422178 ns/op

This is a pretty dumb benchmark, just decompressing the copyright notice copied 100 times into a string.
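
That benchmark is roughly of this shape (a reconstruction, not the actual test file; the notice text is a placeholder and the Encode/Decode signatures are assumed, not confirmed from this repo):

package lz4

import (
	"bytes"
	"testing"
)

// plain is a small text repeated 100 times; compressed is its LZ4 form.
var plain = bytes.Repeat([]byte("a short copyright notice\n"), 100)
var compressed, _ = Encode(nil, plain)

func BenchmarkLZ4Decode(b *testing.B) {
	for i := 0; i < b.N; i++ {
		if _, err := Decode(nil, compressed); err != nil {
			b.Fatal(err)
		}
	}
}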

zhenjl commented 10 years ago

That might actually be the case because of the overhead of the size checking. I would not be surprised by that.

dgryski commented 10 years ago

This patch gives a much nicer speedup, but I'm not sure how generally applicable it is. It replaces all the calls to append() with a single append of a zero-filled slice of the right length, and then does the copy loop:

diff --git a/reader.go b/reader.go
index ec84519..36fadba 100644
--- a/reader.go
+++ b/reader.go
@@ -93,8 +93,11 @@ func (d *decoder) readUint16() (uint16, error) {
 func (d *decoder) cp(length, decr uint32) {
        // can't use copy here, but could probably optimize the appends
        if int(d.ref)+int(length)-len(d.dst) > 0 {
+               n := make([]byte, length)
+               l := uint32(len(d.dst))
+               d.dst = append(d.dst, n...)
                for ii := uint32(0); ii < length; ii++ {
-                       d.dst = append(d.dst, d.dst[d.ref+ii])
+                       d.dst[l+ii] = d.dst[d.ref+ii]
                }
        } else {
                d.dst = append(d.dst, d.dst[d.ref:d.ref+length]...)
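
For reference, the patched branch reconstructed as a whole function (the hunk above is truncated, so the tail of cp(), including how decr is used, is elided here):

func (d *decoder) cp(length, decr uint32) {
	if int(d.ref)+int(length)-len(d.dst) > 0 {
		// grow dst once with a zero-filled slice, then fill it in place
		n := make([]byte, length)
		l := uint32(len(d.dst))
		d.dst = append(d.dst, n...)
		for ii := uint32(0); ii < length; ii++ {
			d.dst[l+ii] = d.dst[d.ref+ii]
		}
	} else {
		d.dst = append(d.dst, d.dst[d.ref:d.ref+length]...)
	}
	// ... rest of cp() unchanged (not shown in the diff) ...
}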

For the record, this particular one seems to be by far the fastest on the artificial benchmark:

BenchmarkLZ4Decode      1000       2555156 ns/op

I'm going to play around with it on some real workloads and see what happens.

dgryski commented 10 years ago

Indeed, the performance difference is much less when presented with 'real' data, in this case 10M of source code:

BenchmarkLZ4Decode        10     195941345 ns/op     old
BenchmarkLZ4Decode        10     194035834 ns/op     new

dgryski commented 10 years ago

More weird results (1M of source code):

BenchmarkLZ4Decode       100      19878759 ns/op   old
BenchmarkLZ4Decode       100      21183895 ns/op   new

For this change (which I would have assumed to be an improvement):

diff --git a/reader.go b/reader.go
index ec84519..f55b18e 100644
--- a/reader.go
+++ b/reader.go
@@ -105,15 +105,14 @@ func (d *decoder) cp(length, decr uint32) {

 func (d *decoder) consume(length uint32) error {

-       for ii := uint32(0); ii < length; ii++ {
-               by, err := d.readByte()
-               if err != nil {
-                       return ErrCorrupt
-               }
-               d.dst = append(d.dst, by)
-               d.dpos++
+       if int(d.spos+length) > len(d.src) {
+               return ErrCorrupt
        }

+       d.dst = append(d.dst, d.src[d.spos:d.spos+length]...)
+       d.dpos += length
+       d.spos += length
+
        return nil
 }

This also explains why your change above was slower. It's not the bounds check; it must be the call to append() (even though it should be doing less work...).

Perhaps there are processor cache effects at work here.

zhenjl commented 10 years ago

I could be wrong, but I found that the "then" branch is only executed a few times during decoding. So optimizing that may not change the overall time much for a sufficiently large file.

dgryski commented 10 years ago

Indeed, and we can get an improvement by moving the uncommon case to the else block, so the processor correctly assumes it's unlikely to be taken. Looks like we get a ~2% speedup.

BenchmarkLZ4Decode       100      19638405 ns/op    'then' block is likely
BenchmarkLZ4Decode       100      20020073 ns/op    'else' block is likely
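
A sketch of that reordering (the test is inverted so the common, non-overlapping case sits in the then block; illustrative only):

if int(d.ref)+int(length) <= len(d.dst) {
	// common case: the whole match already exists in d.dst
	d.dst = append(d.dst, d.dst[d.ref:d.ref+length]...)
} else {
	// rare case: overlapping match, handled byte by byte
	for ii := uint32(0); ii < length; ii++ {
		d.dst = append(d.dst, d.dst[d.ref+ii])
	}
}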

zhenjl commented 10 years ago

Excellent!

Btw, in case you're interested, here are my pprofs for encode/decode (columns: flat samples, flat%, sum%, cum samples, cum%):

Encode

     599  57.7%  57.7%     1037  99.9% main.Encode
     237  22.8%  80.5%      237  22.8% main.(*encoder).readUint32
     184  17.7%  98.3%      201  19.4% main.(*encoder).writeLiterals
      17   1.6%  99.9%       17   1.6% runtime.memmove

Decode

     183  28.6%  28.6%      638  99.8% main.Decode
     153  23.9%  52.6%      277  43.3% main.(*decoder).cp
     144  22.5%  75.1%      144  22.5% main.(*decoder).readUint16
     124  19.4%  94.5%      124  19.4% runtime.memmove
      34   5.3%  99.8%       34   5.3% main.(*decoder).consume

I briefly looked at the hot spots and nothing jumped out at me as an easy fix. :)
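
For completeness, a minimal sketch of collecting such a CPU profile with runtime/pprof (the actual encode/decode work is a placeholder):

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// ... run Encode/Decode over the test data here ...
}

The tables above then come from something like: go tool pprof --text <binary> cpu.pprof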