zhenjl closed this 11 years ago
This is basically the optimization I had hinted at in my comment at the top. The other one was breaking the append into two pieces: the part we have enough room for and the part we don't. We can do the bit we have room for first, and then go byte-by-byte for the rest.
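For illustration, that split could look roughly like the standalone helper below. The cpSplit name and the free-function form are made up for this sketch; only the dst/ref/length roles from the decoder are mirrored as parameters.

func cpSplit(dst []byte, ref, length uint32) []byte {
	avail := uint32(len(dst)) - ref // bytes of the match that already exist in dst
	if avail >= length {
		// enough room: the whole match can be appended in one go
		return append(dst, dst[ref:ref+length]...)
	}
	// bulk-copy the part we have room for first...
	dst = append(dst, dst[ref:ref+avail]...)
	// ...then go byte-by-byte for the overlapping remainder
	for ii := avail; ii < length; ii++ {
		dst = append(dst, dst[ref+ii])
	}
	return dst
}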
@dgryski yep that would work too. I took the lazy route. :)
Hrm, I actually get a performance loss with this commit:
BenchmarkLZ4Decode 500 6857347 ns/op
BenchmarkLZ4Decode 500 5422178 ns/op
This is a pretty dumb benchmark, just decompressing the copyright notice copied 100 times into a string.
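For context, a benchmark of that shape might look roughly like this; the import path, the Encode/Decode(dst, src []byte) ([]byte, error) signatures, and the exact input text are assumptions on my part, not taken from the thread.

package lz4_test

import (
	"strings"
	"testing"

	lz4 "github.com/bkaradzic/go-lz4" // assumed import path
)

// The copyright notice repeated 100 times; the real text doesn't matter
// for the shape of the benchmark.
var plain = []byte(strings.Repeat("Copyright (c) 2013 ...\n", 100))

func BenchmarkLZ4Decode(b *testing.B) {
	compressed, err := lz4.Encode(nil, plain)
	if err != nil {
		b.Fatal(err)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := lz4.Decode(nil, compressed); err != nil {
			b.Fatal(err)
		}
	}
}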
That might actually be because of the overhead of the size checking. I would not be surprised by that.
This patch gives a much nicer speedup, but I'm not sure how generally applicable it is. It replaces the per-byte calls to append() with a single append of a zero-filled slice of the required length, and then does the copy loop:
diff --git a/reader.go b/reader.go
index ec84519..36fadba 100644
--- a/reader.go
+++ b/reader.go
@@ -93,8 +93,11 @@ func (d *decoder) readUint16() (uint16, error) {
 func (d *decoder) cp(length, decr uint32) {
 	// can't use copy here, but could probably optimize the appends
 	if int(d.ref)+int(length)-len(d.dst) > 0 {
+		n := make([]byte, length)
+		l := uint32(len(d.dst))
+		d.dst = append(d.dst, n...)
 		for ii := uint32(0); ii < length; ii++ {
-			d.dst = append(d.dst, d.dst[d.ref+ii])
+			d.dst[l+ii] = d.dst[d.ref+ii]
 		}
 	} else {
 		d.dst = append(d.dst, d.dst[d.ref:d.ref+length]...)
For the record, this particular one seems to be by far the fastest on the artificial benchmark:
BenchmarkLZ4Decode 1000 2555156 ns/op
I'm going to play around with it on some real workloads and see what happens.
Indeed, the performance difference is much less when presented with 'real' data, in this case 10M of source code:
BenchmarkLZ4Decode 10 195941345 ns/op old
BenchmarkLZ4Decode 10 194035834 ns/op new
More weird results (1M of source code):
BenchmarkLZ4Decode 100 19878759 ns/op old
BenchmarkLZ4Decode 100 21183895 ns/op new
For this change (which I would have expected to be an improvement):
diff --git a/reader.go b/reader.go
index ec84519..f55b18e 100644
--- a/reader.go
+++ b/reader.go
@@ -105,15 +105,14 @@ func (d *decoder) cp(length, decr uint32) {
 func (d *decoder) consume(length uint32) error {
-	for ii := uint32(0); ii < length; ii++ {
-		by, err := d.readByte()
-		if err != nil {
-			return ErrCorrupt
-		}
-		d.dst = append(d.dst, by)
-		d.dpos++
+	if int(d.spos+length) > len(d.src) {
+		return ErrCorrupt
 	}
+	d.dst = append(d.dst, d.src[d.spos:d.spos+length]...)
+	d.dpos += length
+	d.spos += length
+
 	return nil
 }
This also explains why your change above was slower: it's not the bounds check, it must be the call to append() (even though it should be doing less work...).
Perhaps there are processor cache effects at work here.
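One way to check that outside the decoder would be a micro-benchmark along these lines. It is not part of the patch; it just isolates repeated append calls against a single pre-growing append plus indexed writes, which is the comparison being discussed.

package lz4_test

import "testing"

func BenchmarkAppendPerByte(b *testing.B) {
	const n = 64 << 10
	src := make([]byte, n)
	for i := 0; i < b.N; i++ {
		var dst []byte
		for j := 0; j < n; j++ {
			dst = append(dst, src[j]) // grows and bounds-checks on every byte
		}
	}
}

func BenchmarkAppendBulkThenIndex(b *testing.B) {
	const n = 64 << 10
	src := make([]byte, n)
	for i := 0; i < b.N; i++ {
		dst := append([]byte(nil), make([]byte, n)...) // grow once up front
		for j := 0; j < n; j++ {
			dst[j] = src[j] // then plain indexed stores
		}
	}
}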
I could be wrong, but I found that the "then" condition is only executed a few times during decoding. So optimizing that may not change the overall time too much if presented with a sufficiently large file.
Indeed, and we can get an improvement by moving the uncommon case into the else block so the processor correctly assumes it's unlikely to be taken. Looks like we get a ~2% speedup.
BenchmarkLZ4Decode 100 19638405 ns/op 'then' block is likely
BenchmarkLZ4Decode 100 20020073 ns/op 'else' block is likely
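As a standalone sketch, the reordering amounts to something like this; cpLikelyFirst is a made-up name, and the condition mirrors the one in the diff above, just inverted so the common case comes first.

func cpLikelyFirst(dst []byte, ref, length uint32) []byte {
	if int(ref)+int(length)-len(dst) <= 0 {
		// likely case: the whole match already exists in dst
		return append(dst, dst[ref:ref+length]...)
	}
	// unlikely case: overlapping match; grow once, then copy byte-by-byte
	l := uint32(len(dst))
	dst = append(dst, make([]byte, length)...)
	for ii := uint32(0); ii < length; ii++ {
		dst[l+ii] = dst[ref+ii]
	}
	return dst
}

(Treating the first branch as the likely one is an assumption based on the numbers above, not something the compiler guarantees.)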
Excellent!
Btw, in case you are interested, here are my pprofs for encode/decode
Encode
599 57.7% 57.7% 1037 99.9% main.Encode
237 22.8% 80.5% 237 22.8% main.(*encoder).readUint32
184 17.7% 98.3% 201 19.4% main.(*encoder).writeLiterals
17 1.6% 99.9% 17 1.6% runtime.memmove
Decode
183 28.6% 28.6% 638 99.8% main.Decode
153 23.9% 52.6% 277 43.3% main.(*decoder).cp
144 22.5% 75.1% 144 22.5% main.(*decoder).readUint16
124 19.4% 94.5% 124 19.4% runtime.memmove
34 5.3% 99.8% 34 5.3% main.(*decoder).consume
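These look like legacy go tool pprof text output (flat samples, flat %, running total %, then cumulative samples and %). In case it's useful, a profile like this can be collected from a main package roughly as follows; the file name and the decode loop are placeholders, not the author's actual harness.

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("decode.pprof") // output file name is just an example
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// ... call Decode (or Encode) in a loop here, then inspect the
	// profile with: go tool pprof <binary> decode.pprof
}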
I briefly looked at the hot spots and nothing jumped out at me as an easy fix. :)
AFAICT, cp() needs to append byte by byte because initially there may not be enough bytes in dst to cover the length requested. But once there are enough bytes in dst, you can just append a whole slice.
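A tiny example of that initial shortfall (values picked by me, not from the codec): with one byte in dst and a four-byte match referencing offset 0, the bytes being copied are produced by the copy itself, so a whole-slice append isn't possible up front.

package main

import "fmt"

func main() {
	dst := []byte{'a'}
	ref, length := uint32(0), uint32(4)
	for ii := uint32(0); ii < length; ii++ {
		dst = append(dst, dst[ref+ii]) // each append produces the next source byte
	}
	fmt.Printf("%s\n", dst) // "aaaaa": the match expands like run-length encoding
}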