Closed: dgryski closed this issue 11 years ago
Thanks for the report. I reproduced this bug and spent some time today working on it (it's data that can't be compressed that triggers it), but unfortunately I have no ETA for when it will be fixed. :(
I'm not sure that incompressible data is the problem. The test case I was using that triggered it for me was 100 copies of the GPL in a single file, about 3.4M. That's just text and should compress fine.
I'll take a look at this tomorrow and maybe send a pull request if I get anywhere.
I was compressing some video, which is why I thought it was incompressible. But, for example, the example executable (./lz4 lz4 test.lz4), even though it's ~2MB, compresses fine. This part here: https://github.com/bkaradzic/go-lz4/blob/master/writer.go#L198 should handle the case where data is going to overflow the internal buffer.
This is the size of the internal buffer, but there appear to be no checks to make sure we don't overrun it when we load data in, in writer.go's encoder.cache(). I haven't checked the reader, but it seems likely the same problem will occur in flush() if we try to write out a larger section than what we've read in.
For me it was blowing up in the for { err = e.cache(1); ... } loop just below.
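To make the hazard concrete, here's a minimal sketch of the guard that seems to be missing in encoder.cache(). Only the names encoder, buf, and cache come from the thread; the struct layout, the reader field, and the error handling here are assumptions, not the repo's actual code:

```go
package lz4

import (
	"fmt"
	"io"
)

// Assumed shape of the encoder state; not the actual struct.
type encoder struct {
	src io.Reader
	buf []byte // fixed-size internal window
	end int    // number of bytes currently buffered
}

// cache tries to pull n more bytes from the source into the internal
// buffer. The unpatched code is assumed to skip a bounds check like
// this one and write past the end of buf when a long match keeps
// asking for more input.
func (e *encoder) cache(n int) error {
	if e.end+n > len(e.buf) {
		return fmt.Errorf("lz4: internal buffer full (%d+%d > %d)",
			e.end, n, len(e.buf))
	}
	read, err := io.ReadFull(e.src, e.buf[e.end:e.end+n])
	e.end += read
	return err
}
```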
I've looked into it a bit more, and I think it's a problem with the buffer size in general, combined with highly repetitive data. In this case, I think the multiple copies of the GPL match too well and the encoder overflows the buffer while looking for where the match ends. (Just a guess.)
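For reference, a repro along these lines, written against the block-style Encode/Decode API discussed below (the exact Decode signature is an assumption):

```go
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"

	lz4 "github.com/bkaradzic/go-lz4"
)

func main() {
	// 100 concatenated copies of the GPL: long, repeated matches are
	// what seemed to push the encoder past its internal buffer.
	gpl, err := ioutil.ReadFile("COPYING")
	if err != nil {
		panic(err)
	}
	src := bytes.Repeat(gpl, 100)

	compressed, err := lz4.Encode(nil, src)
	if err != nil {
		panic(err)
	}
	decompressed, err := lz4.Decode(nil, compressed)
	if err != nil {
		panic(err)
	}
	fmt.Println("round trip ok:", bytes.Equal(src, decompressed))
}
```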
It did turn out to be the buffering issues. I have a patch set that removes all the buffering and changes the API to be more like the snappy-go one (that is, Encode(dst, src []byte) instead of an io.Reader/io.Writer), and the crashes on both encoding and decoding went away. It's a fairly intrusive patch, as I made no effort to keep the streaming API at all. I can submit a pull request if you want to take a look, but I think the two problems you'll need to solve in your tree are: continuing to have a match and overflowing the buffer (the err = e.cache(1) line, above), and, on decoding, decode.cp(), where you can potentially overflow by copying past the end of the buffer.

It turns out that 100 copies of the GPL concatenated together is a good test for this, because encode() keeps having bytes that match and just wants to slurp in as much as possible. On the decoding side, the potential problem is having to copy multiple GPLs out of the already-decoded output, so it's a pointer/length combination where the length overflows d.buf.
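To make the decode side concrete, here's roughly the shape of the hazard. This is a sketch: only the names d.buf, dpos, and cp come from the thread, and the struct layout, signature, and growth strategy are my assumptions, not the repo's actual code:

```go
package lz4

// Assumed shape of the decoder state; not the actual struct.
type decoder struct {
	buf  []byte // decoded output so far
	dpos int    // write position in buf
}

// cp copies length bytes starting dist bytes back in the already
// decoded output (assuming well-formed input, so dist <= d.dpos).
// With a fixed-size buffer, nothing in the unpatched code appears to
// stop dpos+length from running past len(buf), which is exactly the
// pointer/length overflow described above.
func (d *decoder) cp(dist, length int) {
	start := d.dpos - dist
	// Grow the output instead of writing past the end of buf.
	if need := d.dpos + length - len(d.buf); need > 0 {
		d.buf = append(d.buf, make([]byte, need)...)
	}
	// Byte-at-a-time copy: source and destination may overlap when
	// dist < length; that overlap is how LZ4 encodes long runs.
	for i := 0; i < length; i++ {
		d.buf[d.dpos+i] = d.buf[start+i]
	}
	d.dpos += length
}
```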
Yeah, you can submit the pull request; I'm not worried about breaking the streaming part. LZ4 now has a streaming spec, which didn't exist back when I worked on this (and my version doesn't really work anyway :).
Filed pull request #4 to fix the buffering issues.
Thanks! :)
Closing, as my pull request handles all the problems I had identified.