bkaradzic / go-lz4

Port of LZ4 lossless compression algorithm to Go
BSD 2-Clause "Simplified" License

Can't compress files larger than 2^17 bytes #3

Closed dgryski closed 10 years ago

dgryski commented 10 years ago

This is the size of the internal buffer, but there appear to be no checks to make sure we don't overrun it when we load data in writer.go's encoder.cache(). I haven't checked the reader, but it seems likely the same problem will occur in flush() if we try to write out a larger section than what we've read in.
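For concreteness, here is a minimal sketch of the kind of bounds check that appears to be missing when data is loaded into the internal buffer. This is not the actual writer.go code; the encoder fields and the error value are assumptions made for illustration.

```go
package lz4sketch

import (
	"errors"
	"io"
)

// Hypothetical encoder shape, for illustration only; the real writer.go differs.
type encoder struct {
	src io.Reader
	buf [1 << 17]byte // the 2^17-byte internal buffer described above
	end int           // number of valid bytes currently in buf
}

var errBufferFull = errors.New("lz4: internal buffer full")

// cache reads up to n more bytes into buf, refusing to overrun it.
// The bounds check is the kind of guard the report says is missing.
func (e *encoder) cache(n int) error {
	if e.end+n > len(e.buf) {
		return errBufferFull
	}
	read, err := e.src.Read(e.buf[e.end : e.end+n])
	e.end += read
	return err
}
```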

bkaradzic commented 10 years ago

Thanks for the report. I reproduced this bug and spent some time today working on it (it's data that can't be compressed that triggers this bug), but unfortunately I have no ETA for when it will be fixed. :(

dgryski commented 10 years ago

I'm not sure that incompressible data is the problem. The test case that triggered it for me was 100 copies of the GPL in a single file, about 3.4 MB. That's just text and should compress fine.
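For anyone reproducing this, a small helper along these lines builds that test input. The filenames are illustrative, and gpl.txt is assumed to hold one copy of the license text.

```go
package main

import (
	"bytes"
	"log"
	"os"
)

func main() {
	// Read one copy of the GPL text and write 100 concatenated copies,
	// producing roughly the 3.4 MB input described above.
	gpl, err := os.ReadFile("gpl.txt")
	if err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("gpl-x100.txt", bytes.Repeat(gpl, 100), 0o644); err != nil {
		log.Fatal(err)
	}
}
```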

I'll take a look at this tomorrow and maybe send a pull request if I get anywhere.

bkaradzic commented 10 years ago

I was compressing some video, which is why I thought it was incompressible data. But, for example, the example executable (./lz4 lz4 test.lz4), even though it's ~2 MB, compresses fine. This part here: https://github.com/bkaradzic/go-lz4/blob/master/writer.go#L198 should handle the case where the data is going to overflow the internal buffer.

dgryski commented 10 years ago

For me it was blowing up in the for { err = e.cache(1); ... } loop just below.

dgryski commented 10 years ago

I've looked into it a bit more, and I think it's a problem with the buffer size in general combined with the highly repetitive data. In this case, I think the multiple copies of the GPL match too well, and the encoder overflows while looking for where the match ends. (Just a guess.)

dgryski commented 10 years ago

It did turn out to be the buffering issues. I have a patch set that removes all the buffering and changes the API to be more like the snappy-go one (that is, Encode(dst, src []byte) instead of an io.Reader/io.Writer), and the crashes on both encoding and decoding went away. It's a pretty intrusive patch, as I didn't make any effort to keep the streaming API at all. I can submit a pull request if you want to take a look, but I think the two problems you'll need to solve in your tree are: continuing to have a match and overflowing the buffer (the err = e.cache(1) line above), and, on decoding, potentially overflowing in decode.cp() when copying past the end of the buffer.

It turns out that 100 copies of the GPL concatenated together is a good test for this, because encode() keeps finding bytes that match and just wants to slurp in as much as possible. On the decoding side, the potential problem is having to copy multiple GPLs from the input buffer: it's a pointer/length combination where the length overflows d.buf.
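For context, here is a rough sketch of how a caller might use a block-style API in the spirit of snappy-go. The exact Encode/Decode signatures are assumed from this description, not quoted from the patch.

```go
package main

import (
	"fmt"
	"log"

	lz4 "github.com/bkaradzic/go-lz4"
)

func main() {
	src := []byte("Go Go Go Go Go Go Go Go -- repetitive data compresses well")

	// Encode(dst, src): passing nil for dst lets the package allocate it.
	compressed, err := lz4.Encode(nil, src)
	if err != nil {
		log.Fatal(err)
	}

	decompressed, err := lz4.Decode(nil, compressed)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d bytes -> %d compressed -> %d decompressed\n",
		len(src), len(compressed), len(decompressed))
}
```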

bkaradzic commented 10 years ago

Yeah, you can submit a pull request; I'm not worried about breaking the streaming part. LZ4 now has a streaming spec, which didn't exist back when I worked on this (and my version doesn't really work :).

dgryski commented 10 years ago

Filed pull request #4 to fix the buffering issues.

bkaradzic commented 10 years ago

Thanks! :)

dgryski commented 10 years ago

Closing, as my pull request handles all the problems I had identified.