bkaradzic / go-lz4

Port of LZ4 lossless compression algorithm to Go
BSD 2-Clause "Simplified" License
211 stars 23 forks source link

Incompatibility with official LZ4 #15

Closed vhbit closed 9 years ago

vhbit commented 9 years ago

Setup:

The problem I have is that sometimes data from backend couldn't be decoded on client. The keyword is "sometimes". I've got one sample (55K compressed, 114K uncompressed).

I'm not sure yet which side contains bug but let me know if you want to check it out.

dgryski commented 9 years ago

Yes, this library should always maintain compatibility with the C version. It's great you have a test case. You can email them to me at my username at gmail.

vhbit commented 9 years ago

Thanks, sent.

vhbit commented 9 years ago

Here it goes (copy from lz4.c ):

        /* copy literals */
        cpy = op+length;
        if (((endOnInput) && ((cpy>(partialDecoding?oexit:oend-MFLIMIT)) || (ip+length>iend-(2+1+LASTLITERALS))) )
            || ((!endOnInput) && (cpy>oend-COPYLENGTH)))
        {
            if (partialDecoding)
            {
                if (cpy > oend)
                    goto _output_error;                           /* Error : write attempt beyond end of output buffer */
                if ((endOnInput) && (ip+length > iend))
                    goto _output_error;   /* Error : read attempt beyond end of input buffer */
            }
            else
            {
                if ((!endOnInput) && (cpy != oend))
                    goto _output_error;       /* Error : block decoding must stop exactly there */
                if ((endOnInput) && ((ip+length != iend) || (cpy > oend)))
                    goto _output_error;   /* Error : input must be consumed */
            }
            memcpy(op, ip, length);
            ip += length;
            op += length;
            break;     /* Necessarily EOF, due to parsing restrictions */
        }
        LZ4_WILDCOPY(op, ip, cpy); ip -= (op-cpy); op = cpy;

It fails in this block:

                if ((endOnInput) && ((ip+length != iend) || (cpy > oend)))
                    goto _output_error;   /* Error : input must be consumed */

and the reason is that because of 2 conditions are true (ip+length>iend-(2+1+LASTLITERALS)) and (ip+length != iend)

It seems that go-lz4 doesn't respect a rule regarding last literals:

There are specific parsing rules to respect to be compatible with the reference decoder : 1) The last 5 bytes are always literals 2) The last match cannot start within the last 12 bytes

Hope this helps.

dgryski commented 9 years ago

This is a fantastic help! I'll get this patched this weekend.

dgryski commented 9 years ago

Hi,

Using the following trivial unpacker ( https://gist.github.com/dgryski/dee804f1e448898238d5 ), I agree the gofmt.lz4 example you sent me fails to unpack. However, when I compress the provided sample input file with the current revision of go-lz4, I am able to unpack the resulting output file. What sha1 is your go-lz4 repository at? Are you using any other special flags?

(I have a small patch which ensures the output follows the two above restrictions, but I need to make sure the current version fails and that my patch then actually works.)

vhbit commented 9 years ago

Considering deps it is the latest commit. I'm not sure if there are any specific build options.

It's quite interesting may it be a misuse of the library?

dgryski commented 9 years ago

I'll look into how syncthing is using go-lz4.

dgryski commented 9 years ago

Ah, figured out my problem with testing. The compressed file and the uncompressed file you sent me are not the same data. If I uncompress gofmt.lz4, that results in a file that I can compress with go-lz4 that cannot be unpacked with the reference C implementation.

dgryski commented 9 years ago

However, I haven't figured out the patch to go-lz4 yet to fix this :/

dgryski commented 9 years ago

Here's what I have. It passes my test, and I think fixes the code to abide by the two rules you pasted above.

I'd love somebody else to have a look at this though before I push it.

@bkaradzic ?

diff --git a/writer.go b/writer.go
index 9e8a3c5..40ebaed 100644
--- a/writer.go
+++ b/writer.go
@@ -121,7 +121,7 @@ func Encode(dst, src []byte) ([]byte, error) {
        )

        for {
-               if int(e.pos)+4 >= len(e.src) {
+               if int(e.pos)+12 >= len(e.src) {
                        e.writeLiterals(uint32(len(e.src))-e.anchor, 0, e.anchor)
                        return e.dst[:e.dpos], nil
                }
@@ -158,7 +158,7 @@ func Encode(dst, src []byte) ([]byte, error) {
                ref += minMatch
                e.anchor = e.pos

-               for int(e.pos) < len(e.src) && e.src[e.pos] == e.src[ref] {
+               for int(e.pos) < len(e.src)-5 && e.src[e.pos] == e.src[ref] {
                        e.pos++
                        ref++
                }
bkaradzic commented 9 years ago

Yeah seems fine. Btw, incompatibility might come from original version I used for porting and this newer version of lz4. The only incompatibility I created was magic+size at beginning of file since original go-lz4 meant to be streaming.