LZ4 decompression for unknown original size

ph1lm commented 8 months ago

Is your feature request related to a problem? Please describe. If we use LZ4 compression for replication, client code has to be adjusted to decompress values. To decompress a value of unknown length, we either have to reserve a large buffer, since we don't know the original size of the value, or implement tricky logic that grows the buffer incrementally whenever a decompressed value doesn't fit.

Describe the solution you'd like Write the original uncompressed size as a leading integer in the compressed value byte array. That way, we will know the exact size of the buffer to use for decompression on the client side. Or just use LZ4CompressorWithLength. Similar to this.

Describe alternatives you've considered I'd also appreciate any suggestions on how to efficiently decompress a value of unknown length with lz4-java.
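For illustration, the "grow the buffer incrementally" workaround described above might look roughly like the sketch below. It is not tied to any particular LZ4 binding: `try_decompress` is a hypothetical callable (an assumption, not an lz4-java or python-lz4 API) that takes the compressed bytes and an output capacity and raises `ValueError` when the capacity is too small.

```python
from typing import Callable


def decompress_growing(
    compressed: bytes,
    try_decompress: Callable[[bytes, int], bytes],  # hypothetical decompressor
    initial_capacity: int = 64,
    max_capacity: int = 1 << 30,
) -> bytes:
    """Retry decompression with a doubling output capacity until it fits."""
    capacity = initial_capacity
    while capacity <= max_capacity:
        try:
            return try_decompress(compressed, capacity)
        except ValueError:
            # Assumed failure signal: the output did not fit, so double and retry.
            capacity *= 2
    raise ValueError("decompressed value exceeds max_capacity")
```

Every failed attempt repeats the decompression work, which is why a stored length prefix is the cheaper option.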

ph1lm commented 8 months ago

Thanks @nwheeler81 for fixing it so quickly!

I think we may want to change

val sizePrefix = ByteBuffer.allocate(4).putInt(input.length).array()

to

val sizePrefix = ByteBuffer.allocate(4).putInt(inputBytes.length).array()

Otherwise, the size will be wrong for strings containing Unicode characters that take more than one byte when encoded.
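A small sketch of why the two lengths differ, written in Python for brevity. `prefix_lengths` is a hypothetical helper, and it assumes the value is encoded as UTF-8. Python's `len` counts code points while a JVM string length counts UTF-16 code units, but the two agree for the sample strings used here.

```python
from typing import Tuple


def prefix_lengths(text: str) -> Tuple[int, int]:
    """Return (character count, UTF-8 byte count) for a string."""
    char_count = len(text)                  # what a string length reports
    byte_count = len(text.encode("utf-8"))  # what the size prefix must hold
    return char_count, byte_count


# ASCII-only input: both counts agree, so the bug stays hidden.
print(prefix_lengths("test" * 32))
# Non-ASCII input: the byte count is larger than the character count.
print(prefix_lengths("naïve ☕"))
```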

Also, any reason to not use LZ4CompressorWithLength?

nwheeler81 commented 8 months ago

@ph1lm thanks for pointing out the bug. is this feature supported among other LZ4 ports, e.g. python-lz4?

ph1lm commented 8 months ago

@nwheeler81 yes, it's readable with python-lz4

This is the code I've used to test it

In Java, to generate the lz4 file:

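import com.google.common.base.Strings; // Guava
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4CompressorWithLength;
import net.jpountz.lz4.LZ4Factory;

public class Main {
  private static final String STR = Strings.repeat("test", 32);

  public static void main(String[] args) throws IOException {
    LZ4Factory factory = LZ4Factory.fastestInstance();
    LZ4Compressor compressor = factory.fastCompressor();
    // Wraps the compressor so the uncompressed length is written before the compressed block
    LZ4CompressorWithLength compressorWithLength = new LZ4CompressorWithLength(compressor);
    byte[] compressed = compressorWithLength.compress(STR.getBytes(StandardCharsets.UTF_8));
    Files.write(Paths.get("/tmp/test.gz"), compressed);
  }
}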

and in python3, to read and decompress it:

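import lz4.block

# Read the whole file, including the 4-byte length prefix
with open('/tmp/test.gz', 'rb') as f:
    b = f.read()

# python-lz4 reads the uncompressed size from the prefix by default
data = lz4.block.decompress(b)

print(data.decode('utf-8'))
# Decode the prefix by hand to show it holds the original size
print(int.from_bytes(b[:4], 'little'))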

and it gives me

python3 test.py

testtesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttest
128

Note that we don't even need to skip the first 4 bytes when using python-lz4. It uses the same scheme as lz4-java to store the uncompressed size.

On the other hand, Golang's lz4 module doesn't have this feature, so you'll have to handle the prefix manually:

package main

import (
    "encoding/binary"
    "fmt"
    "os"

    "github.com/pierrec/lz4"
)

func main() {
    s, err := os.ReadFile("/tmp/test.gz")
    if err != nil {
        panic(err)
    }

    // The first 4 bytes hold the uncompressed size, little-endian
    size := binary.LittleEndian.Uint32(s[:4])

    d := make([]byte, size)

    // Skip the prefix and decompress the raw LZ4 block
    n, err := lz4.UncompressBlock(s[4:], d)
    if err != nil {
        panic(err)
    }

    fmt.Println(string(d[:n]))
}

So, IMHO, it's safe to switch your code to LZ4CompressorWithLength. But I'd also provide more details in the README about how the length is stored. In particular, I'd mention that it's a 32-bit integer in little-endian byte order, so that people can decode it correctly if their lz4 library doesn't support the prefix.
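For a README, the layout could be illustrated with a short decoding sketch like the one below. It uses only the Python standard library, and `split_length_prefix` is a hypothetical helper name, not part of any lz4 package. It assumes the layout discussed in this thread: a 4-byte little-endian signed integer followed by the raw LZ4 block.

```python
import struct
from typing import Tuple


def split_length_prefix(blob: bytes) -> Tuple[int, bytes]:
    """Split a length-prefixed value into (uncompressed size, raw LZ4 block)."""
    if len(blob) < 4:
        raise ValueError("value is too short to contain a length prefix")
    # "<i" means little-endian signed 32-bit integer
    (size,) = struct.unpack_from("<i", blob, 0)
    if size < 0:
        raise ValueError("negative length prefix")
    return size, blob[4:]
```

The returned size can be used to allocate the output buffer, and the remaining bytes can be passed to whatever raw block decompressor the client's lz4 library provides.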

nwheeler81 commented 8 months ago

@ph1lm LGTM, I will use LZ4CompressorWithLength

aws-samples / cql-replicator

LZ4 decompression for unknown original size #125