ph1lm closed this issue 7 months ago
Thanks @nwheeler81 for fixing it so quickly!
I think we may want to change
val sizePrefix = ByteBuffer.allocate(4).putInt(input.length).array()
to
val sizePrefix = ByteBuffer.allocate(4).putInt(inputBytes.length).array()
Otherwise the size will be wrong for strings containing Unicode characters that take more than one byte when encoded.
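To illustrate the mismatch, here is a minimal Python sketch (not the project's code; the helper names `char_length` and `utf8_length` are made up for this example). It assumes the value is encoded as UTF-8. Note that Python's `len()` counts code points while Java's `String.length()` counts UTF-16 code units, but for the sample below both give the same number.

```python
def char_length(text: str) -> int:
    # What a prefix based on the string length would record.
    return len(text)


def utf8_length(text: str) -> int:
    # What the prefix should record: the size of the encoded byte array.
    return len(text.encode("utf-8"))


ascii_sample = "test" * 32
unicode_sample = "h\u00e9llo w\u00f6rld"  # "héllo wörld"

# For pure ASCII both counts agree, so the bug stays hidden.
print(char_length(ascii_sample), utf8_length(ascii_sample))      # 128 128
# With two 2-byte characters the byte count is larger.
print(char_length(unicode_sample), utf8_length(unicode_sample))  # 11 13
```

A prefix written from the character count would under-report the buffer size by 2 bytes here, so a client sizing its buffer from it would fail to decompress the value.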
Also, is there any reason not to use LZ4CompressorWithLength?
@ph1lm thanks for pointing out the bug. Is this feature supported by other LZ4 ports, e.g. python-lz4?
@nwheeler81 yes, it's readable with python-lz4
This is the code I've used to test it
In Java, to generate lz4 file:
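import com.google.common.base.Strings; // Guava, assumed from Strings.repeat
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4CompressorWithLength;
import net.jpountz.lz4.LZ4Factory;

public class Main {
    private static final String STR = Strings.repeat("test", 32);

    public static void main(String[] args) throws IOException {
        LZ4Factory factory = LZ4Factory.fastestInstance();
        LZ4Compressor compressor = factory.fastCompressor();
        // Wraps the compressor so the uncompressed size is written before the data
        LZ4CompressorWithLength compressorWithLength = new LZ4CompressorWithLength(compressor);
        byte[] compressed = compressorWithLength.compress(STR.getBytes(StandardCharsets.UTF_8));
        Files.write(Paths.get("/tmp/test.gz"), compressed);
    }
}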
and in python3, to read and decompress it:
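import lz4.block

with open('/tmp/test.gz', 'rb') as f:
    data = f.read()

# decompress() reads the leading size prefix itself
text = lz4.block.decompress(data)
print(text.decode('utf-8'))
# The first 4 bytes hold the uncompressed size (little-endian)
print(int.from_bytes(data[:4], 'little'))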
and it gives me
python3 test.py
testtesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttesttest
128
Note that we don't even need to skip the first 4 bytes when using python-lz4. It uses the same scheme as java-lz4 to store the uncompressed size.
On the other hand, Go's lz4 module doesn't have this feature, so you'll have to deal with the prefix manually:
package main

import (
	"encoding/binary"
	"os"

	"github.com/pierrec/lz4"
)

func main() {
	s, err := os.ReadFile("/tmp/test.gz")
	if err != nil {
		panic(err)
	}
	// The first 4 bytes hold the uncompressed size (little-endian).
	size := binary.LittleEndian.Uint32(s[:4])
	d := make([]byte, size)
	// Decompress the remaining bytes into the exactly sized buffer.
	_, err = lz4.UncompressBlock(s[4:], d)
	if err != nil {
		panic(err)
	}
	println(string(d))
}
So, IMHO, it's safe to switch your code to LZ4CompressorWithLength.
But I'd also provide more details in the README about how the length is stored. In particular, I'd mention that it's an int32 with little-endian ordering - just to help people decode it correctly if their lz4 lib doesn't support it.
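For a README, the layout could be sketched roughly like this in Python using only the standard library. The helper names `add_length_prefix` and `split_length_prefix` are hypothetical, and the payload is treated as opaque bytes, so this only shows the framing (a 4-byte little-endian int32 followed by the compressed block), not the LZ4 step itself.

```python
import struct
from typing import Tuple

# int32, little-endian: the prefix layout described in this thread.
PREFIX = struct.Struct("<i")


def add_length_prefix(original_size: int, compressed: bytes) -> bytes:
    # Prepend the uncompressed size to an already compressed block.
    return PREFIX.pack(original_size) + compressed


def split_length_prefix(blob: bytes) -> Tuple[int, bytes]:
    # Return (uncompressed size, compressed block without the prefix).
    if len(blob) < PREFIX.size:
        raise ValueError("blob is too short to contain a length prefix")
    (size,) = PREFIX.unpack_from(blob, 0)
    if size < 0:
        raise ValueError("negative length prefix")
    return size, blob[PREFIX.size:]


framed = add_length_prefix(128, b"payload")
print(framed[:4].hex())             # 80000000
print(split_length_prefix(framed))  # (128, b'payload')
```

A client whose LZ4 library lacks built-in support can use the returned size to allocate the destination buffer exactly, then pass the remaining bytes to its block decompressor.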
@ph1lm LGTM, I will use LZ4CompressorWithLength
Is your feature request related to a problem? Please describe.
If we use LZ4 compression for replication, then client code has to be adjusted to decompress values. To decompress a value of unknown length, we either have to reserve a large buffer, since we don't know the original size of the value, or implement tricky logic that grows the buffer incrementally whenever the decompressed value doesn't fit.

Describe the solution you'd like
Write the original uncompressed size (maxCompressedLength) as a leading integer in the compressed value byte array. That way, we will know the exact size of the buffer to use for decompression on the client side. Or just use LZ4CompressorWithLength. Similar to this.

Describe alternatives you've considered
I'd also appreciate any solution for efficiently decompressing a value of unknown length with lz4-java.