ehrmann / vcdiff-java

A Java encoder/decoder for the VCDIFF (RFC 3284) format

ByteBuffer or InputStream support for VCDiffEncoder Dictionary #6

Open Omkar-Shetkar opened 4 years ago

Omkar-Shetkar commented 4 years ago

A VCDiffEncoder can be created using VCDiffEncoderBuilder. Currently, the source content (the dictionary) has to be passed to withDictionary() as a byte[]:


public synchronized VCDiffEncoderBuilder withDictionary(byte[] dictionary) {
    this.dictionary = dictionary;
    return this;
}

Source content can be larger than 1 GB. For better performance with large files, I think the dictionary could be accepted as either a ByteBuffer or an InputStream. I think this is the most common use case when using this library with large files. Will you please consider this change for your next release? Thanks.
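
For reference, this is roughly how we build the encoder today; the whole dictionary has to be read into memory up front. (A sketch: the file paths are placeholders, and the builder/encode() calls follow the project README.)

import com.davidehrmann.vcdiff.VCDiffEncoder;
import com.davidehrmann.vcdiff.VCDiffEncoderBuilder;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CurrentUsage {
    public static void main(String[] args) throws IOException {
        // The entire dictionary must be materialized as a byte[] first.
        byte[] dictionary = Files.readAllBytes(Paths.get("dictionary.bin"));
        byte[] target = Files.readAllBytes(Paths.get("target.bin"));

        VCDiffEncoder<OutputStream> encoder = VCDiffEncoderBuilder.builder()
                .withDictionary(dictionary)
                .buildSimple();

        ByteArrayOutputStream delta = new ByteArrayOutputStream();
        encoder.encode(target, delta);
    }
}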

ehrmann commented 4 years ago

A ByteBuffer would be pretty straightforward since the backing code already uses one. Assuming it's a MappedByteBuffer, you'd see a performance hit while encoding because the dictionary isn't in memory.

Are you just looking for better initialization performance? The dictionary needs to be loaded into memory as soon as the encoder is created/used, so the only benefit to a ByteBuffer or InputStream would be pipelining one of the load steps. The byte[] that's passed in is used internally.

Omkar-Shetkar commented 4 years ago

If I understood correctly, for encoding we need to have the whole dictionary content in memory. If so, I think for large files and high-traffic applications this could cause out-of-memory issues. I was wondering whether there is any way we can provide the dictionary in chunks, similar to VCDiffStreamingDecoder.decodeChunk() while decoding.
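
On the decoding side, this pattern only needs one chunk of the delta at a time, but still the whole dictionary. (A rough sketch: I'm assuming the streaming decoder mirrors open-vcdiff's startDecoding()/decodeChunk()/finishDecoding() flow, so treat the exact signatures as assumptions.)

import com.davidehrmann.vcdiff.VCDiffDecoderBuilder;
import com.davidehrmann.vcdiff.VCDiffStreamingDecoder;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

public class ChunkedDecode {
    static void decode(byte[] dictionary, InputStream delta, OutputStream target)
            throws IOException {
        VCDiffStreamingDecoder decoder = VCDiffDecoderBuilder.builder().buildStreaming();

        // The delta arrives chunk by chunk, but the dictionary is still one big byte[].
        decoder.startDecoding(dictionary);
        byte[] buffer = new byte[64 * 1024];
        for (int read; (read = delta.read(buffer)) != -1; ) {
            decoder.decodeChunk(Arrays.copyOf(buffer, read), target);
        }
        decoder.finishDecoding();
    }
}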

ehrmann commented 4 years ago

for encoding we need to have the whole dictionary content in memory

More or less (ignoring memory-mapped files and swapping). The next chunk of data could reference any part of the dictionary, and you'd have to check it.

I was wondering whether there is any way we can provide the dictionary in chunks

Both encoding and decoding can be done on chunks of data: each chunk can be compressed by looking at the dictionary (or previous output), and its output written out as a chunk. This doesn't work for dictionaries because any part of the dictionary can be referenced during encoding and decoding.
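
To make that concrete, here's roughly what chunked encoding looks like with the streaming encoder. (A sketch: buildStreaming() and the start/encode/finish calls follow the open-vcdiff-style streaming API, so treat the exact signatures as assumptions.)

import com.davidehrmann.vcdiff.VCDiffEncoderBuilder;
import com.davidehrmann.vcdiff.VCDiffStreamingEncoder;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

public class ChunkedEncode {
    static void encode(byte[] dictionary, InputStream target, OutputStream delta)
            throws IOException {
        // The dictionary is fully in memory; only the target data streams through.
        VCDiffStreamingEncoder<OutputStream> encoder = VCDiffEncoderBuilder.builder()
                .withDictionary(dictionary)
                .buildStreaming();

        encoder.startEncoding(delta);
        byte[] buffer = new byte[64 * 1024];
        for (int read; (read = target.read(buffer)) != -1; ) {
            // Each chunk can copy from anywhere in the dictionary, which is
            // exactly why the dictionary itself can't be streamed in chunks.
            encoder.encodeChunk(Arrays.copyOf(buffer, read), delta);
        }
        encoder.finishEncoding(delta);
    }
}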

for large files and high-traffic applications this could cause out-of-memory issues

You can share the same dictionary byte[] between requests (vcdiff-java doesn't modify it), but yes, you could see issues. Using a mapped ByteBuffer would also cause a lot of I/O. The dictionary also gets turned into a BlockHash for fast lookups during encoding, and that has memory overhead of its own.
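
Sharing looks something like this. (A sketch: the dictionary path is a placeholder, and I'm only claiming the byte[] itself is safe to share; each encoder build still pays the BlockHash construction cost.)

import com.davidehrmann.vcdiff.VCDiffEncoder;
import com.davidehrmann.vcdiff.VCDiffEncoderBuilder;

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SharedDictionaryEncoding {
    // Loaded once per process and shared; vcdiff-java doesn't modify the array.
    private static final byte[] DICTIONARY;
    static {
        try {
            DICTIONARY = Files.readAllBytes(Paths.get("dictionary.bin")); // placeholder path
        } catch (IOException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static void encodeRequest(byte[] payload, OutputStream out) throws IOException {
        // Only the byte[] is shared; each request builds its own encoder,
        // which still constructs a BlockHash over the dictionary.
        VCDiffEncoder<OutputStream> encoder = VCDiffEncoderBuilder.builder()
                .withDictionary(DICTIONARY)
                .buildSimple();
        encoder.encode(payload, out);
    }
}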

Adding support for a ByteBuffer dictionary is pretty straightforward, but I'm not sure it's what you really want. It sounds like a 1 GB dictionary is too big for where you're running. You might want to gzip the compressed data instead; vcdiff doesn't do any Huffman coding. This is a little like using xz: depending on the settings, it's easy for it to use more memory than your system has.
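
If entropy coding is what you're after, wrapping the delta output in a GZIPOutputStream from java.util.zip is the cheap version. (A sketch: the encode() call assumes the simple encoder API from the README.)

import com.davidehrmann.vcdiff.VCDiffEncoder;
import com.davidehrmann.vcdiff.VCDiffEncoderBuilder;

import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class GzippedDelta {
    static void encode(byte[] dictionary, byte[] target, OutputStream out)
            throws IOException {
        VCDiffEncoder<OutputStream> encoder = VCDiffEncoderBuilder.builder()
                .withDictionary(dictionary)
                .buildSimple();

        // gzip layers on the Huffman coding that VCDIFF output lacks.
        // Note: closing the gzip stream also closes the underlying stream.
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            encoder.encode(target, gzip);
        }
    }
}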