ebiggers / libdeflate

Heavily optimized library for DEFLATE/zlib/gzip compression and decompression
MIT License

Deflate Compressor Length #69

Closed Bit00009 closed 3 years ago

Bit00009 commented 4 years ago

Hi there, dear @ebiggers. First, thank you for this amazing library. I just came from zlib to your library and I have some difficulties: I want to do a very simple deflate compression on a buffer, and I can't figure out how to deal with the length.

I know there are two ways to deal with the length in gzip and lz4. I've worked with them before, and I was adding the original data size as a little header in my compressed data, but in this case every single byte is important to me.

In C# there's an internal compressor known as DeflateStream; we just pass a byte array and get a byte array back:

public static byte[] Compress(byte[] data)
{
    MemoryStream output = new MemoryStream();
    using (DeflateStream dstream = new DeflateStream(output, CompressionLevel.Optimal))
    {
        dstream.Write(data, 0, data.Length);
    }
    return output.ToArray();
}
public static byte[] Decompress(byte[] data)
{
    MemoryStream input = new MemoryStream(data);
    MemoryStream output = new MemoryStream();
    using (DeflateStream dstream = new DeflateStream(input, CompressionMode.Decompress))
    {
        dstream.CopyTo(output);
    }
    return output.ToArray();
}

The result data doesn't contain anything related to the original data size. So I tried to do the same using libdeflate, and here's my current code:

#include <iostream>
#include <vector>
#include <fstream>
#include <iterator>

#include "libdeflate.h"
#pragma comment( lib , "libdeflatestatic.lib" )

using namespace std;

#define LOG(x) printf("%s\n",x)
#define LOGVAL(valname,val) printf("Value `%s` = %zu\n",valname,(size_t)(val))

int main() 
{
    struct libdeflate_compressor* compressor;
    struct libdeflate_decompressor* decompressor;
    ofstream fout;

    std::ifstream input("inputdata.dta", std::ios::binary);

    std::vector<unsigned char> buffer(std::istreambuf_iterator<char>(input), {});

    const size_t originalDataSize = buffer.size();

    // Compress
    compressor = libdeflate_alloc_compressor(9);
    // Size the output buffer with the library's worst-case bound:
    // deflate can expand incompressible data beyond the input size.
    std::vector<unsigned char> outdata(libdeflate_deflate_compress_bound(compressor, originalDataSize));

    LOG("Compressing...");
    size_t cmpsize = libdeflate_deflate_compress(compressor, buffer.data(), originalDataSize, outdata.data(), outdata.size());

    LOGVAL("cmpsize",cmpsize);

    fout.open("compressed.bin", ios::binary | ios::out);
    fout.write((char*)outdata.data(), cmpsize);
    fout.close();

    libdeflate_free_compressor(compressor);

    // Decompress
    size_t dec_size;
    std::vector<unsigned char> dec_outdata;
    dec_outdata.resize(originalDataSize);

    decompressor = libdeflate_alloc_decompressor();
    libdeflate_result res = libdeflate_deflate_decompress(decompressor, outdata.data(), cmpsize, dec_outdata.data(), originalDataSize, &dec_size);
    if (res != LIBDEFLATE_SUCCESS)
        LOG("Decompression failed!");

    fout.open("decompressed.bin", ios::binary | ios::out);
    fout.write((char*)dec_outdata.data(), dec_size);
    fout.close();

    libdeflate_free_decompressor(decompressor);

    LOG("App finished.");
}

As you can see, I always need originalDataSize to make compression/decompression work. I don't know if I'm doing something wrong, but I need it to work the way C# DeflateStream does: pass a new vector or byte array without knowing the original size and do the decompression.

I want a pure deflate result from compression, without adding any metadata or extra information. I would be very grateful if you could tell me how this is possible in C++ with your library.

Regards, Ultran

ebiggers commented 4 years ago

You'd need to guess some uncompressed size, allocate a buffer of that size, then try decompressing into it. If it fails with LIBDEFLATE_INSUFFICIENT_SPACE, enlarge the buffer and try again. And so on.
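A minimal sketch of that grow-and-retry loop, written generically: `try_decompress` is a hypothetical stand-in for a call to `libdeflate_deflate_decompress()`, which must fill in the actual size and return `false` when the output buffer is too small (the equivalent of `LIBDEFLATE_INSUFFICIENT_SPACE`).

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Grow-and-retry decompression for a raw DEFLATE buffer of unknown
// uncompressed size. `try_decompress` attempts to decompress into `out`,
// sets `actual_size` on success, and returns false if `out` is too small.
std::vector<unsigned char> decompress_unknown_size(
    const std::function<bool(std::vector<unsigned char>&, std::size_t&)>& try_decompress,
    std::size_t initial_guess)
{
    std::vector<unsigned char> out(initial_guess);
    std::size_t actual_size = 0;
    while (!try_decompress(out, actual_size))
        out.resize(out.size() * 2);  // enlarge the buffer and try again
    out.resize(actual_size);         // shrink to the real uncompressed size
    return out;
}
```

Note that each failed attempt re-decompresses from the start, which is why this gets slow for large inputs.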

It's not really a good solution. It's much better to just store the uncompressed size along with the compressed data.

It only takes a few bytes to store the uncompressed size -- or less if you store an approximate size only.
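For example, a 4-byte little-endian size prefix is enough for data up to 4 GB. These helpers are a hypothetical sketch (not part of libdeflate) of framing the compressed data that way:

```cpp
#include <cstdint>
#include <vector>

// Prepend the uncompressed size as a 4-byte little-endian header so the
// decompressor can allocate an exact output buffer. Assumes sizes fit in
// 32 bits; use 8 bytes for larger data.
std::vector<unsigned char> with_size_header(const std::vector<unsigned char>& compressed,
                                            std::uint32_t uncompressed_size)
{
    std::vector<unsigned char> out;
    out.reserve(compressed.size() + 4);
    for (int i = 0; i < 4; ++i)
        out.push_back((uncompressed_size >> (8 * i)) & 0xFF);
    out.insert(out.end(), compressed.begin(), compressed.end());
    return out;
}

std::uint32_t read_size_header(const std::vector<unsigned char>& framed)
{
    std::uint32_t size = 0;
    for (int i = 0; i < 4; ++i)
        size |= std::uint32_t(framed[i]) << (8 * i);
    return size;
}
```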

Note that your code snippet only uses compression level 9, while libdeflate goes up to level 12. You probably could save much more than a few bytes by using a higher compression level.

Bit00009 commented 4 years ago

Hello @ebiggers, thanks for your fast reply. Yes, I'm currently using guessing, but it's extremely slow once my data grows beyond 5 MB. I'm really curious how C# DeflateStream manages to handle the same situation while still being fast enough; the source is available here.

I noticed there are buffer size and window size values in the .NET DeflateStream:

        internal const int DefaultBufferSize = 8192;
        private const int WindowSizeUpperBound = 47;

What do you think about it? How does it handle an unknown size?

It's much better to just store the uncompressed size along with the compressed data.

I want to avoid splitting the data as much as I can, but if it's my last resort, I'll do it. Still, I'm dying of curiosity about how .NET manages deflate compression/decompression.

Also, I tried very small test data (24 bytes) from a DeflateStream-compressed file, but libdeflate returns 0 bytes and the compressed file is empty. I attached the files. InputData.txt

This is deflatestream output compressed-deflatestream.gz

Regards, Ultran

ebiggers commented 4 years ago

What do you think about it? How does it handle an unknown size?

Your C# code snippet streams the data into a MemoryStream, which implements an automatically resizing array. So it can end up copying the data many times as it incrementally reallocates the array, and finish with a buffer up to 2x larger than required.

You can improve the performance of both the C# version and the libdeflate version, and also make it much easier to use libdeflate, by storing the uncompressed size along with the compressed data.

Also, I tried very small test data (24 bytes) from a DeflateStream-compressed file, but libdeflate returns 0 bytes and the compressed file is empty.

Check libdeflate.h for how to use the API.

dmitry-azaraev commented 4 years ago

@Bit00009 Internally, any deflate decoder relies on the BFINAL bit in the stream, so decoders generally know when to stop. There is also a kind of natural chunking in deflate, so a decoder can be adapted to work with input/output buffers presented behind a Stream-like facade.

Technically, at least for decompression, libdeflate could be extended to work with chunks (I guess this would need injection points in decompress_template.h around next_block / block_done), but while that might look easy, it is not. (And I have no idea about the compression side at all.)

A "Stream"-like facade is not as simple as it may look. Imagine a file containing 80,000 small compressed blocks (say 500 bytes to 1.5 KB each) written into the stream one by one. Each instance of DeflateStream buffers input (from the file stream) and then decodes it. So after the first decoded block (say it was 500 bytes), it has actually consumed 8 KB (its internal buffer), and the underlying file stream no longer points at the next compressed block, as it should (did you know this?). DeflateStream doesn't even report how many bytes it really consumed. In my case I had the compressed data size, so I had options: create another stream that limits how many bytes DeflateStream may read (just return EOF after N bytes), or use a codec like libdeflate that works with raw buffers; skipping all that machinery saves a lot of allocations and gives at least twice the throughput. Buffered streams that expose their internal buffer to the consumer (the consumer being something like DeflateStream) would be more flexible, but that's not an option in .NET, and generally it's not a popular option at all.
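The raw-buffer approach described above can be sketched as a simple framing scheme (a hypothetical format, not part of libdeflate): store each block as a 4-byte compressed size, a 4-byte uncompressed size, then the compressed bytes, so every block can be located by offset arithmetic and handed directly to a raw-buffer decompressor.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-block frame: [u32 comp_size][u32 orig_size][comp bytes].
struct BlockView {
    std::uint32_t comp_size;
    std::uint32_t orig_size;
    std::size_t data_offset;  // offset of the compressed bytes in the file
};

static std::uint32_t read_u32le(const std::vector<unsigned char>& buf, std::size_t pos)
{
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
        v |= std::uint32_t(buf[pos + i]) << (8 * i);
    return v;
}

// Walk the whole file once and collect the location of every block;
// each block can then be decompressed independently, with no Stream
// machinery and no over-read past the block boundary.
std::vector<BlockView> index_blocks(const std::vector<unsigned char>& file)
{
    std::vector<BlockView> blocks;
    std::size_t pos = 0;
    while (pos + 8 <= file.size()) {
        BlockView b;
        b.comp_size = read_u32le(file, pos);
        b.orig_size = read_u32le(file, pos + 4);
        b.data_offset = pos + 8;
        blocks.push_back(b);
        pos = b.data_offset + b.comp_size;
    }
    return blocks;
}
```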

I'm generally trying to show that blindly reproducing the generic Stream-like interface is a way to end up with a library that has a clumsy, hard-to-use interface. Which interface would be easy to use and satisfy all requirements? I don't know.

jibsen commented 4 years ago

I guess there could be a function that simulates decompression without actually writing any output. It might be useful for checking whether a given block of data contains a valid deflate stream, and it could report the number of input bytes consumed and output bytes produced, which would be useful here. This might be outside the scope of libdeflate, though.

ebiggers commented 4 years ago

I guess there could be a function that simulates decompression without actually writing any output. It might be useful for checking whether a given block of data contains a valid deflate stream, and it could report the number of input bytes consumed and output bytes produced, which would be useful here. This might be outside the scope of libdeflate, though.

Sure, but that would encourage people to do the wrong thing (decompress the data twice) instead of the right thing (store the uncompressed size along with the compressed data). And I'd like to keep the API simple. So I think it's better to not add it, and instead just encourage people to do the right thing.

wegylexy commented 2 years ago

When the deflated stream is several MB, a partial read results in LIBDEFLATE_INSUFFICIENT_SPACE or DestinationTooSmall. It is unreasonable to require two contiguous blocks of memory that large. For example, a 4 MB git packfile may inflate to 13 MB.