ConorStokes / LZSSE

LZ77/LZSS designed for SSE based decompression
BSD 2-Clause "Simplified" License
135 stars 16 forks source link

Add functions to return the maximum compressed size #5

Closed nemequ closed 8 years ago

nemequ commented 8 years ago

It would be helpful if there were functions which returned the maximum amount of space needed to compress a buffer. Something like zlib's compressBound.

ConorStokes commented 8 years ago

I can add them, but for reference data that can't be compressed smaller than the input will just return the original data, so the maximum space required is always input size.

If you're getting data that compressed larger than input, let me know, as that's a bug. On 2 Mar 2016 4:38 pm, "Evan Nemerson" notifications@github.com wrote:

It would be helpful if there were functions which returned the maximum amount of space needed to compress a buffer. Something like zlib's compressBound.

— Reply to this email directly or view it on GitHub https://github.com/ConorStokes/LZSSE/issues/5.

nemequ commented 8 years ago

can add them, but for reference data that can't be compressed smaller than the input will just return the original data, so the maximum space required is always input size.

Okay, in that case I don't think it's necessary.

If you're getting data that compressed larger than input, let me know, as that's a bug.

Yeah, I am. The easiest way for me to generate examples was to throw together a quick program and fuzz it, and I'm getting lots of results very quickly. The most common issue is the compressed data being larger than the input, but I'm also seeing lots of cases where decompress(compress(input)) != input.

Anyways, here is the test program I'm using:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

#include "lzsse2.h"

static void
test_round_trip(const uint8_t* uncompressed, size_t uncompressed_size) {
  uint8_t* compressed = (uint8_t*) malloc(uncompressed_size * 2);
  assert(compressed != NULL);

  LZSSE2_OptimalParseState* state = LZSSE2_MakeOptimalParseState(uncompressed_size);
  assert(state != NULL);
  size_t compressed_size = LZSSE2_CompressOptimalParse(state, uncompressed, uncompressed_size, compressed, uncompressed_size * 2, 7);
  assert(compressed_size != 0);
  if(uncompressed_size < compressed_size) {
    fprintf(stderr, "%zu -> %zu\n", uncompressed_size, compressed_size);
    abort();
  }
  LZSSE2_FreeOptimalParseState(state);

  uint8_t* decompressed = (uint8_t*) malloc(uncompressed_size);
  size_t decompressed_size = LZSSE2_Decompress(compressed, compressed_size, decompressed, uncompressed_size);
  assert(decompressed_size == uncompressed_size);
  assert(memcmp(decompressed, uncompressed, uncompressed_size) == 0);

  free(decompressed);
  free(compressed);
}

int main(int argc, char** argv) {
  uint8_t* data = (uint8_t*) malloc(1024);
  size_t data_length = 0;

  while (!feof(stdin)) {
    const size_t bytes_read = fread(data + data_length, 1, 1024, stdin);
    data_length += bytes_read;
    data = (uint8_t*) realloc(data, data_length + 1024);
    assert(data != NULL);
  }

  test_round_trip(data, data_length);

  free(data);

  return EXIT_SUCCESS;
}

Archive with a bunch of crashes: http://code.coeusgroup.com/afl-results/4697755e-2236-433b-b407-64e534cdfc90.tar.xz

ConorStokes commented 8 years ago

Thanks, I'll have a look.

On Wed, Mar 2, 2016 at 6:32 PM, Evan Nemerson notifications@github.com wrote:

can add them, but for reference data that can't be compressed smaller than the input will just return the original data, so the maximum space required is always input size.

Okay, in that case I don't think it's necessary.

If you're getting data that compressed larger than input, let me know, as that's a bug.

Yeah, I am. The easiest way for me to generate examples was to throw together a quick program and fuzz it, and I'm getting lots of results very quickly. The most common issue is the compressed data being larger than the input, but I'm also seeing lots of cases where decompress(compress(input)) != input.

Anyways, here is the test program I'm using:

include

include

include

include

include

include "lzsse2.h"

static voidtest_round_trip(const uint8_t* uncompressed, size_t uncompressed_size) { uint8_t* compressed = (uint8_t) malloc(uncompressed_size \ 2); assert(compressed != NULL);

LZSSE2_OptimalParseState* state = LZSSE2_MakeOptimalParseState(uncompressed_size); assert(state != NULL); size_t compressed_size = LZSSE2_CompressOptimalParse(state, uncompressed, uncompressed_size, compressed, uncompressed_size * 2, 7); assert(compressed_size != 0); if(uncompressed_size < compressed_size) { fprintf(stderr, "%zu -> %zu\n", uncompressed_size, compressed_size); abort(); } LZSSE2_FreeOptimalParseState(state);

uint8_t* decompressed = (uint8_t*) malloc(uncompressed_size); size_t decompressed_size = LZSSE2_Decompress(compressed, compressed_size, decompressed, uncompressed_size); assert(decompressed_size == uncompressed_size); assert(memcmp(decompressed, uncompressed, uncompressed_size) == 0);

free(decompressed); free(compressed); } int main(int argc, char* argv) { uint8_t* data = (uint8_t) malloc(1024); size_t data_length = 0;

while (!feof(stdin)) { const size_t bytes_read = fread(data + data_length, 1, 1024, stdin); data_length += bytes_read; data = (uint8_t*) realloc(data, data_length + 1024); assert(data != NULL); }

test_round_trip(data, data_length);

free(data);

return EXIT_SUCCESS; }

Archive with a bunch of crashes: http://code.coeusgroup.com/afl-results/4697755e-2236-433b-b407-64e534cdfc90.tar.xz

— Reply to this email directly or view it on GitHub https://github.com/ConorStokes/LZSSE/issues/5#issuecomment-191175517.

ConorStokes commented 8 years ago

Okay, figured out one of the issues - wasn't rounding up the last set of controls to a full block in the parse calculation costs for the optimal parsers - causing a slight size under-estimation which on very particular data would cause it not to early-exit before the output parse. Early versions actually overlapped the last control block with the trailing data, so this was alright, but this slipped through when I changed it to always use full control blocks. I'll do some more testing and see if I can uncover any other issues.

On Wed, Mar 2, 2016 at 6:56 PM, Conor Stokes conor.stokes2@gmail.com wrote:

Thanks, I'll have a look.

On Wed, Mar 2, 2016 at 6:32 PM, Evan Nemerson notifications@github.com wrote:

can add them, but for reference data that can't be compressed smaller than the input will just return the original data, so the maximum space required is always input size.

Okay, in that case I don't think it's necessary.

If you're getting data that compressed larger than input, let me know, as that's a bug.

Yeah, I am. The easiest way for me to generate examples was to throw together a quick program and fuzz it, and I'm getting lots of results very quickly. The most common issue is the compressed data being larger than the input, but I'm also seeing lots of cases where decompress(compress(input)) != input.

Anyways, here is the test program I'm using:

include

include

include

include

include

include "lzsse2.h"

static voidtest_round_trip(const uint8_t* uncompressed, size_t uncompressed_size) { uint8_t* compressed = (uint8_t) malloc(uncompressed_size \ 2); assert(compressed != NULL);

LZSSE2_OptimalParseState* state = LZSSE2_MakeOptimalParseState(uncompressed_size); assert(state != NULL); size_t compressed_size = LZSSE2_CompressOptimalParse(state, uncompressed, uncompressed_size, compressed, uncompressed_size * 2, 7); assert(compressed_size != 0); if(uncompressed_size < compressed_size) { fprintf(stderr, "%zu -> %zu\n", uncompressed_size, compressed_size); abort(); } LZSSE2_FreeOptimalParseState(state);

uint8_t* decompressed = (uint8_t*) malloc(uncompressed_size); size_t decompressed_size = LZSSE2_Decompress(compressed, compressed_size, decompressed, uncompressed_size); assert(decompressed_size == uncompressed_size); assert(memcmp(decompressed, uncompressed, uncompressed_size) == 0);

free(decompressed); free(compressed); } int main(int argc, char* argv) { uint8_t* data = (uint8_t) malloc(1024); size_t data_length = 0;

while (!feof(stdin)) { const size_t bytes_read = fread(data + data_length, 1, 1024, stdin); data_length += bytes_read; data = (uint8_t*) realloc(data, data_length + 1024); assert(data != NULL); }

test_round_trip(data, data_length);

free(data);

return EXIT_SUCCESS; }

Archive with a bunch of crashes: http://code.coeusgroup.com/afl-results/4697755e-2236-433b-b407-64e534cdfc90.tar.xz

— Reply to this email directly or view it on GitHub https://github.com/ConorStokes/LZSSE/issues/5#issuecomment-191175517.

ConorStokes commented 8 years ago

Another issue found: if the very last control word is a 0 end to an extended match, but is the first control in a control block, a control block gets output that doesn't contribute to the input, so it adds no output to the stream (buffer checks exit at end of output, but input cursor hasn't advanced to the end). I'm fixing it by not outputting said last control block.

On Wed, Mar 2, 2016 at 8:32 PM, Conor Stokes conor.stokes2@gmail.com wrote:

Okay, figured out one of the issues - wasn't rounding up the last set of controls to a full block in the parse calculation costs for the optimal parsers - causing a slight size under-estimation which on very particular data would cause it not to early-exit before the output parse. Early versions actually overlapped the last control block with the trailing data, so this was alright, but this slipped through when I changed it to always use full control blocks. I'll do some more testing and see if I can uncover any other issues.

On Wed, Mar 2, 2016 at 6:56 PM, Conor Stokes conor.stokes2@gmail.com wrote:

Thanks, I'll have a look.

On Wed, Mar 2, 2016 at 6:32 PM, Evan Nemerson notifications@github.com wrote:

can add them, but for reference data that can't be compressed smaller than the input will just return the original data, so the maximum space required is always input size.

Okay, in that case I don't think it's necessary.

If you're getting data that compressed larger than input, let me know, as that's a bug.

Yeah, I am. The easiest way for me to generate examples was to throw together a quick program and fuzz it, and I'm getting lots of results very quickly. The most common issue is the compressed data being larger than the input, but I'm also seeing lots of cases where decompress(compress(input)) != input.

Anyways, here is the test program I'm using:

include

include

include

include

include

include "lzsse2.h"

static voidtest_round_trip(const uint8_t* uncompressed, size_t uncompressed_size) { uint8_t* compressed = (uint8_t) malloc(uncompressed_size \ 2); assert(compressed != NULL);

LZSSE2_OptimalParseState* state = LZSSE2_MakeOptimalParseState(uncompressed_size); assert(state != NULL); size_t compressed_size = LZSSE2_CompressOptimalParse(state, uncompressed, uncompressed_size, compressed, uncompressed_size * 2, 7); assert(compressed_size != 0); if(uncompressed_size < compressed_size) { fprintf(stderr, "%zu -> %zu\n", uncompressed_size, compressed_size); abort(); } LZSSE2_FreeOptimalParseState(state);

uint8_t* decompressed = (uint8_t*) malloc(uncompressed_size); size_t decompressed_size = LZSSE2_Decompress(compressed, compressed_size, decompressed, uncompressed_size); assert(decompressed_size == uncompressed_size); assert(memcmp(decompressed, uncompressed, uncompressed_size) == 0);

free(decompressed); free(compressed); } int main(int argc, char* argv) { uint8_t* data = (uint8_t) malloc(1024); size_t data_length = 0;

while (!feof(stdin)) { const size_t bytes_read = fread(data + data_length, 1, 1024, stdin); data_length += bytes_read; data = (uint8_t*) realloc(data, data_length + 1024); assert(data != NULL); }

test_round_trip(data, data_length);

free(data);

return EXIT_SUCCESS; }

Archive with a bunch of crashes: http://code.coeusgroup.com/afl-results/4697755e-2236-433b-b407-64e534cdfc90.tar.xz

— Reply to this email directly or view it on GitHub https://github.com/ConorStokes/LZSSE/issues/5#issuecomment-191175517.

nemequ commented 8 years ago

Here are some more examples, this time also including some for lzsse8: http://code.coeusgroup.com/afl-results/33b4af09-22b3-465e-bde0-c428f372d1c3.tar.xz

ConorStokes commented 8 years ago

Yeah, I have a set of fixes that have been applied to both which I will merge in tonight, currently all compressors pass that first set, I will run this set too. On 3 Mar 2016 8:24 am, "Evan Nemerson" notifications@github.com wrote:

Here are some more examples, this time also including some for lzsse8: http://code.coeusgroup.com/afl-results/33b4af09-22b3-465e-bde0-c428f372d1c3.tar.xz

— Reply to this email directly or view it on GitHub https://github.com/ConorStokes/LZSSE/issues/5#issuecomment-191506674.

ConorStokes commented 8 years ago

The first set of fixes are up now.

On Thu, Mar 3, 2016 at 8:27 AM, Conor Stokes conor.stokes2@gmail.com wrote:

Yeah, I have a set of fixes that have been applied to both which I will merge in tonight, currently all compressors pass that first set, I will run this set too. On 3 Mar 2016 8:24 am, "Evan Nemerson" notifications@github.com wrote:

Here are some more examples, this time also including some for lzsse8: http://code.coeusgroup.com/afl-results/33b4af09-22b3-465e-bde0-c428f372d1c3.tar.xz

— Reply to this email directly or view it on GitHub https://github.com/ConorStokes/LZSSE/issues/5#issuecomment-191506674.

ConorStokes commented 8 years ago

With the fixes applied to both LZSSE2 and LZSSE8, I'm not getting any results larger than uncompressed, or any mismatches between original and round-tripped data on any of the fuzzing samples provided, but I'm using a Visual Studio compiled version for the test. I can't rule out any of the issues being platform/compiler specific.

nemequ commented 8 years ago

Everything works for me, too. I let the fuzzer run overnight against the latest version, no issues. Closing.

I'll try to start fuzzing the decompressor soon; that's where most issues in compression libraries usually come up.