coderforlife / ms-compress

Open source implementations of Microsoft compression algorithms

No ms_uncompressed_size function #9

Closed nemequ closed 9 years ago

nemequ commented 9 years ago

I see lznt1 and xpress have functions for it, but there is nothing in the mscomp.h API. Do you not want to expose it (for example, because LZX or Xpress-huffman will not support it, it is unreliable, etc.), or is that just an oversight? If it is an oversight, is the function reliable for LZNT1 and Xpress?

nemequ commented 9 years ago

Possibly related: a finish argument for inflate?

coderforlife commented 9 years ago

ms_inflate has a finish argument...

coderforlife commented 9 years ago

ms_uncompressed_size: I am unsure how I want to handle that. Is it actually really useful? I mean in some cases you have to go through the entire decompression sequence (just not writing bytes) to determine the uncompressed size. None of these formats actually include the uncompressed size directly in the data. Some have some "shortcuts" I can take to calculate (for example, with LZNT1 I can count the number of chunks there are and multiply that by the chunk size, and only have to decompress the last chunk).
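A rough sketch of the LZNT1 shortcut described above, assuming the chunk header layout as documented for LZNT1 (a 16-bit little-endian value whose low 12 bits hold the chunk length minus 3, with a signature of 3 in bits 12 to 14, and a zero header marking the end). The function name is hypothetical and not part of ms-compress, and the bound only holds if every chunk except the last expands to a full 4096 bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define LZNT1_CHUNK_SIZE 4096u  /* maximum uncompressed bytes per chunk */

/* Hypothetical helper, not part of ms-compress.
 * Walks the chunk headers only and returns an upper bound on the
 * uncompressed size (number of chunks * 4096). The offset of the last
 * chunk header is stored in *last_chunk so a caller could decompress just
 * that chunk to turn the bound into an exact size.
 * Returns (size_t)-1 on a malformed or truncated chunk. */
size_t lznt1_size_upper_bound(const uint8_t *in, size_t in_len,
                              size_t *last_chunk)
{
    size_t pos = 0, chunks = 0;
    *last_chunk = 0;
    while (pos + 2 <= in_len) {
        uint16_t hdr = (uint16_t)(in[pos] | (in[pos + 1] << 8));
        if (hdr == 0)
            break;                                  /* end-of-buffer marker */
        if (((hdr >> 12) & 0x7) != 3)
            return (size_t)-1;                      /* unexpected signature */
        size_t chunk_len = (size_t)(hdr & 0x0FFF) + 3;  /* includes header */
        if (chunk_len > in_len - pos)
            return (size_t)-1;                      /* truncated chunk */
        *last_chunk = pos;
        pos += chunk_len;
        chunks++;
    }
    return chunks * LZNT1_CHUNK_SIZE;
}
```

The scan touches two bytes per chunk, so the cost is linear in the number of chunks rather than in the amount of data. The exact size would be the bound minus 4096 plus whatever the chunk at `last_chunk` actually expands to.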

coderforlife commented 9 years ago

I have also wondered about adding stream-duplication functions (like zlib's deflateCopy / inflateCopy) or any other functions that would be useful?

nemequ commented 9 years ago

> ms_inflate has a finish argument...

Sorry, I wasn't clear there. The fact that it has one is a bit weird. zlib, bzip2, lzham, etc. don't have anything like that. xz-utils has something vaguely similar, but that's only because it uses the same function for compression and decompression; AFAIK you're only ever supposed to call it with one value (LZMA_RUN) for decompression.

I can work with it in Squash (I actually require consumers to call something like that in order to emulate streaming for codecs which don't support it). It's just a bit odd, and I thought you might want to get rid of it unless you're using it for something.

> ms_uncompressed_size: I am unsure how I want to handle that. Is it actually really useful?

Yes. It lets you know how large of a buffer you need to allocate for decompression if you didn't store the uncompressed size out of band. Otherwise you just have to guess—IIRC Squash starts with next_power_of_two(compressed_size)<<2 when it has to do that, but for the most part I just punt on the issue.
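A minimal sketch of that guess-and-grow approach, assuming a decoder that can report "output buffer too small". The `decode_fn` callback type, `decode_with_guess`, and the toy run-length decoder are all hypothetical stand-ins for illustration; none of them are Squash or ms-compress APIs.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Smallest power of two >= n (returns 1 for n == 0). */
size_t next_power_of_two(size_t n)
{
    size_t p = 1;
    while (p < n && p <= SIZE_MAX / 2)
        p <<= 1;
    return p;
}

/* Hypothetical callback: returns 0 on success (and sets *out_len),
 * 1 if the output buffer was too small, anything else on a hard error. */
typedef int (*decode_fn)(const uint8_t *in, size_t in_len,
                         uint8_t *out, size_t out_cap, size_t *out_len);

/* Start from next_power_of_two(in_len) << 2 and double the buffer until
 * the decoder succeeds. Returns a malloc'd buffer (caller frees) or NULL.
 * The initial shift is not overflow-checked; fine for a sketch. */
uint8_t *decode_with_guess(decode_fn decode, const uint8_t *in, size_t in_len,
                           size_t *out_len)
{
    size_t cap = next_power_of_two(in_len) << 2;
    uint8_t *out = NULL;
    for (;;) {
        uint8_t *tmp = realloc(out, cap);
        if (!tmp) { free(out); return NULL; }
        out = tmp;
        int rc = decode(in, in_len, out, cap, out_len);
        if (rc == 0)
            return out;
        if (rc != 1 || cap > SIZE_MAX / 2) { free(out); return NULL; }
        cap <<= 1;                       /* too small: double and retry */
    }
}

/* Stand-in decoder for the demo: input is a list of (count, byte) pairs. */
int toy_rle_decode(const uint8_t *in, size_t in_len,
                   uint8_t *out, size_t out_cap, size_t *out_len)
{
    size_t n = 0;
    if (in_len % 2)
        return 2;                        /* malformed input */
    for (size_t i = 0; i < in_len; i += 2) {
        if (in[i] > out_cap - n)
            return 1;                    /* needs a bigger buffer */
        memset(out + n, in[i + 1], in[i]);
        n += in[i];
    }
    *out_len = n;
    return 0;
}
```

Every failed attempt throws away the work done so far, which is the cost of not storing the uncompressed size out of band.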

> None of these formats actually include the uncompressed size directly in the data.

Yeah, I wouldn't do it, then. If a user sees a function like that they will probably assume that it is constant time (and fast) and use it. IMHO it would be better to make them think about storing that information out of band.

If you really want to have something for this then maybe a try_decompress function that will return the number of bytes needed in the event decompression fails, but TBH I don't think it's worth it.

> I have also wondered about adding stream-duplication functions (like zlib's deflateCopy / inflateCopy) or any other functions that would be useful?

I'm not really sure what the use case is for stream duplication functions.

As for other stuff which could be useful, I guess you're thinking about a FILE*-like API? I'm working on something like that for Squash. Maybe I'm biased, but my view is that the actual compression/decompression libraries should be pretty minimal and people can build wrappers on top of them if they need to.

coderforlife commented 9 years ago

> Sorry, I wasn't clear there. The fact that it has one is a bit weird. zlib, bzip2, lzham, etc. don't have anything like that. xz-utils has something vaguely similar, but that's only because it uses the same function for compression and decompression; AFAIK you're only ever supposed to call it with one value (LZMA_RUN) for decompression.
>
> I can work with it in Squash (I actually require consumers to call something like that in order to emulate streaming for codecs which don't support it). It's just a bit odd, and I thought you might want to get rid of it unless you're using it for something.

One easy way to use it as it is right now is to call ms_inflate(stream, true) with in_avail = 0 and out_avail set to something large. The problem I had is that, with the way LZNT1 and Xpress work, the only way to know you have reached the end of the compressed data is that there is no more compressed data to read.

However thinking about it now and looking through my code, setting finish = true only ends up activating error conditions which could be deferred to the calling of inflate_end. I will update this but give me some time.

> If you really want to have something for this then maybe a try_decompress function that will return the number of bytes needed in the event decompression fails, but TBH I don't think it's worth it.

Internally I call it "decompression dry-run". But yeah, you now understand why I don't think it should be in mscomp because I should probably get rid of it from some of the individual ones. The LZNT1 is almost constant time, just a very large constant (plus a very small linear component to scan the header of each chunk). The xpress is heavily linear. Xpress_huffman could be made the same way LZNT1 is (nearly very-large constant).
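To make the "dry-run" idea concrete, here is a hedged sketch that counts how many bytes one compressed LZNT1 chunk would expand to without writing any output. It assumes the LZNT1 token layout as it is commonly documented (a flag byte followed by up to eight items, each either a literal byte or a 16-bit match token whose offset/length split depends on how many bytes have been produced so far in the chunk). The function name is hypothetical; this is not the code in ms-compress.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dry-run, not the ms-compress implementation.
 * `in` points at the body of one *compressed* LZNT1 chunk (the bytes after
 * its 2-byte header) and `in_len` is the body length. Returns the number of
 * bytes the chunk would decompress to, or (size_t)-1 on malformed data. */
size_t lznt1_chunk_dry_run(const uint8_t *in, size_t in_len)
{
    size_t pos = 0, out = 0;
    while (pos < in_len) {
        uint8_t flags = in[pos++];
        for (int bit = 0; bit < 8 && pos < in_len; bit++, flags >>= 1) {
            if (!(flags & 1)) {              /* literal: one byte in, one out */
                pos++;
                out++;
                continue;
            }
            if (out == 0 || in_len - pos < 2)
                return (size_t)-1;           /* match with no history / truncated */
            uint16_t tok = (uint16_t)(in[pos] | (in[pos + 1] << 8));
            pos += 2;
            /* Offset bits grow (and length bits shrink) as output grows. */
            unsigned len_mask = 0x0FFF, off_shift = 12;
            for (size_t i = out - 1; i >= 0x10; i >>= 1) {
                len_mask >>= 1;
                off_shift--;
            }
            size_t offset = (size_t)(tok >> off_shift) + 1;
            if (offset > out)
                return (size_t)-1;           /* points before the chunk start */
            out += (size_t)(tok & len_mask) + 3;
        }
    }
    return out > 4096 ? (size_t)-1 : out;    /* a chunk holds at most 4096 bytes */
}
```

Because a chunk never expands past 4096 bytes, running this on only the final chunk is bounded work, which is where the "very large constant" comes from. A format without that chunking would need the same kind of walk over the whole input.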

> I'm not really sure what the use case is for stream duplication functions.

The use case I have had for it was a 3D image format (2D images stacked on top of each other) where each image was just laid out linearly in the data and the entire data was compressed. I would initially go through the whole file and, whenever a new image started, duplicate the decompressor state and save it. This allowed me to go back to a particular image without decompressing all the images before it, just by restoring the decompressor state (it is a little more complicated than that, but you get the idea). You might ask why I didn't just decompress the whole thing. Well, it was a terapixel image...
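The checkpointing idea can be sketched with a toy decoder. Everything below is a hypothetical stand-in: `toy_stream` is a trivial run-length decoder whose state can be duplicated by plain struct assignment, whereas a real decompressor would need a dedicated copy function (zlib's inflateCopy is the obvious analogue).

```c
#include <stddef.h>
#include <stdint.h>

/* Toy streaming decoder over (count, byte) pairs; a stand-in for a real
 * decompressor state. */
typedef struct {
    const uint8_t *in;   /* whole compressed buffer */
    size_t in_len;
    size_t in_pos;       /* offset of the next pair to read */
    uint8_t run_left;    /* bytes still owed from the current run */
    uint8_t run_byte;
} toy_stream;

/* Produce up to n bytes into out; returns the number of bytes produced. */
size_t toy_read(toy_stream *s, uint8_t *out, size_t n)
{
    size_t done = 0;
    while (done < n) {
        if (s->run_left == 0) {
            if (s->in_pos + 2 > s->in_len)
                break;                       /* end of input */
            s->run_left = s->in[s->in_pos];
            s->run_byte = s->in[s->in_pos + 1];
            s->in_pos += 2;
            continue;
        }
        out[done++] = s->run_byte;
        s->run_left--;
    }
    return done;
}

/* Walk the stream once, saving a copy of the decoder state at the start of
 * every complete image. Returns the number of checkpoints saved. */
size_t build_index(toy_stream *s, size_t image_size,
                   toy_stream *checkpoints, size_t max_checkpoints)
{
    uint8_t scratch[256];
    size_t count = 0;
    if (image_size == 0)
        return 0;
    while (count < max_checkpoints) {
        toy_stream saved = *s;               /* duplicate the state */
        size_t left = image_size;
        while (left > 0) {
            size_t want = left < sizeof scratch ? left : sizeof scratch;
            size_t got = toy_read(s, scratch, want);
            if (got == 0)
                return count;                /* stream ended */
            left -= got;
        }
        checkpoints[count++] = saved;
    }
    return count;
}
```

To jump to image k, copy `checkpoints[k]` into a fresh `toy_stream` and call `toy_read` for one image's worth of bytes. The first pass still decodes everything once, but later reads only decode the image that is wanted.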

> As for other stuff which could be useful, I guess you're thinking about a FILE*-like API? I'm working on something like that for Squash. Maybe I'm biased, but my view is that the actual compression/decompression libraries should be pretty minimal and people can build wrappers on top of them if they need to.

I was actually wondering about all the different ones in zlib (like prime, reset, etc.). When I was choosing a streaming API to use, I was considering going with a FILE*-like API (you basically give it a read and a write function, plus an extra parameter to pass through (such as a FILE*), and it calls read and write as necessary). But I decided to go with the zlib style.

nemequ commented 9 years ago

> However thinking about it now and looking through my code, setting finish = true only ends up activating error conditions which could be deferred to the calling of inflate_end. I will update this but give me some time.

It works for me as-is—I've already pushed the code, so from my point of view there is no reason to hurry.

> The use case I have had for it was a 3D image format (2D images stacked on top of each other) where each image was just laid out linearly in the data and the entire data was compressed. I would initially go through the whole file and, whenever a new image started, duplicate the decompressor state and save it. This allowed me to go back to a particular image without decompressing all the images before it, just by restoring the decompressor state (it is a little more complicated than that, but you get the idea). You might ask why I didn't just decompress the whole thing. Well, it was a terapixel image...

Clever solution, I like it.

> I was actually wondering about all the different ones in zlib (like prime, reset, etc.).

Something like get/set dictionary could be useful, especially when you want to compress fairly similar buffers individually (like rows in a database). It could even be used for 3D images pretty easily by doing something like occasionally using a 2D slice as a keyframe, then having the following slices share that dictionary until the next keyframe is reached. That could give you random access, with the worst case being that you have to decode the keyframe and the slice you want. Of course I'm sure people have created much better 3D image formats…

> When I was choosing a streaming API to use, I was considering going with a FILE*-like API (you basically give it a read and a write function, plus an extra parameter to pass through (such as a FILE*), and it calls read and write as necessary). But I decided to go with the zlib style.

Thank you for that. zlib-style is great because you can layer anything you want on top of it easily. There are lots of libraries that expose a read/write-callback API like the one you considered (ZPAQ, libzling, and CRUSH come to mind), and I only recently added a decent way to work with it in Squash, but it's a huge hack. Basically, I do the processing in a thread, and I have a yield function which waits on a condition. When the user supplies room in the input or output buffer, the main thread signals the condition (waking the processing thread) and then waits on the processing thread, which processes the data until it runs out of input or output, then waits on that condition until it gets some more. It's doable, but unpleasant.

coderforlife commented 9 years ago

> Something like get/set dictionary could be useful, especially when you want to compress fairly similar buffers individually (like rows in a database).

Well, this isn't supported by the formats that I have mainly dealt with yet. LZX definitely has this and Xpress Huffman may be able to use something like this for its encoding alphabet.

Additionally, as I am changing "finish", I am looking at the fact that LZNT1 compression is supposedly able to support flushing now (I found a document on MSDN where they actually talk about it!). So LZNT1 and Xpress Huffman should support flushing (and maybe LZX). I can change 'finish' to flush in deflate.

So I am looking at the different types of "flushing" that zlib does and thinking about what I could technically do. Please tell me if these would be useful. What levels of flushing does Squash support for inflate/deflate? What are their general meanings?

nemequ commented 9 years ago

http://www.bolet.org/~pornin/deflate-flush.html is a good overview of what zlib supports. Squash only uses Z_SYNC_FLUSH, but a full flush would work, and maybe even partial (I haven't thought about that, though, since partial flushes in zlib are basically deprecated).

From Squash's point of view, flush means that all the input you have provided to the compressor will be available in the decompressed data as long as the decompressor has received all the compressed data up to that point.

AFAIK the main user of flushing is networking applications, where it is pretty critical (otherwise latency would be absurd, especially for algorithms with large block sizes). I have been thinking about writing a compression proxy server as a demo of Squash (think stunnel, but compression instead of SSL), and only algorithms capable of flushing would be supported. Right now Squash only supports flushing for zlib, bzip2, lzma/lzma2/xz, lzham, and snappy-framed, because those are the only implementations which support it.

coderforlife commented 9 years ago

The items brought up here have been committed, including adding flushing support for compression. LZNT1 currently completely supports this (by outputting valid but not-full chunks). Still working on Xpress, but it will support it (it will cause buffering due to half_byte waits to be cleared, but I don't think I can clear buffering due to out_flags holds... I will have to think on this).