Weird issues with using the streaming API vs `ZSTD_compress()`

KrzysFR commented 8 years ago

I trying out the streaming API to write the equivalent of .NET's GZipStream (source), and I'm seeing some strange things.

I'm using 0.6.1 for the tests, though the API seems to be unchanged in 0.7 at the moment.

A stream works by having an internal buffer of 128 KB (131,072 bytes exactly). Each call to Write(..) appends any number of bytes to the buffer (could be called with 1 byte, could be called with 1 GB). Everytime the buffer is full, its content is compressed via ZSTD_compressContinue() on an empty destination buffer, and the result is copied into another stream down the line. When the producer is finished writing, it will Close the stream which will compress any pending data in its internal buffer (so anywhere between 1 and 131,071 bytes), call zstd_compress_end, and flush the final bytes to the stream.

Seen from zstd, the pattern looks like:

ZSTD_compressBegin()
ZSTD_compressContinue() 131,072 bytes
ZSTD_compressContinue() 131,072 bytes
...
ZSTD_compressContinue() 123 bytes (last chunk will always be < 128KB)
ZSTD_compressEnd()

I'm comparing the final result, with calling ZSTD_compress() on the complete content of the input stream (ie: storing everything written into a memory buffer, and compress that in one step).

Issue 1: ZSTD_compress() adds an extra empty frame at the start

Looking at the compressed result, I see that usually a single call to ZSTD_compress() adds 6 bytes to the input.

The left side is the compressed output of ZSTD_compress() on the whole file. The right side is the result of streaming with chunks of 128 KB on the same data:

Left size: 23,350 bytes Right size: 23,344 bytes

The green part is identical between both files, only 7 bytes differ right after the header, and before the first compressed frame.

Both results, when passed to ZSTD_decompress() return the same input text with no issues.

Issue 2: N calls to ZSTD_compressContinue() produce N time the size of a single call to ZSTD_compress() on highly compressible data

While testing with some text document, duplicated a bunch of time to get to to about 300KB (ie: the same 2 or 3 KB of text repeated about 100 times), I'm getting something strange

The result of calling zstd_compress on the whole 300KB returns a single 2.5 KB output.
The result of streaming using 3 calls to ZSTD_compressContinue() produces 7.5 KB output (3 times larger).

Looking more closely: each call to ZSTD_compressContinue() returns 2.5 KB (first two calls with 128KB worth of text, third call with only 50 KB), which is too exact to be a coincidence.

Since the dataset is the equivalent of "ABCABCABC..." a hundred times, I'm guessing that compressing 25%, 50% or 100% of it would produce the same output, which would look something like "repeat 'ABC' 100 times" vs "repeat 'ABC' 200 times".

Only, when compressing 25% at a time, you get 4 times as many calls to ZSTD_compressContinue(), which will give you 4 times the output. Compressing 12.5% at a time would probably yield 8 times the output.

When changing the internal buffer size from 128KB down to 16KB, I get a result of 45 KiB, which is about 6x times more than before.

Fudging the input data to get lower compression ratio makes this effect disappear progressively, until a point where the result of the streaming API is about the same as a single compression call (except the weird extra 6 bytes in the previous issue).

Cyan4973 commented 8 years ago

Issue 1 : The difference is into the header : ZSTD_compress() has an advantage : it knows the source size to compress. By default, this information will be stored into the frame header. The size needed to store this field is variable. In version v0.6.x it will be 1, 2 or 8 bytes, depending on srcSize.

The bufferless streaming mode doesn't have such a chance, so it's unable to store this field. In some cases it may also have an impact on compression ratio, memory allocation and speed, because without such information, it will have to target "a very large input", even if, in the end, it turns out data to compress is small.

In your sample, the difference in header size is 8 bytes, so it corresponds pretty well to a srcSize > 64 KB.

Both behaviors can be modified using the more complex _advanced() variants, which offers direct control over parameters, including which field will be present or not. It's also possible to tell to the streaming version how many bytes are going to be compressed.

Issue 2 : This one is more complex. I will probably need to have a look into some sample.

As a first hint, keep in mind that each call to ZSTD_compress_continue() creates a fully formed block (or several if srcSize > blockSize == 128 KB (in general) ). So it's not the same as a single invocation of ZSTD_compress(). At the very least, there will be more headers. 2.5 KB seems nonetheless a bit too large for headers, which total budget should rather be in the ~200 bytes range.

My guess is that for some reason the next call to ZSTD_compress_continue() is unable to find or make reference to previous data in previous block. Hence it has to repeat it. Hence each block starts with a description of the sentence to be repeated N times.

There are several possible reasons that could explain this effect. This will depend a lot on the parameters used during initialization, and a bit on the sample file itself.

I'm afraid a better investigation will require more details...

KrzysFR commented 8 years ago

Ok for Issue 1, this makes sense for stream to not emit the final input size.

For issue 2, take the lyrics of your favorite songs, dump them as UTF-8, and repeat the buffer until you reach ~300 KB, that's how I constructed the input.

The stream always compress exactly 128 KB chunks, except the last chunk of course which will be smaller.

One idea: the stream reuse the same 128 KB buffer that will be the input passed to ZSTD_compressContinue() which will only see that part. On the other hand ZSTD_compress() has access to the whole data. I've seen code in zstd_compress.c which attempts to detect if the next src pointer is contiguous with the previous one:

see https://github.com/Cyan4973/zstd/blob/9dd12742f3992dc2f97f0a544e218633253fc82c/lib/compress/zstd_compress.c#L2203

    /* Check if blocks follow each other */
    if (src != zc->nextSrc) {
        /* not contiguous */
        size_t const delta = zc->nextSrc - ip;
        zc->lowLimit = zc->dictLimit;
        zc->dictLimit = (U32)(zc->nextSrc - zc->base);
        zc->dictBase = zc->base;
        zc->base -= delta;
        zc->nextToUpdate = zc->dictLimit;
        if (zc->dictLimit - zc->lowLimit < 8) zc->lowLimit = zc->dictLimit;   /* too small extDict */
    }

Maybe if successive input buffers are contiguous in memory, you are allowed to see into the previous one?

This could be troublesome, because then it means that the compression ratio depends not only on the input data, but also on the implementation details of the application (size of buffers selected, etc...)

Cyan4973 commented 8 years ago

Maybe if succesive input buffers are contiguous in memory, you are allowed to see into the previous one?

Yes. And to be more precise, the stream is allowed to see all previous contiguous blocks + one detached segment.

This is supposed to work well in many "obvious" scenarios, such as a double-buffer, a round buffer, or a dictionary compression.

Now, if you re-use exactly the same input buffer consecutively, ZSTD_compress_continue() will presume srcBuffer has been overwritten by new data, and flush that information from its memory. Hence it would restart from zero.

Indeed, implementation details are important to understand what happens in your scenario.

PS : and one last thing, ZBUFF is an interface which avoids such implementation dependency. If you want the behavior to be consistent whatever the segments size and position, you'll be well served by ZBUFF.

KrzysFR commented 8 years ago

Question regarding issue 1:

How can I easily access the original input size, when created by the non-streaming API (ie: ZSTD_compress())? I've previously had to store the original size out of band, but if it is in the compressed output itself (and uses up to 6 bytes already) it would be nice to have a way to read it, so that I can pre-allocate the buffer and then call ZSTD_compress().

KrzysFR commented 8 years ago

Question regarding issue 2:

I'm wondering if there would be a potential issue when doing pooling of buffers used for compression:

Note: All my buffers have size 128 KB

Let's say I used a pooled buffer at address PTR to compress some input on a thread T1. Then this buffer goes back into the pool, and another thread T2 gets the same buffer at PTR from the pool, to compress another stream of data.

Now, thread T1 is scheduled again to compress another part of the stream, but this time obtain from the pool a different buffer at address PTR+128KB (ie: right where nextSrc would be in the compression context), while T2 is still not done with the buffer at PTR. If I now call ZSTD_compressContinue() from T1, wouldn't it think that the new buffer is the continuation of the previous buffer (which now contains input from T2) and incorrectly uses that data for matching?

I can see a slab allocators grabbing slabs of 1 MB or more from the OS, and carving chunks of exactly 128 KB from it, to fill the buffer pool, meaning we could end up with contiguous buffers that would be served to various threads in any order.

Other allocators which add some header between chunks would probably not encounter this problem (since the nextSrc would not exactly line up with the next buffer because of the headers, or padding)

Cyan4973 commented 8 years ago

Question 1 :

How can I easily access the original input size, when created by the non-streaming API ?

See ZSTD_getFrameParams() . Look for frameContentSize field. If == 0, it means "unknown"

Cyan4973 commented 8 years ago

Question 2 :

... is more complex to answer, and I'm not sure to understand all details.

If you want to compress blocks independently, in any order, prefer ZSTD_compress() instead. No assumption is done about previous data, which is not used.

ZSTD_compressContinue() is only useful if one wants to use "prior blocks" to better compress current block. However, in such case, block order is important, and will have to be reproduced by the decoder. That's why I don't understand the concept of "2 threads sharing a pool of blocks" attributed in any order. It would only make sense in association with ZSTD_compress().

Moreover, ZSTD_compressContinue() presume data from previous blocks is still accessible and unmodified. If this condition is not respected, it will result in data corruption.

One way to be free of such issue is to use ZBUFF interface instead, which makes no such assumption, and guarantee to use previous data properly, whatever its position and size, even if it has been modified or freed. The price to pay is an intermediate buffer within ZBUFF context, which is transparently managed.

KrzysFR commented 8 years ago

I have a Stream-like object, which has its own context, and uses a buffer pool to get 128-KB. Each thread can create an instance of that stream, and Write(..) as much data as it wants (1 byte, 1 GB). Internally, each stream fills a 128 KB buffer. When the buffer is full, it gets compressed with a call to ZSTD_compressContinue() and emitted to the output, until the caller has finished sending all the data (when ZSTD_compressEnd() is called).

The contract of this stream is that the compressed output should be as close as possible as compressing the whole data in one step. Except that with streams you can use smaller buffers. This is especially useful when serializing JSON documents, straight to compressed bytes.

If two threads have each their own Stream instance, but everyone share the same buffer pool, it is possible that allocated buffers "move around" between instance in such a way as described above.

I'll try to explain better with a chart below.

KrzysFR commented 8 years ago

Cyan4973 commented 8 years ago

Corruption ?

Yes. ZSTD_compressContinue() presume data from previous blocks is still accessible and unmodified. If this condition is not respected, it will result in data corruption.

KrzysFR commented 8 years ago

Ah, I just realized that when you said to use ZBUFF_, I was reading ZHUFF_ and was wondering why I just use an Huffman compressor instead of ZStd....

Cyan4973 commented 8 years ago

Indeed.

It's btw an open question if we will keep ZBUFF_ interface separated or not. Initially, it was created as a user-layer of ZSTD_, hence the separation.

But I presume many users will prefer to have all functionalities exposed in a single zstd.h interface, and they risk missing the content of zbuff.h. In which case, this separation is not a good idea.

But if both interfaces are merged, keeping the ZBUFF_ prefix would look weird. So it should probably be changed into ZSTD_.

But then, how to make a clean distinction between the synchronous+bufferless streaming interface, and the buffered one ?

Quite a lot of little annoying questions, and as long as there is no clear&clean transition plan, the current situation (separated zstd.h and zbuff.h) keep on going...

KrzysFR commented 8 years ago

I think the current API is fine (to me at least). The only gotcha with ZSTD_compressContinue() is that it is not clear that it automatically infer from pointers being contiguous between calls as meaning 'previous data still exists in previous buffer'.

Maybe this should be opt-in: if you can guarantee that the old buffer is still around and untouched, then get better compression. If not, must reuse the same buffer but get lower compression ratio.

But then again: if you have to keep the old 128KB buffer immediately before your current 128KB buffer, why not allocate a single 256 KB buffer? I'm not sure to understand then the gain from using a streaming API (where you can reuse the same small buffer).

I'm not sure how you'd use this to do double buffering (for input) either: when you are compressing the second block, the first block could be in the process of being partially overwritten by the next fread call... ?

KrzysFR commented 8 years ago

Last thoughts on this: if ZBUFF can internally store the previous data in an internal buffer inside the context, this could be an option that can be surfaced at the top level API and would solve the compression ratio issue (at the cost of memcpy) for Stream-like abstractions.

The Stream instance has its own buffer (used to chunk the incoming input in nice 128-KB blocs), and the ZSTD context keeps what it needs so that the next call (reusing the stream buffer) would get the best compression possible.

Cyan4973 commented 8 years ago

ZSTD_compressContinue() is allowed to see and use all previous contiguous blocks + one detached segment (which can consist of multiple contiguous blocks).

If ZSTD_compressContinue() doesn't use previous blocks, then its usefulness is very limited, and ZSTD_compress() should be preferred.

I can better document it in zstd.h. But unfortunately I believe that's all I can do. I'm aware this interface is difficult, that's why it's exposed into the "Advanced/Experimental" section.

if you have to keep the old 128KB buffer immediately before your current 128KB buffer, why not allocate a single 256 KB buffer?

In double-buffer mode, this is in general the expectation : a single buffer of 256 KB, which is used alternatively by slice of 128 KB. When compressing a slice, the compressor can use the previous slice to improve compression ratio. But technically, slices don't need to be contiguous. 2 separated buffers of 128 KB each would work the same.

I'm not sure how you'd use this to do double buffering (for input) either: when your are compressing the second block, the first block could be in the process of being partially overwritten by the next fread call... ?

The above scenario is for a single thread, hence doing alternatively compression and i/o stuff. In a multi-threading scenario, it doesn't work.

To remain similar to above scenario (double buffer, 2x 128 KB), my recommendation would be to use 3 separated buffers of 128 KB each, "current", "previous" and "next". "Current" is being compressed, "previous" is used during compression and should not be modified. "Next" is being loaded in parallel. Switch roles when appropriate.

Of course, this is just one of many possible constructions.

if ZBUFF can internally store the previous data in an internal buffer, this could be an option that can be surfaced at the top level API and would solve the compression ratio issue (at the cost of memcpy) for Stream-like abstractions.

This is exactly what happens within ZBUFF. It can be loaded with any amount of data. ZBUFF store that into its own internal buffer, and wait for a block to be complete to start compression (or a flush() command alternatively).

ZBUFF can produce the same result as a single call to ZSTD_compress() (save the header), independently of the size and position of successive inputs.

The behavior of ZBUFF is much easier to understand. That's why it should be preferred in most streaming scenarios.

The more complex "raw streaming interface" is left for more advanced usages which ZBUFF cannot serve appropriately.

KrzysFR commented 8 years ago

I've taken a look at the API in zbuff.h and it looks indeed better targeted for Stream-like API, at least the way Streams work in .NET (and probably Java as well?)

I now see that in zstd.h, even though ZSTD_compressContinue() is under the Streaming functions header (source of my confusion), there is the mention (direct mode - synchronous and buffer-less) which - in hindsight - explains what is going on.

The ZBUFF_compressXXX API looks very similar to ZSTD_compressXXX, though if I understand correctly, it will not output the initial ZSTD_MAGICNUMBER header to the destination buffer?

So if I want to produce something that can be decompressed by ZSTD_decompress() transparently, I should:

Allocate my 128 KB internal input buffer and 128 KB+epsilon internal output buffer
Call ZBUFF_createCCtx(...)
Emit ZSTD_MAGICNUMBER to the output buffer
Call ZBUFF_compressInit[Dictionary](...)
Call ZBUFF_compressContinue(..block of 128 KB...) repeatedly as I fill my input buffer
- every time the internal output buffer is close to full, flush it to the destination
Call ZBUFF_compressContinue(..remainder < 128 KB...) with the last input block
Call ZBUFF_compressEnd(...)
- Flush any remaining data in the output buffer to the destination

A few remarks:

Note that these are coming from someone interacting with the API from a managed language. Not sure if they would apply to other contexts

ZSTD_MAGICNUMBER is a constant in zstd.h which is not accessible from languages that consume the .dll and not the .h. I will need to put that constant in my code, and always remember to synchronize it. Unless I can call a method that writes that for me in the dest buffer?
Are there also any footer bytes that need to be added after the frame is closed by ZSTD_compressEnd() ?
Would it be possible to add the total input size as the LAST part of the compressed result? (instead of at the start like ZSTD_compress() does?). The Stream instance can track the total input size, and if the compressed output is stored on disk, it should be doable to parse the last N bytes of the data to extract the original size, and know how much to allocate when decompressing? Would the current ZSTD_decompress() silently ignore this extra bit of information, or complain?
Allocating a 128 KB + 3 + 3 output buffer is not really practical when using buffer pools that use power-of-two sizes, because the next size after 128 KB will be 256 KB (a lot of waste for just 6 extra bytes). I suppose that the only case where it would use those 6 extra bytes, is if the block is not compressible, and stored as-is with the additional frame headers? Would it be possible for ZBUFF_compressContinue() to return 0, and have the caller output the raw block manually ? We would also need to be able to call a method to format a block header. That way, we would only need two 128 KB input buffers (which could come from the same buffer pool).
There is one mention in the comments of zbuff.h about ZBUFF_compressContinue() that I don't fully understand the repercussions:

Note that it may not consume the entire input, in which case it's up to the caller to present again remaining data.

When I pass full 128 KB input buffers (with an sufficiently large output buffer), in what case can this happen? Currently, the logic is "produce 128 KB buffers" -> "compress them with a call to Zxxx_compressContinue()" -> "copy the result to the destination" -> "clear input buffer".

If ZBUFF_compressContinue() can not fully consume the input, does this mean that I can simply call it again with the tail of the input buffer? Or should I wait until I fill another 128 KB worth of data?

Cyan4973 commented 8 years ago

Lots of legitimate questions here, so I will try to answer them all :

The ZBUFF_compressXXX API looks very similar to ZSTD_compressXXX, though if I understand correctly, it will not output the initial ZSTD_MAGICNUMBER header to the destination buffer?

ZBUFF API will produce fully compatible zstd frames. The result of its compression can be decoded by ZSTD_decompress(). And inversely, ZBUFF decompression API can decode frames produced by ZSTD_compress().

It's not necessary to manually add any magic number. In fact, it would break the frame.

ZSTD_MAGICNUMBER is a constant in zstd.h which is not accessible from languages that consume the .dll and not the .h. I will need to put that constant in my code, and always remember to synchronize it. Unless I can call a method that writes that for me in the dest buffer?

For normal uses, there should be no need for ZSTD_MAGICNUMBER. It's an internal identifier. Only specific advanced usages may have a use for it, hence why it's exposed into zstd.h (advanced section, static linking only).

Are there also any footer bytes that need to be added after the frame is closed by ZSTD_compressEnd() ?

ZSTD_compressEnd() purpose is to add the footer bytes. So there is nothing to do manually after that. Similarly, ZBUFF_compressEnd() purpose is also to flush the internal buffer and add the footer bytes. But here, don't forget to check if the internal buffer was fully flushed, otherwise the footer bytes might still be there inside.

Would it be possible to add the total input size as the LAST part of the compressed result?

Nope. The current compression format is not compatible with this construction.

More importantly, it would make little sense. The main purpose of "frame content size" is to give the receiver enough information to allocate the exact amount of memory required. Since allocation must be done at the beginning, before even starting decompression, if the receiver has to reach the end of the compressed frame to get that information, it's too late.

If the compressed output is stored on disk, it should be doable to parse the last N bytes of the data to extract the original size, and know how much to allocate when decompressing?

It presumes that the stream is seekable, which is a non-guaranteed property. It could work, but only for File systems (with a non-negligible seek performance cost). Any other type of stream will not be able to "seek to the end".

In such situation, the frame provides a "windowsize", which is the minimum size required to create a round buffer which guarantees decompression of the stream, whatever its real content size.

The reasoning is : if the object to compress is small, ZSTD_compress() is probably a better choice. If the object to compress is so large that its size is not known, then it will probably be too large for the decoder too.

Lastly, if the size of the object to compress is known, it's also possible to use ZBUFF_compress_advanced() to provide this information, which will then be in the header. If this alternative is not practical enough, please tell me.

Allocating a 128 KB + 3 + 3 output buffer is not really practical

You don't need to. ZBUFF API accepts any buffer size. In contrast with ZSTD "raw streaming" API, it will not fail, and just carry on.

So if you allocate only 128 KB for the output buffer, and if by any chance the flush operation requires more than that, ZBUFF will simply output 128 KB, and keep the rest internally. It will output the remaining part on next call.

There is a downside though : since ZBUFF cannot guarantee that the output will necessarily fit into destination buffer, it must compress into its own internal buffer, and then flush. If the size was large enough to begin with, it would have written directly into the output buffer, saving a flush.

There is one mention in the comments of zbuff.h about ZBUFF_compressContinue() that I don't fully understand the repercussions:

Note that it may not consume the entire input, in which case it's up to the caller to present again remaining data.

Let's imagine a silly scenario : the caller asks to compress 1 MB into a 1 byte destination buffer.

ZSTD_compressContinue() would just fail. But not ZBUFF.

It will start by loading as much as it can (128 KB), compress it, and then flush the result into destination buffer. At this stage, it will see that it can only flush 1 byte, and keep the rest into its own internal buffer.

The result of ZSTD_compressContinue() will be : "I've read 128 KB, and written 1 byte".

The caller must check these return values. In this example, it means that the 1 MB is not fully consumed. So on next invocation of ZSTD_compressContinue(), he must present remaining data again, starting from 128 KB position.

When I pass full 128 KB input buffers (with an sufficiently large output buffer), in what case can this happen?

If you always provide sufficiently large output buffer, this case will never happen.

But remember, in a previous question, you said that "allocating 128 KB + 3 + 3 is not really practical". It's okay, you don't have to, 128 KB will be more than enough in most circumstances. But in doing so, it creates a limit condition, where ZBUFF would not be able to flush the entire output buffer. It's not a problem : ZBUFF will use next opportunity to flush the remaining part. But now always check the return values, as maybe at some point, ZBUFF won't be able to consume the input because it's already too busy trying to flush what's left inside its own buffer.

KrzysFR commented 8 years ago

Thanks for all these answers, it helps a lot.

ZBUFF API will produce fully compatible zstd frames. The result of its compression can be decoded by ZSTD_decompress(). And inversely, ZBUFF decompression API can decode frames produced by ZSTD_compress().

That simplifies things a lot!

The reasoning is : if the object to compress is small, ZSTD_compress() is probably a better choice. If the object to compress is so large that its size is not known, then it will probably be too large for the decoder too.

Well yes, and no, it will depend on what "large" means here: In case where the input size cannot be predicted and is usually small, but sometimes "large" up to 10's or 100's of MB (ex: Document Store), I will probably not want to maintain two different code paths, and go straight with a Stream for everyone. And typically for Document Store, that own their storage engine, they will use either page caching or most probably memory mapped files, where the compressed output is still in main memory. Reading the last N bytes (or any random byte) of the input does not cost a lot.

Currently, I have to add my own header on top the ZSTD's framing protocol, by simply padding 4 empty bytes in my dest buffer, compressing the data into PTR+4, and then updating the first 4 bytes with the size. I've added this extra padding to my own API, but this looks awkward.

Although, it is true that I would need to do the same if I was using any other codec (gzip, lz4, ...) and probably also will need to add my own header with the codec used, input size, optional arguments (ex: dictionary id for zstd), before the compressed blob itself.

You don't need to. ZBUFF API accepts any buffer size.

Mmm, so if ZBUFF keeps its own internal buffers, do I really need to also have my own 128 KB buffers on top of it? If I simply used smaller input and output buffers, would I get the same compression ratio? (at the cost of more calls to ZBUFF_compressContinue() or course).

If I could get by only allocating 64 KB buffers from my side (managed memory), and let ZBUFF allocates its own memory (native memory, which does not slow down .NET's GC) this would be ideal, because in .NET, any object larger than 85,000 bytes goes straight to the Large Object Heap, while smaller objects can be GCed more quickly...

Cyan4973 commented 8 years ago

if ZBUFF keeps its own internal buffers, do I really need to also have my own 128 KB buffers on top of it?

Nope.

If I simply used smaller input and output buffers, would I get the same compression ratio?

Yes

If I could get by only allocating 64 KB buffers from my side (..) this would be ideal

Yes, you can

Cyan4973 commented 8 years ago

Currently, I have to add my own header on top the ZSTD's framing protocol, by simply padding 4 empty bytes in my dest buffer, compressing the data into PTR+4, and then updating the first 4 bytes with the size.

Indeed, this information is added at the beginning, it's more useful and accessible there.

Obviously, this construction is also specific to your application. A generic streaming implementation cannot presume a capability to seek to the beginning nor the end of a stream. This restriction is valid in both direction, reading and writing. It also cannot presume that 4 bytes will always be enough to represent the length of its content. Some content can be very large. But your application can.

The closest thing I could propose is a curious construction in which the frameContentSize field is present in the frame header but set to 0 (which means "unknown"). Left as is, it's just a waste of space, but later on, after compression is completed, another function could be called to modify this field directly into the header.

This is weird, as the header is not supposed to be modified independently, and it can introduce errors (what if the inserted size if wrong ?). But the good news is, it would not require any change to the frame format. That means this capability could be created, any time, in the future.

KrzysFR commented 8 years ago

Streams in .NET have the CanSeek property, which tells them that seeking is allowed, and Length will probably return the correct value.

But I guess these are .NET-centric concerns, that could be seen as out of the scope of the ZSTD API. Most applications will probably need to add their own metadata anyway, by way of an additional header, or by storing metadata alongside the compressed blob.

KrzysFR commented 8 years ago

I've updated to 0.7.0, and without changing anything to the code on the test that compress a highly redundant text in 3 blocks (so still using 3 calls to ZSTD_compressContinue, not switched to ZBUFF yet), things have changed for the better it seems.

.NET code untouched, same input, only the version of the zstdlib dll in the folder changed. Both outputs decompress just fine.

With 0.6.0:

Plain Text input size: 308,736
ZSTD_compress() output size: 2,588
Streamed result size : 7,664
- chunk#0: 131,072 -> 2,551
- chunk#1: 131,072 -> 2,555
- chunk#2: 46,592 -> 2,558

With 0.7.0:

Plain Text input size: 308,736
ZSTD_compress() output size: 2,580
Streamed result size : 5,120
- chunk#0: 131,072 -> 2,550
- chunk#1: 131,072 -> 2,554
- chunk#2: 46,592 -> 16

It seams the last chunk was properly optimized in 0.7 compared to 0.6. Any idea why? Is this some random thing due to freak alignment of the 1 byte different in the compressed size?

KrzysFR commented 8 years ago

I've tried with variations of the input data (same text repeated up to 300kb), and the same pattern occurs with the last chunk always being 16 bytes in 0.7.0 so probably not due to chance on the input.

Cyan4973 commented 8 years ago

Did you, by any chance, set windowLog to 17 ?

KrzysFR commented 8 years ago

I did not do anything :)

I'm calling ZSTD_compressBegin(...) with compressionLevel set to 1.

Cyan4973 commented 8 years ago

I'm embarrassed then, as I'm unable to reproduce this issue. Inter-blocks references seem to work fine in my tests.

Is there any code you could share to show the issue ?

KrzysFR commented 8 years ago

If this wasn't clear: this is not an issue, it's just that 0.7.0 does a better job at compressing the last block, compared to 0.6.0, and the result compressed with 0.7.0 decompresses just fine

I was just wondering if this was expected from an optimization somewhere, or freak chance.

Cyan4973 commented 8 years ago

If I understand, your data consists of a single repeated sentence.

So what would be expected is that all blocks after the first one get compressed to 16 bytes.

This depends on windowLog though. But you said you did not influenced it, so using ZSTD_compressBegin(..., 1); should result in a windowLog value of 19, which should be good enough.

I don't know if your blocks are consecutive or separated, nevertheless, with a windowLog of 19, it should work in both cases.

I understand it's not a "crash bug" nor a regression, since 0.7 does better here than 0.6. But I'm nonetheless puzzled by the result, and use it as an opportunity to dig in further.

You initialize using ZSTD_compressBegin(), not ZSTD_compressBegin_usingDict() ?

KrzysFR commented 8 years ago

The text is the lyrics of IAM - L'Empire du Coté Obscur :microphone: repeated until I get > 300K, chosen to make about 2 and a half blocks.

Initialized with ZSTD_compressBegin(.., 1), then three calls to ZSTD_compressContinue(...) each time reusing the same buffer. First 2 calls are full (131,072) last call only has 46,592 bytes. Then ZSTD_compressEnd().

In 0.6.0: all three calls to ZSTD_compressContinue() (whether with full buffer or not) produce ~2500 bytes.
In 0.7.0: only the last call (partial) returns 16 bytes.

In both cases, the second call does NOT return 16 bytes, but from the discussion above, it was established that this is because I'm reusing the buffers, and that I should use ZBUFF instead.

So again, maybe this is simply a corner case in ZSTD_compressContinue on weird input...

Cyan4973 commented 8 years ago

from the discussion above, it was established that this is because I'm reusing the buffers

ah ok, forgot about that. So the 300K are into a file (not a single continuous memory segment) which is loaded by blocks of 128KB into the same 128KB buffer.

Indeed, so if src pointer is always the same, ZSTD_compressContinue() presumes the previous content was overwritten (which is correct), and does not use it.

Last block get a bit more lucky : since it's not complete, it does not completely erase previous buffer. Therefore, 131072 - 46592 = 84480 bytes are considered still available. Using them, the last block can be properly compressed.

Not sure why it did not worked in 0.6, but the current behavior in 0.7 is indeed the proper one.

KrzysFR commented 8 years ago

Ohh ok, then if ZSTD_compressContinue() can be smart enough to reuse the tail of the buffer, this is even more "magical behavior" that needs to be explicitly told to the developer!

I have the habit of clearing or padding with a marker the tail of reused buffers, to help catch errors where I would store more than intended... In this case this would have introduced corruption.

I think the documentation of ZSTD_compressContinue should be clear about this fact (both impact on compression ratio, and that it can use even the tail of the buffer)

Cyan4973 commented 8 years ago

Sure.

It's a direct consequence of ZSTD_compressContinue() ability to use previously supplied input buffers to compress the current one, except when detected overwritten (src srcSize fields). This is indeed uncommon and deserve to be better document.

I will also discourage its usage in "normal circumstances". The official streaming interface should be ZBUFF, and that should be made clearer too.

KrzysFR commented 8 years ago

I finally had some time to use the ZBUFF API in my stream implementation, and it does indeed produce an optimally compressed output, smaller than the result of ZSTD_compress() by 3 or 4 bytes.

A bit of feedback on this API, from the point of view of a managed runtime

I'm not entirely sure how I should interpret the various results returned by ZBUFF_compressContinue: It seems that the amount of data written to dst and consumed from src are written to *dstCapacityPtr and *srcSizePtr

But I'm a bit confused how I should use the return value:

The current algorithm fills a fixed-size input buffer, and when it is full, it will try to compress it completely, via a fixed-size temp buffer, and flush it to the destination stream. Once done, the input buffer is cleared, and we wait for more input.

In this scenario, even if ZBUFF_compressContinue tells me that I can write another X bytes to fill the buffer, I may not have enough remaining in my buffer (or have much more), in which case I have to rely on my own arithmetic using the value stored at srcSizePtr, and ignore the return value (except for error checking).

I think that the current API targets people who already have the full input in memory, in which case they use the return value for the next call. But when the input is buffered, the caller should be the one in charge of the input buffer?

Also, I need to juggle with two buffers (src and dst which can become empty/full independently from one another), meaning that I have two conditions that require me to call ZBUFF_compressContinue again for the same chunk.

It makes the logic a bit more complex, and require a lot of unit tests to make sure this is correct in all cases (for any buffer size, with compressible or uncompressible data, ...)
I guess a reference implementation would help a lot for people having to port it to their favorite languages: one for input already in memory, and one for input read from a file. Also, with dst buffer large enough (using ZSTD_compressBound) and one smaller (fixed size).

The API requires pointers to variable sized integers (32 or 64 bits) which is a bit difficult to do for runtimes where all integers have a fixed size regardless of the architecture:

In .NET, all integers have a fixed size (int/uint = 32 bits, long/ulong = 64 bits). The only types whose size will change depending on the architecture are IntPtr and UIntPtr.
Unfortunately, these are very poor and don't support integer arithmetic yet (https://github.com/dotnet/coreclr/issues/963), so no easy way to +, - or += with pointers. I need to use temporary variables, and convert to/from uint around each call to a ZBUFF_compressXXX() method.
When working with fixed size buffers, I don't see how you would need more than 32 bits for sizes (unless you are allocating buffers larger than 4 GB??)
No other methods seem to require size_t* arguments, which makes this issue unique to the ZBUFF_compressXXX() methods.

KrzysFR commented 8 years ago

Argh, I forgot that GitHub adds a back reference to the target Issue, when you reference it from another repository. Oh well... :hushed:

Cyan4973 commented 8 years ago

Thanks for your very useful feedback @KrzysFR .

I believe one of our next major topic will be to stabilize a good streaming implementation.

From this discussion, I understand that :

a separate zbuff.h makes it less visible. Maybe it should be merged directly into zstd.h instead.
zbuff name doesn't make it clear it's a streaming implementation. Maybe name should be changed too, for example to zstream .
The currently available prototypes could be improved
- read-write arguments are not much liked (even though the idea was borrowed from zlib API which is supposed to be well known and proven already)
- size_t* can be a pain to use from C#, and maybe from other managed languages as well

When working with fixed size buffers, I don't see how you would need more than 32 bits for sizes (unless you are allocating buffers larger than 4 GB??)

Well, one of the regular complaints I get from LZ4 API is that it doesn't accept input buffers >= 2 GB. The recommendation is to cut such input into smaller pieces, which has always worked so far. But it requires time, support, additional implementation layer, etc. So, yes, such use case happen sometimes.

That being said, for a streaming implementation, it seems even more obvious that the work-around is to call it again with remaining buffer content.

I'm a bit confused how I should use the return value:

You don't have to. It's only an hint. Using it tends to simplify a few things inside the codec, so it is better for speed. But really, it's totally secondary. Better make it simpler for you to manage your buffers.

I need to juggle with two buffers (src and dst which can become empty/full independently from one another), meaning that I have two conditions that require me to call ZBUFF_compressContinue again for the same chunk.

Yes. I'm afraid this property is directly related to this type of streaming : if it does not impose condition on dstBufferSize, then it cannot guarantee consuming all input.

The most important condition is to ensure the input buffer is fully consumed. If it is, you can simply move on to next input buffer. Note that as long as dstBufferSize >= ZSTD_compressBound(srcSize), then it should successfully consume the entire input at each call.

The requirement to completely fill the output buffer before moving on to the next one is specific to your application. So it can only be simplified from within your application.

Now for an opened question : If you had the freedom to choose the streaming interface you wanted, which one would you want ?

KrzysFR commented 8 years ago

For coreclr people wondering what this issue is about:

I need to P/Invoke into a C library that has a lot of size_t or size_t* arguments, from .NET code that targets AnyCPU. I currently map them to UIntPtr from the managed side.

Most of the managed code simplifies the variable size problem by only dealing with uint for buffer sizes (even on 64 bits), and casting to/from UIntPtr before/after each PInvoke call.

But some methods have for size_t* arguments (mapped to UIntPtr* or ref UIntPtr) which must be updated before the call, and used after the call to do counter and pointer arithmetic. Currently, I need a separate set of UIntPtr temporary variables, and requires a lot of casting to/from uint when updating pointers.

Current situation

// lib.h: size_t SomePInvokeMethod(void*, size_t*, void*, size_t*)
// .NET: static extern UIntPtr SomePInvokeMethod(byte*, UIntPtr* , byte*, UIntPtr*)

byte[] input = ....; // some input to compress
byte[] buffer = new byte[BUFFER_SIZE]; // let's say 64KB
//
fixed(byte* src = &input[0]);
fixed(byte* dst = &buffer[0])
{
    byte* inPtr = src; // cursor in input buffer
    byte* outPtr = dst; // cursor in output buffer

    uint remaining = (uint) buffer.Length; // safe cast even though `size_t` is unsigned, because buffer.Length is always positive
    uint capacity = (uint) buffer.Length;

    // main compression loop
    while(remaining > 0)
    {
        UIntPtr outSize = (UIntPtr) capacity; // overwritten with amount written to dst after call
        UIntPtr inSize = (UIntPtr) remaining; // overwritten with amount read from src after call

        UIntPtr res = SomePInvokeMethod(outPtr, &outSize, inPtr, &inSize, ...);

        capacity = checked(capacity - (uint)  outSize); // *could* overflow on 32 bits, though unlikely
        outPtr += (uint) outSize; // I should probably also checked(...) this ...
        remaining = checked(remaining - (uint) inSize); // same has above
        inPtr += (uint) inSize;
        if (capacity == 0) { /* ... some logic .... */ }
    }
}

It would be much nicer to be able to do:

// lib.h: size_t SomePInvokeMethod(void*, size_t*, void*, size_t*)
// .NET: static extern nativeint SomePInvokeMethod(byte*, ref nativeint, byte*, ref nativeint)

byte[] input = ....; // some input to compress
byte[] buffer = new byte[BUFFER_SIZE]; // let's say 64KB
//
fixed(byte* src = &input[0]);
fixed(byte* dst = &buffer[0])
{
    byte* inPtr = src; // cursor in input buffer
    byte* outPtr = dst; // cursor in output buffer

    nativeint remaining = input.Length; // size remaining in input buffer
    nativeint capacity = buffer.Length; // size remaining in output buffer
    // main compression loop
    while(remaining > 0)
    {
        nativeint outSize = capacity; // no casting required
        nativeint inSize = remaining;

        nativeint res = SomePInvokeMethod(outPtr, ref outSize, inPtr, ref inSize, ...);

        capacity -= outSize; // can use `-=` on integer
        outPtr += outSize; // can use `+=` on pointer
        remaining -= inSize;
        inPtr += inSize;
        if (capacity == 0) { /* ... some logic .... */ }
    }
}

KrzysFR commented 8 years ago

I'm playing with various internal buffer sizes, and trying to change my algorithm so that I can decouple compressing (when input is full) from flushing to disk (when output is full). This way I can always write nice round 64KB buffers to disk (plays nicely with FILE_FLAG_WRITE_THROUGH on Windows)

While doing so, I encountered an issue where ZBUFF_compressEnd() can return a negative value, that is NOT recognized as an error by ZBUFF_isError.

The test uses a not very compressible suite of random numbers (in ASCII), and uses 64 KB buffer for input/output, chosen to make sure that I hit the path where ZBUFF_compressContinue and ZBUFF_compressEnd have to stall accepting input, because the output buffer is full. That way I can test all the various code paths.

The call to ZBUFF_compressEnd happens when the dst buffer is almost full (only 6,076 bytes remaining out of 65,536), but with still a lot of unflushed data in the compression context:

First call to ZBUFF_compressEnd(...) with dst=OUT_PTR + 59460 and *dstMaxSize = 6076 fills the dst buffer completely (post-call, *dstMaxSize == 6076), and returns 7005 which is positive so there are that many bytes remaining in the context.
I flush my output buffer to disk, so it become empty again.
Second call to ZBUFF_compressEnd(...) with dst=OUT_PTR + 0 and *dstMaxSize = 65536 (now empty) returns the value -58539 which is negative!
ZBUFF_isError(-58539) returns false which makes my code throw later with an OverflowException when I convert it to an unsigned int (the morale here is always to use checked(...) to do pointer and buffer arithmetic in .NET to detect these situations!)

The value -58539 could be the result of 7005 - 8 - 65536, where 7005 is the return value from the previous call, and 8 could be some framing bytes?

The return value comes from this code, so this would mean that the content of zbc is not updated properly?

size_t ZBUFF_compressFlush(ZBUFF_CCtx* zbc, void* dst, size_t* dstCapacityPtr)
{
    size_t srcSize = 0;
    ZBUFF_compressContinue_generic(zbc, dst, dstCapacityPtr, &srcSize, &srcSize, 1);  /* use a valid src address instead of NULL */
    return zbc->outBuffContentSize - zbc->outBuffFlushedSize;
}

Cyan4973 commented 8 years ago

This would qualify as a bug. Let's look into it.

First, I need a way to reproduce the bug...

Note : zbufftest is supposed to torture-test this interface, using fuzzing techniques. So it's surprising, as some "equivalent" conditions should have already been tested there...

KrzysFR commented 8 years ago

Input is 948,456 bytes of random digits (ASCII), compresses to 400,225 bytes with ZSTD_compress(level 1). Input and output buffers are 65,536 bytes.

Here is the output log of my test.

Legend:

OutputBlock(N) is called when input buffer is full with N bytes (so always 65536 except last call) that must be compressed. It clears the input buffer when done
EmitBlock(N) is called when the output buffer is full with N bytes that must be written to disk. It clears the output buffer when done
Last call to OutputBlock(..) will call ZBUFF_compressEnd() at the end
CompressContinue(A, B, C, D): A and C are byte pointers expressed as offset to the respective start. B and D are the content of the size pointers
> RES, outRemaining=B, inRemaining=D, RES is the return value, B and D are the content of the updated size pointers.

% OutputBlock(65536)
%% CompressContinue(0, 65536, 0, 65536)
%% > 65536, outRemaining=0, inRemaining=65536
% > inFill=0, outFill=0
% OutputBlock(65536)
%% CompressContinue(0, 65536, 0, 65536)
%% > 131072, outRemaining=55327, inRemaining=65536
% > inFill=0, outFill=55327
% OutputBlock(65536)
%% CompressContinue(55327, 10209, 0, 65536)
%% > 65536, outRemaining=0, inRemaining=65536
% > inFill=0, outFill=55327
% OutputBlock(65536)
%% CompressContinue(55327, 10209, 0, 65536)
%% > 131072, outRemaining=10209, inRemaining=65536
% EmitBlock(65,536, flush: False)
% > inFill=0, outFill=0
% OutputBlock(65536)
%% CompressContinue(0, 65536, 0, 65536)
%% > 65536, outRemaining=45095, inRemaining=65536
% > inFill=0, outFill=45095
% OutputBlock(65536)
%% CompressContinue(45095, 20441, 0, 65536)
%% > 131072, outRemaining=20441, inRemaining=65536
% EmitBlock(65536, flush: False)
% > inFill=0, outFill=0
% OutputBlock(65536)
%% CompressContinue(0, 65536, 0, 65536)
%% > 65536, outRemaining=34855, inRemaining=65536
% > inFill=0, outFill=34855
% OutputBlock(65536)
%% CompressContinue(34855, 30681, 0, 65536)
%% > 131072, outRemaining=30681, inRemaining=65536
% EmitBlock(65536, flush: False)
% > inFill=0, outFill=0
% OutputBlock(65536)
%% CompressContinue(0, 65536, 0, 65536)
%% > 65536, outRemaining=24606, inRemaining=65536
% > inFill=0, outFill=24606
% OutputBlock(65536)
%% CompressContinue(24606, 40930, 0, 65536)
%% > 131072, outRemaining=40930, inRemaining=65536
% EmitBlock(65536, flush: False)
% > inFill=0, outFill=0
% OutputBlock(65536)
%% CompressContinue(0, 65536, 0, 65536)
%% > 65536, outRemaining=14387, inRemaining=65536
% > inFill=0, outFill=14387
% OutputBlock(65536)
%% CompressContinue(14387, 51149, 0, 65536)
%% > 131072, outRemaining=51149, inRemaining=65536
% EmitBlock(65536, flush: False)
% > inFill=0, outFill=0
% OutputBlock(65536)
%% CompressContinue(0, 65536, 0, 65536)
%% > 65536, outRemaining=4149, inRemaining=65536
% > inFill=0, outFill=4149
% OutputBlock(65536)
%% CompressContinue(4149, 61387, 0, 65536)
%% > 131072, outRemaining=55311, inRemaining=65536
% > inFill=0, outFill=59460
Writes    : 948,456 bytes in 3,143 chunks
% OutputBlock(30952) FINAL BLOCK!
%% CompressContinue(59460, 6076, 0, 30952) // <-- last call to compress the tail of the input buffer
%% > 100120, outRemaining=0, inRemaining=30952 <-- all input consumed
%% CompressEnd(59460, 6076) // <-- called with almost full dst buffer
%% > 7005, outRemaining=6076
% EmitBlock(65536, flush: False)  // <-- write full buffer to disk
%% CompressEnd(0, 65536)  // <-- called again with emtpy dst buffer
!OVERFLOW! res = -58539

Cyan4973 commented 8 years ago

Bug reproduced :+1: Now onto chasing it ...

Cyan4973 commented 8 years ago

Tentative fix uploaded in the "dev" branch.

KrzysFR commented 8 years ago

It now works with the latest commit. I'm going to play with the buffer sizes to see if I can find more corner cases.

KrzysFR commented 8 years ago

Just noticed now that the first call to ZBUFF_compressEnd returns 7,002 bytes remaining, but the next calls still writes 7,005 bytes to the output.

I guess that the value return, when > 0, does not include additional framing. This means that even if I validate that it fits in my buffer just right, it could still need 3 more bytes, and require a third call ?

KrzysFR commented 8 years ago

Log after the fix: (in renamed a few methods in between)

% CompressInputBlock(30952, finalBlock: False)
%% CompressContinue(dst=OUT_PTR+59460, *dstMaxSize=6076, src=SRC_PTR+0, *srcSize=30952)
%% -> 100120, *dstMaxSize==0, *srcSize=30952
%% CompressEnd(dst=OUT_PTR+59460, *dstMaxSize=6076)
%% -> 7002, *dstMaxSize=6076
% EmitOutputBlock(65,536, flush: False)
%% CompressEnd(dst=OUT_PTR+0, *dstMaxSize=65536)
%% -> 0, *dstMaxSize=7005
% EmitOutputBlock(7,005, flush: True)
% > inFill=0, outFill=0

Cyan4973 commented 8 years ago

Correct. The previous bug was related to a faulty attempt at solving this issue (multiple calls). So my first action was to removed it.

That being said, I believe it can be done safely now. Let me have a second look.

KrzysFR commented 8 years ago

The point being that we would be able to decide whether to flush the output buffer or not, just by looking at the return value of ZBUFF_compressEnd.

Cyan4973 commented 8 years ago

Latest update in the "dev" branch should solve this issue too.

KrzysFR commented 8 years ago

Yes. I have also added calls to ZBUFF_compressFlush to handle explicit call to stream.Flush(). the final compressed size changes slightly depending on how often and where it is called, so I suppose that it closes the current block.

But does it mean that, if the stream was interrupted after a flush, the content COULD be read back up to that point? Other comments in zbuff.h seem to imply that unless ZBUFF_compressEnd is called, the decoder will reject the data.

Cyan4973 commented 8 years ago

Calling ZBUFF_compressFlush() basically close the current block, without waiting for it to be filled. So indeed the result will be different depending on its usage.

Other comments in zbuff.h seem to imply that unless ZSTD_compressEnd is called, the decoder will reject the data.

Reject is a bit too strong. Here is the quote :

The epilogue is required for decoders to consider a frame completed.

You can partially decompress a stream. Decoding is done block by block, so it's possible to stop at any block border.

But, if you are waiting for the decoder to tell you that "the frame is completed == fully decoded", you will never receive such message without the epilogue.

Cyan4973 commented 8 years ago

Fix pushed to master

facebook / zstd

Weird issues with using the streaming API vs `ZSTD_compress()` #206