
Weird issues with using the streaming API vs `ZSTD_compress()` #206

Closed KrzysFR closed 8 years ago

KrzysFR commented 8 years ago

I'm trying out the streaming API to write the equivalent of .NET's GZipStream (source), and I'm seeing some strange things.

I'm using 0.6.1 for the tests, though the API seems to be unchanged in 0.7 at the moment.

The stream works by having an internal buffer of 128 KB (131,072 bytes exactly). Each call to Write(..) appends any number of bytes to that buffer (it could be called with 1 byte or with 1 GB). Every time the buffer is full, its content is compressed via ZSTD_compressContinue() into an empty destination buffer, and the result is copied into another stream down the line. When the producer is finished writing, it will Close the stream, which will compress any pending data in the internal buffer (so anywhere between 1 and 131,071 bytes), call ZSTD_compressEnd(), and flush the final bytes to the underlying stream.

Seen from zstd, the pattern looks like:
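Roughly, the equivalent C call sequence would be something like this (just a sketch of the pattern, not my actual .NET code; the signatures follow the 0.6-era block-level API and may differ between versions, error checks via ZSTD_isError() are omitted, and the compression level is a placeholder):

```c
#include <zstd.h>

size_t stream_like_compress(const char* src, size_t srcSize,
                            char* dst, size_t dstCapacity)
{
    const size_t CHUNK = 128 * 1024;        /* the stream's internal buffer size */
    ZSTD_CCtx* cctx = ZSTD_createCCtx();
    size_t written = 0, pos = 0;

    ZSTD_compressBegin(cctx, 1);            /* level 1 is just an example */

    /* Write(): every time the 128 KB buffer fills up, compress it */
    while (srcSize - pos >= CHUNK) {
        written += ZSTD_compressContinue(cctx, dst + written, dstCapacity - written,
                                         src + pos, CHUNK);
        pos += CHUNK;
    }

    /* Close(): compress the remaining 1..131,071 bytes, then write the epilogue */
    if (pos < srcSize)
        written += ZSTD_compressContinue(cctx, dst + written, dstCapacity - written,
                                         src + pos, srcSize - pos);
    written += ZSTD_compressEnd(cctx, dst + written, dstCapacity - written);

    ZSTD_freeCCtx(cctx);
    return written;
}
```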

I'm comparing the final result with the output of calling ZSTD_compress() on the complete content of the input stream (i.e. storing everything written into a memory buffer and compressing it in one step).
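That baseline boils down to a single call along these lines (again a sketch, reusing the same src/dst buffers as above; the compression level is a placeholder):

```c
/* One-shot reference: compress the whole buffered input in a single call. */
size_t refSize = ZSTD_compress(dst, dstCapacity, src, srcSize, 1);
```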

Issue 1: ZSTD_compress() adds an extra empty frame at the start

Looking at the compressed results, I see that a single call to ZSTD_compress() usually adds 6 extra bytes compared to the streamed output.

The left side is the compressed output of ZSTD_compress() on the whole file. The right side is the result of streaming with chunks of 128 KB on the same data:

Left size: 23,350 bytes. Right size: 23,344 bytes.

[image: hex dump comparison of the two compressed outputs]

The green part is identical between both files; only 7 bytes differ, right after the header and before the first compressed frame.

Both results, when passed to ZSTD_decompress(), return the original input text with no issues.

Issue 2: N calls to ZSTD_compressContinue() produce N times the size of a single call to ZSTD_compress() on highly compressible data

While testing with a text document duplicated a bunch of times to get to about 300 KB (i.e. the same 2 or 3 KB of text repeated about 100 times), I'm getting something strange.

Looking more closely: each call to ZSTD_compressContinue() returns about the same 2.5 KB (the first two calls with 128 KB worth of text each, the third call with only 50 KB), which is too exact to be a coincidence.

Since the dataset is the equivalent of "ABCABCABC..." repeated a hundred times, I'm guessing that compressing 25%, 50% or 100% of it would produce output of roughly the same size, since the compressed form is essentially "repeat 'ABC' 100 times" vs "repeat 'ABC' 200 times".

Except that when compressing 25% at a time, you get 4 times as many calls to ZSTD_compressContinue(), which gives you 4 times the output. Compressing 12.5% at a time would probably yield 8 times the output.

[image: comparison of the compressed results for the repetitive input]

When changing the internal buffer size from 128 KB down to 16 KB, I get a result of 45 KiB, which is about 6 times more than before.
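A quick back-of-the-envelope check of that scaling, using only the rough figures quoted above (approximate numbers from this post, not re-measured):

```
one-shot ZSTD_compress():                     1 block             ≈  2.5 KB
128 KB buffer:  ~300 KB / 128 KB  ≈  3 calls x ~2.5 KB per call   ≈  7.5 KB
 16 KB buffer:  ~300 KB /  16 KB  ≈ 19 calls x ~2.4 KB per call   ≈ 45   KB  (≈ 6x the 128 KB case)
```

This is consistent with the output growing roughly linearly with the number of ZSTD_compressContinue() calls rather than with the input size.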

Fudging the input data to get a lower compression ratio makes this effect progressively disappear, to the point where the result of the streaming API is about the same size as a single compression call (except for the weird extra 6 bytes from the previous issue).