facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net
Other
23.82k stars 2.11k forks source link

Question and suggestion - behavior with small buffers and errors #4183

Open Eugeny1 opened 3 weeks ago

Eugeny1 commented 3 weeks ago

I am a starter with zstd (using simple ZSTD_compress and ZSTD_decompress) and have three things to say after some research and having specific task.

Most probably I can find the answers looking into the code, but I'd like to state the issues I have at least as a first step.

  1. I need to compress the data and see if compressed size is less than uncompressed (to decide what data to go with). I'd allocate new buffer of the same size of source data and try to compress there. I would expect if compressed data is bigger, I'd get an out of buffer error. However, the following things make me wondering:

    • I read that "bigger buffer makes compression faster". Because logically extrapolating this statement the smaller buffer makes compression slower, and what happens when buffer size is so small it is about to get out of memory (compression stalls)? While my question may sound funny and stupid, I believe it must be addressed.
    • I can't find clear definition of errors returned. Apart from all the errors, I'd like to check for exactly this out of memory error, but struggling to figure out how to do it.
  2. In general, looking at the documentation here herel I was unable to quickly find how to handle the errors except using

    unsigned ZSTD_isError(size_t code); /!< tells if a size_t function result is an error code / const char ZSTD_getErrorName(size_t code); /!< provides readable string from an error code */

with first function saying that there's an error, and second returning the string. But what is exact error? This lists errors, am I expected to compare against those identifiers using ZSTD_getErrorCode()? The documentation I linked to earlier does not mention this function! Also what is the error enum for out of destination buffer issues - ZSTD_error_noForwardProgress_destFull? What does this name mean?

  1. Small, but I found it kind of bothering. When looking at the error enum identifiers, I see some of them label destination as "dest" and some as "dst". That would be very convenient if uniform approach would be selected, so that searching for the keyword shows all the instances related to destination.

Thanks.

Cyan4973 commented 3 weeks ago

I'd allocate new buffer of the same size of source data and try to compress there. I would expect if compressed data is bigger, I'd get an out of buffer error.

Yes, this design works, and is in active usage. A prime example would be compressed filesystems, like BtrFS or ZFS.

Because logically extrapolating this statement the smaller buffer makes compression slower,

Yes, but don't overthink it. The difference is small, and only perceptible at fastest compression levels.

what happens when buffer size is so small it is about to get out of memory (compression stalls)?

We probably assign different meaning to these words. When destination buffer is too small, compression simply aborts, with an error code.

I can't find clear definition of errors returned.

They are all published in zstd_errors.h, which you seem to have found later in the issue.

I expected to compare against those identifiers using ZSTD_getErrorCode()?

Yes

The documentation I linked to earlier does not mention this function!

The API manual is automatically generated from zstd.h, but this specific method is defined in zstd_errors.h.

Ideally, we would duplicate or expose its definition in zstd.h, so that the manual generator picks it up. But this method requires the ZSTD_ErrorCode type, which is in zs2_errors.h.

Possibly including zs2_errors.h from zstd.h could solve that.

edit : oh, but this would require moving zs2_errors.h from "static linking only" mode to "stable" status. Quite an upgrade. But I think we are ready for that now.

what is the error enum for out of destination buffer issues

ZSTD_error_dstSize_tooSmall. You could also find it by creating a local example and looking at the outcome.

I see some of them label destination as "dest" and some as "dst".

The naming convention has mostly converged overtime to use dst. Though there might be some left over using dest here and there.

The only error code using dst is ZSTD_error_noForwardProgress_destFull. All other ones use dst.

While we could consider updating this specific error name to use dst instead, we have a responsibility to keep this interface working for existing applications, meaning the current dest variant must remain available anyway.

Eugeny1 commented 3 weeks ago

Thank you so much for your help and support, much appreciated!

May I have some more suggestions please

The API manual is automatically generated from zstd.h

At the very top of the manual add statement that it is auto-generated using sources. Then people (like me) will be discounting the quality as it was not written by the professional manual writers. To ensure reader's expectations from the very start.

ZSTD_error_dstSize_tooSmall

Add a section with error codes into the manual, with corresponding enum values (names) and descriptions what are they about. I hope ZSTD_getErrorName() are descriptive enough and you can just automatically create the table with corresponding names and description strings. That may help and be a quick reference for newbies.