facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

Version Information for debugging #2275

Closed nand28 closed 3 years ago

nand28 commented 4 years ago

Hi,

ZSTD guarantees successful decompression of data compressed with older versions of zstd. But is the reverse also guaranteed, that is, compressing with a newer version of zstd (at one node) and decompressing with an older version (at another node)? If that can break at any point in the future, we would like to identify such cases by knowing the version with which the data was compressed. The version of zstd currently used for compression can be obtained via the ZSTD_version() API, but can we similarly obtain the zstd version from the compressed data? Just as an optional checksum can be appended to the compressed data, could an optional version number be appended and then verified for compatibility during decompression?

Also, is optional checksum allowed for single step compression using context?

Thanks.

felixhandte commented 4 years ago

Hi @nand28,

Yes, it is intended that both old-to-new and new-to-old roundtrips should always work (as long as both endpoints are >=v0.8). The compression format is stable.

The only exception to that is that we sometimes find bugs in the decompressor and fix them (you can search closed PRs with the bug tag, if you want to see examples).

can we obtain the zstd version of the compressed data?

The first four bytes of the compressed frame are, in effect, a version identifier. "\x28\xB5\x2F\xFD" indicates that the blob uses the stable (v0.8+) format.
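As a rough illustration of the magic-number check described above, the sketch below tests whether a buffer starts with those four bytes. The helper names `read_le32` and `starts_with_zstd_magic` and the constant `STABLE_FRAME_MAGIC` are made up for this example and are not part of the zstd API; the only assumption taken from the format is that the magic number is stored little-endian.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* "\x28\xB5\x2F\xFD" read as a little-endian 32-bit value (hypothetical name). */
#define STABLE_FRAME_MAGIC 0xFD2FB528u

/* Reads a 32-bit little-endian value, independent of host endianness. */
static uint32_t read_le32(const unsigned char *p)
{
    assert(p != NULL);
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 if buf starts with the stable-format magic number, 0 otherwise. */
static int starts_with_zstd_magic(const void *buf, size_t size)
{
    if (buf == NULL || size < 4) return 0;
    return read_le32((const unsigned char *)buf) == STABLE_FRAME_MAGIC;
}
```

This only identifies the format generation (v0.8+), not the library release that produced the data.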

Also, is optional checksum allowed for single step compression using context?

Yes, you can enable checksumming on any frame, though you may have to use a different API entry point than you are used to in order to do so. For example, ZSTD_compressCCtx ignores any configuration in the cctx; it only uses the cctx for efficiency. So you'd want to use something like ZSTD_compress2() instead:

ZSTD_CCtx_reset(cctx, ZSTD_reset_session_and_parameters); /* clear the cctx */
ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 3); /* set compression level, etc. */
ZSTD_CCtx_setParameter(cctx, ZSTD_c_checksumFlag, 1); /* enable checksumming */
size_t dstSize = ZSTD_compress2(cctx, dst, dstCapacity, src, srcSize);
assert(!ZSTD_isError(dstSize));

See this comment for more.

nand28 commented 4 years ago

Thanks Felix for the quick response.

In case of unexpected bugs during decompression, we would like to detect the exact version of zstd with which the data was compressed. The magic number only tells us that the data uses the stable format, and we will mostly be using zstd versions later than that. Is it possible to accommodate the version in the compressed data?

For checksum inclusion, is it necessary to reset the context every time? That could add delays. I mean, which sticky parameters should be taken care of when reusing a context for single-step compression?

Does decompression validate the data with the checksum stored?

felixhandte commented 4 years ago

Is it possible to accommodate version in the compressed data?

There aren't any fields in a zstd frame that are really appropriate for that purpose. You could however append a skippable frame with whatever metadata you want.

Is it necessary to reset the context everytime as that could be adding to delays?

If you want to use the same parameters every time there's no need to reset. Though it should also be noted that resetting the context is extremely cheap.

Does decompression validate the data with the checksum stored?

Yes.

Cyan4973 commented 4 years ago

The skippable frame is designed for watermarking.

The content of the skippable frame is yours: you can add anything you want into it, from the zstd library version to emitter identity, source host, parameters, PID, etc. This capability is frequently useful when debugging network applications.

Note though that the content is not free, and will consume bandwidth. So in regular production mode, you will likely want to limit the bandwidth impact of watermarking, either by reducing the payload and/or sampling, or by disabling it altogether.
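A watermark of this kind could be assembled by hand along the following lines. This is a minimal sketch based on the format specification, under which a skippable frame is a 4-byte little-endian magic number in the range 0x184D2A50 to 0x184D2A5F, a 4-byte little-endian user-data size, and then the user data itself. The function name `write_skippable_frame`, its `variant` parameter, and the constants are hypothetical and not part of the zstd API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKIPPABLE_MAGIC_BASE  0x184D2A50u /* valid range: 0x184D2A50..0x184D2A5F */
#define SKIPPABLE_HEADER_SIZE 8u          /* 4-byte magic + 4-byte size */

/* Writes a 32-bit value in little-endian order. */
static void write_le32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xFF);
    p[1] = (unsigned char)((v >> 8) & 0xFF);
    p[2] = (unsigned char)((v >> 16) & 0xFF);
    p[3] = (unsigned char)((v >> 24) & 0xFF);
}

/* Writes one skippable frame carrying `payload` into dst.
 * `variant` (0..15) selects the low nibble of the magic number.
 * Returns the number of bytes written, or 0 if the frame does not fit. */
static size_t write_skippable_frame(void *dst, size_t dst_capacity,
                                    const void *payload, size_t payload_size,
                                    unsigned variant)
{
    unsigned char *out = (unsigned char *)dst;
    assert(variant <= 15);
    if (out == NULL || payload_size > UINT32_MAX) return 0;
    if (dst_capacity < SKIPPABLE_HEADER_SIZE ||
        dst_capacity - SKIPPABLE_HEADER_SIZE < payload_size) return 0;
    write_le32(out, SKIPPABLE_MAGIC_BASE + variant);
    write_le32(out + 4, (uint32_t)payload_size);
    if (payload_size > 0) memcpy(out + SKIPPABLE_HEADER_SIZE, payload, payload_size);
    return SKIPPABLE_HEADER_SIZE + payload_size;
}
```

The payload might be, for instance, the value returned by ZSTD_version() formatted as text; the resulting bytes are then concatenated directly before or after the compressed frame, with no padding.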

nand28 commented 4 years ago

That's cool.

Do these skippable frames have to be appended to the compressed buffer by hand, or are there any APIs for skippable frame handling? I hope decompression can include the skippable frame size in the source size, and that the frame will be skipped anyway. Does using a skippable frame require handling decompression like multi-frame decompression?

Upon checksum validation failure, what is the error returned? Is there any way to read this checksum?

Cyan4973 commented 4 years ago

These skippable frames have to be appended to compressed buffer

Skippable frames can be placed before or after a zstd frame. The only condition is that they be concatenated directly (no padding). From a decompression perspective, position doesn't matter: they will be skipped anyway.

From an exploitation perspective it may matter: for example, it's generally easier to detect the skippable frame when it's in front.

are there any apis for skippable frame handling?

Very little. The main document is the format specification, which is fairly simple when it comes to skippable frames.

There is no API to generate a skippable frame. There is also no API to exploit the content of a skippable frame. Both topics are within the user's domain of responsibility. The only guarantee from the decoder is that it will skip over them, ignoring them.

There is however an (optional) API symbol able to detect a skippable frame. This is ZSTD_getFrameHeader().

Note though that this function is part of the advanced API, which is not labelled stable, so it requires ZSTD_STATIC_LINKING_ONLY to be exposed. The reason is that it depends on a structure definition, and any change to this structure will break the ABI. It's safe to use in a static linking scenario, since the version is guaranteed at compile time, but it's not safe to use in combination with a dynamic library, since the library version will be discovered at runtime, and it could happen that a different version of the library defines the structure slightly differently.

I hope decompression can include skippable frame size

The content size of the skippable frame is part of its header. It's known by the decoder, and reported by ZSTD_getFrameHeader().
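For setups where the advanced API is not an option (dynamic linking, for instance), the same information can be read straight from the 8-byte header. The sketch below is a hand-rolled parser written against the format specification; `parse_skippable_header` is a hypothetical helper, not a zstd function, and it assumes the buffer starts exactly at a frame boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads a 32-bit little-endian value, independent of host endianness. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* If src starts with a skippable frame, stores the size of its user data
 * in *content_size and returns 1. Otherwise returns 0 and leaves
 * *content_size untouched. */
static int parse_skippable_header(const void *src, size_t src_size,
                                  uint32_t *content_size)
{
    const unsigned char *in = (const unsigned char *)src;
    assert(content_size != NULL);
    if (in == NULL || src_size < 8) return 0;
    /* Skippable magic numbers span 0x184D2A50..0x184D2A5F, so mask the low nibble. */
    if ((read_le32(in) & 0xFFFFFFF0u) != 0x184D2A50u) return 0;
    *content_size = read_le32(in + 4);
    return 1;
}
```

On success, the user data starts at offset 8 and the next frame starts at offset 8 + content size, which the caller should still check against the buffer length.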

Does using skippable frame requires to handle decompression like multiframe decompression?

This is a form of multi-frames.

Impact on decoding stage varies, from being completely transparent, when invoking ZSTD_decompress() for example, to generating an additional "stop at end of frame" event in streaming mode (ZSTD_decompressStream()).

Upon checksum validation failure, what is the error returned?

ZSTD_error_checksum_wrong

Is there any way to read this checksum?

Read the last 4 bytes from the compressed frame. No dedicated method is provided by the API to extract the checksum value, though ZSTD_getFrameHeader() will tell if the checksum is present or not.
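Reading those last 4 bytes might look like the sketch below. It assumes the buffer holds exactly one complete standard zstd frame, with no trailing skippable frame or second frame, and that the checksum is stored little-endian like the other fields in the format. It checks the Content_Checksum_flag (bit 2 of the frame header descriptor byte that follows the magic number) by hand rather than through ZSTD_getFrameHeader(). `read_content_checksum` is a hypothetical helper, not part of the zstd API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads a 32-bit little-endian value, independent of host endianness. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Extracts the 4-byte content checksum stored at the end of a single,
 * complete zstd frame. Returns 1 and fills *checksum on success; returns 0
 * if the buffer is too small, is not a standard zstd frame, or has no
 * checksum. */
static int read_content_checksum(const void *frame, size_t frame_size,
                                 uint32_t *checksum)
{
    const unsigned char *in = (const unsigned char *)frame;
    assert(checksum != NULL);
    /* Smallest checksummed frame: 4 (magic) + 2 (header) + 3 (block header) + 4. */
    if (in == NULL || frame_size < 13) return 0;
    if (read_le32(in) != 0xFD2FB528u) return 0;
    if ((in[4] & 0x04) == 0) return 0; /* Content_Checksum_flag not set */
    *checksum = read_le32(in + frame_size - 4);
    return 1;
}
```

The value obtained this way is only useful for debugging or logging; the decoder already verifies it during decompression.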

nand28 commented 4 years ago

Thanks Yann for clarifying.

Can the content size of skippable frame be zero at times? Will it have any impact?

Cyan4973 commented 4 years ago

Can the content size of skippable frame be zero at times?

Yes it can

Will it have any impact?

Well, it just occupies 8 bytes (the magic number plus the size field). That's all.

nand28 commented 4 years ago

Thanks for all the clarifications.

Read the last 4 bytes from the compressed frame. No dedicated method is provided by the API to extract the checksum value, though ZSTD_getFrameHeader() will tell if the checksum is present or not.

Can we get an API around this to return the content checksum given a compressed buf, if the checksum flag is present?

Also, is this checksum calculation done as a separate pass over the input data, or is it calculated in the same pass as the compression? In short, if we enable the content checksum, will the compression time increase by the time taken by XXH64() to compute the checksum?

Cyan4973 commented 4 years ago

Can we get an API around this to return the content checksum given a compressed buf, if the checksum flag is present?

To be fair, I don't see a good enough use case to justify an additional entry point. What's the goal of accessing the checksum value? Replaying the hash algorithm on the user side, knowing that it was already computed and verified automatically by the decoder? The only use case I can think of is debugging, and that's not enough to justify increasing API complexity.

is it calculated in the same pass as the compression happens?

Yes

if we enable content checksum, will the compression time increase by the time taken by XXH64() to compute the checksum?

Yes. Though don't expect it to be a major contributor, especially on the compression side. XXH64() is very fast, which is why it was selected. It will account for, at most, a few percent of the total compression time.