Open mkitti opened 6 months ago
ZSTD_getFrameContentSize change has been implemented - but does not check for UNKNOWN.
@byrnHDF If it is helpful, I recently implemented ZSTD_getFrameContentSize with a check for ZSTD_CONTENTSIZE_UNKNOWN in C++ for numcodecs.js via WASM. The code is mostly just plain C though:
That does help to see the logic. However, wouldn't this only be a problem if someone created an hdf5 file outside of the hdf5 library? Because the HDF5 filter wouldn't be using the stream version?
The HDF5 C library does offer multiple APIs to read and write raw chunks. In particular, one could write a raw chunk via H5Dwrite_chunk. In this case, a 3rd party could employ Zstandard compression, perhaps even in parallel across multiple threads, and then use H5Dwrite_chunk to write the chunks using a single thread.
There are multiple instances where this approach is being taken by those interested in accelerated compression:
Thus, we cannot assume that all chunks in an HDF5 container were compressed by the compression routine present in this repository. It is possible that streaming compression was used to encode a chunk, and that the decompressed size was not encoded in the frame header.
I have encountered this specific issue in real-world data when working with Zarr datasets or N5 datasets. I anticipate that this may occur with HDF5 datasets at some point, and I hope that this reference implementation would be robust enough to deal with any Zstandard compressed data given its increasing ubiquity.
By the way, here are some additional examples in the Zstandard repository itself: https://github.com/facebook/zstd/blob/dev/examples/streaming_memory_usage.c https://github.com/facebook/zstd/blob/dev/examples/streaming_decompression.c
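To make the point about the frame header concrete, the sketch below inspects the first bytes of a Zstandard frame and reports whether a content size is stored at all. It follows the frame layout described in the Zstandard format specification (RFC 8878). The function name peek_frame_content_size and the FCS_* constants are hypothetical; the constants only mirror the values of ZSTD_CONTENTSIZE_UNKNOWN and ZSTD_CONTENTSIZE_ERROR. The sketch does not handle skippable frames, and real code should simply call ZSTD_getFrameContentSize.

```c
#include <stddef.h>
#include <stdint.h>

#define FCS_UNKNOWN ((uint64_t)-1) /* same value as ZSTD_CONTENTSIZE_UNKNOWN */
#define FCS_ERROR   ((uint64_t)-2) /* same value as ZSTD_CONTENTSIZE_ERROR */

/* Hypothetical helper: read the Frame_Content_Size field, if present,
 * from the start of a Zstandard frame (layout per RFC 8878). */
static uint64_t peek_frame_content_size(const unsigned char *src, size_t n)
{
    static const unsigned char magic[4] = {0x28, 0xB5, 0x2F, 0xFD};
    static const size_t did_bytes[4] = {0, 1, 2, 4}; /* Dictionary_ID sizes */
    static const size_t fcs_bytes[4] = {0, 2, 4, 8}; /* Frame_Content_Size sizes */

    if (n < 5)
        return FCS_ERROR;
    for (size_t i = 0; i < 4; i++)
        if (src[i] != magic[i])
            return FCS_ERROR;

    unsigned fhd = src[4];                   /* Frame_Header_Descriptor */
    unsigned fcs_flag = fhd >> 6;            /* bits 7-6 */
    unsigned single_segment = (fhd >> 5) & 1;/* bit 5 */
    unsigned did_flag = fhd & 3;             /* bits 1-0 */

    size_t fcs_len = fcs_bytes[fcs_flag];
    if (fcs_flag == 0 && single_segment)
        fcs_len = 1;                         /* 1-byte size in single-segment frames */
    if (fcs_len == 0)
        return FCS_UNKNOWN;                  /* header carries no content size */

    /* Window_Descriptor is present only when Single_Segment_Flag is clear. */
    size_t pos = 5 + (single_segment ? 0 : 1) + did_bytes[did_flag];
    if (n < pos + fcs_len)
        return FCS_ERROR;

    uint64_t value = 0;                      /* little-endian field */
    for (size_t i = 0; i < fcs_len; i++)
        value |= (uint64_t)src[pos + i] << (8 * i);
    if (fcs_len == 2)
        value += 256;                        /* 2-byte form is stored minus 256 */
    return value;
}
```

A frame whose descriptor byte is 0x00, as streaming compression without a pledged size typically produces, yields FCS_UNKNOWN, while a descriptor of 0x20 followed by a 0x00 size byte yields 0, an empty but known size. That is the distinction the obsolete ZSTD_getDecompressedSize cannot express.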
Introduction
The Zstandard plugin for HDF5 should be modified to allow for an unknown decompressed size in the frame header.
Currently, the Zstd decompression scheme, following the original implementation, uses ZSTD_getDecompressedSize to obtain the size of the decompressed buffer. The returned value is not validated and is passed directly to malloc.
https://github.com/HDFGroup/hdf5_plugins/blob/770d70ae73587714629cf5ec139d482c1562e7c1/ZSTD/src/H5Zzstd.c#L59-L60
ZSTD_getDecompressedSize returns 0 if the decompressed size is empty, unknown, or an error has occurred. If malloc is asked to allocate 0 bytes, it may return NULL (the result is implementation-defined), in which case the filter reports an error. This is an incorrect result if the decompressed size is actually empty or unknown and there is no actual error.
ZSTD_getDecompressedSize is obsolete. ZSTD_getFrameContentSize should replace the use of ZSTD_getDecompressedSize.
ZSTD_getFrameContentSize distinguishes between empty, unknown, and error states. The unknown and error states are indicated by a return value of ZSTD_CONTENTSIZE_UNKNOWN or ZSTD_CONTENTSIZE_ERROR, respectively.
The unknown decompressed size state is common. It occurs when compression is done through the streaming API via ZSTD_compressStream or ZSTD_compressStream2. ZSTD_compressStream2 in particular only stores the frame content size when either ZSTD_e_end is provided on the initial call or ZSTD_CCtx_setPledgedSrcSize is used.
Tasks
Use ZSTD_getFrameContentSize instead of the obsolete ZSTD_getDecompressedSize to correctly distinguish between empty, unknown, or error states when determining the decompressed size.
Check the return value of ZSTD_getFrameContentSize against ZSTD_CONTENTSIZE_UNKNOWN.
Fall back to streaming decompression via ZSTD_decompressStream when the decompressed size is unknown.
References
[1] https://facebook.github.io/zstd/zstd_manual.html