HDFGroup / hdf5

Official HDF5® Library Repository

HDF5 crashes with inefficient compressors #108

Open lucasvr opened 3 years ago

lucasvr commented 3 years ago

I noticed that HDF5 crashes when I/O filters produce more data than the original dataset size.

When a dataset is created, its declared dimensions and data type determine how much data is expected when the time comes to write it with H5Dwrite. The I/O filter interface, however, allows a compressor to return either a number smaller than that (in which case it successfully compressed the data) or slightly larger (in which case the compressor didn’t do a good job).

Now, let’s say we have a really bad compressor that requires orders of magnitude more room than the original data (10,000x in the example below). What I observe is that HDF5 seems to truncate the data, so it’s not possible to retrieve it afterwards. In some cases, HDF5 even crashes when the dataset handle is closed.

Here’s an example I/O filter that reproduces the problem.

// build with 'g++ liberror.cpp -o libtestcrash.so -shared -fPIC -Wall -g -ggdb'
#include <hdf5.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>

extern "C" {

// Forward (compress) path: replace the chunk buffer with a 1,000,000-byte buffer,
// i.e. behave like a compressor that expands the 100-byte chunk 10,000x.
// Reverse (decompress) path: return the buffer as-is.
size_t callback(unsigned int flags, size_t cd_nelmts, const unsigned int *cd_values, size_t nbytes, size_t *buf_size, void **buf)
{
    if (flags & H5Z_FLAG_REVERSE) {
        return *buf_size;
    } else {
        char *newbuf = (char *) calloc(1000*1000, sizeof(char));
        free(*buf);
        *buf = newbuf;
        *buf_size = 1000*1000;
        return *buf_size;
    }
}

const H5Z_class2_t H5Z_UDF_FILTER[1] = {{
    H5Z_CLASS_T_VERS, 0x2112, 1, 1, "crash_filter", NULL, NULL, callback,
}};

H5PL_type_t H5PLget_plugin_type(void) { return H5PL_TYPE_FILTER; }
const void *H5PLget_plugin_info(void) { return H5Z_UDF_FILTER; }
}
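
For contrast, a forward pass that preserves the data and follows the usual H5Z buffer contract would look roughly like the sketch below. The "codec" here is just a byte copy and forward_copy is an illustrative name; this is not part of the reproducer.

#include <stdlib.h>
#include <string.h>

// Replace *buf with a new buffer, update *buf_size to the allocated size,
// and return the number of valid output bytes (0 signals filter failure).
static size_t forward_copy(size_t nbytes, size_t *buf_size, void **buf)
{
    char *out = (char *) malloc(nbytes);
    if (out == NULL)
        return 0;                  // filter failure: HDF5 aborts the write cleanly

    memcpy(out, *buf, nbytes);     // a real compressor would encode into 'out' here
    free(*buf);                    // hand the replacement buffer back to HDF5
    *buf = out;
    *buf_size = nbytes;            // allocated size of the new buffer
    return nbytes;                 // number of valid bytes in the output
}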

The corresponding application code is here:

// build with 'g++ mainerror.cpp -o mainerror -g -ggdb -Wall -lhdf5'
// run with 'HDF5_PLUGIN_PATH=$PWD ./mainerror file.h5'
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <hdf5.h>

#define CHECK(hid) if ((hid) < 0) { fprintf(stderr, "failed @line %d\n", __LINE__); exit(1); }

int main(int argc, char **argv)
{
    if (argc != 2) {
        printf("Syntax: %s <file.h5>\n", argv[0]);
        exit(1);
    }
    hsize_t dims[2] = {10, 10};
    hid_t file_id = H5Fcreate(argv[1], H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT); // create (or truncate) the output file so the reproducer runs standalone
    CHECK(file_id);
    hid_t space_id = H5Screate_simple(2, dims, NULL);
    CHECK(space_id);
    hid_t dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
    CHECK(dcpl_id);
    CHECK(H5Pset_filter(dcpl_id, 0x2112, H5Z_FLAG_MANDATORY, 0, NULL));
    CHECK(H5Pset_chunk(dcpl_id, 2, dims));
    hid_t dset_id = H5Dcreate(file_id, "crash_dataset", H5T_STD_I8LE, space_id, H5P_DEFAULT, dcpl_id, H5P_DEFAULT);
    CHECK(dset_id);
    char *data = (char *) calloc(dims[0] * dims[1], sizeof(char));
    CHECK(H5Dwrite(dset_id, H5T_STD_I8LE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data));
    CHECK(H5Dclose(dset_id));
    CHECK(H5Pclose(dcpl_id));
    CHECK(H5Sclose(space_id));
    CHECK(H5Fclose(file_id));
    free(data);
    return 0;
}

If you change the I/O filter code so that it allocates 10x10 or even 100x100 bytes, the problem doesn’t kick in.
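
For scale, relative to the 10x10 int8 chunk:

10 * 10     =       100 bytes  ->      1x the uncompressed chunk size
100 * 100   =    10,000 bytes  ->    100x
1000 * 1000 = 1,000,000 bytes  -> 10,000x (the failing case above)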

lucasvr commented 3 years ago

Just a heads up: the problem seems to be related to the chunking logic. Once the callback returns to the caller (H5D__chunk_flush_entry -> H5Z_pipeline -> callback), H5D__chunk_flush_entry calls H5D__chunk_file_alloc, which fails.

Using the example above, in which the dataset is described as a 10x10 grid, we see in H5D__chunk_file_alloc that idx_info->layout->size = 100:

(gdb) p *idx_info->layout
$17 = {
  idx_type = H5D_CHUNK_IDX_BTREE,
  flags = 0 '\000',
  ndims = 3,
  dim = {10, 10, 1, 0 <repeats 30 times>},
  enc_bytes_per_dim = 1,
  size = 100,
  nchunks = 1,
  max_nchunks = 1,
   ...

That structure member determines how many bytes the chunk index reserves for encoding a chunk’s size, while the actual (filtered) chunk size, passed in via the new_chunk argument, is used to compute new_chunk_size_len = (H5VM_log2_gen((uint64_t)(new_chunk->length)) + 8) / 8;. When the filtered chunk needs more bytes than the layout allows, the chunk is considered too large to be encoded -- and the call to the function fails.

(gdb) p new_chunk->length
$21 = 1000000

The check that fails is:

/* Check if the chunk became too large to be encoded */
if(new_chunk_size_len > allow_chunk_size_len)
    HGOTO_ERROR(H5E_DATASET, H5E_BADRANGE, FAIL, "chunk size can't be encoded")
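
Plugging the values above into that expression (H5VM_log2_gen(x) is effectively floor(log2(x))):

new_chunk->length      = 1,000,000  ->  new_chunk_size_len = (19 + 8) / 8 = 3 bytes
idx_info->layout->size = 100        ->  the same expression gives (6 + 8) / 8 = 1 byte

So the filtered chunk needs more bytes to encode its size than the chunk metadata reserves for it, and the check fails.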

This is just a preliminary investigation -- I hope this helps you somehow.

fortnern commented 3 months ago

Unfortunately I think fixing this would require a file format change, since the width of the encoded chunk size isn't stored in the file, but calculated as (in the filtered case) the minimum number of bytes required to encode the size of an uncompressed chunk + 1. This should allow for chunks to expand by a factor of at least 256, but 10000 may be too big. Is there a practical use case for a filter to expand a chunk by this much?
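
For the reproducer above, that rule works out roughly like this (a back-of-the-envelope based on the rule described here):

uncompressed chunk: 10 x 10 x 1 byte = 100 bytes  ->  1 byte to encode its size, +1 byte of slack = 2 bytes
2 bytes can encode filtered chunk sizes up to 65,535 bytes (~655x here; the extra byte
    guarantees at least 256x for any chunk size)
10,000-byte output (the 100x100 variant)  ->  fits in 2 bytes  ->  accepted
1,000,000-byte output (the reproducer)    ->  needs 3 bytes    ->  "chunk size can't be encoded"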

lucasvr commented 3 months ago

Thanks for coming back to this issue, Neil.

Back when I looked into this issue, I wanted to produce the contents of the dataset on the fly with user-defined functions. In a nutshell, the user defines a dataset (say, a small array that represents data collected by a rain gauge) and, when the application invokes H5Dread() on that dataset, the HDF5-UDF I/O filter loads a bytecode that's stored in the data area of that dataset and runs it (so it can access an IoT server to read the rain gauge measurements, for example).

Because such a bytecode can be much larger than the small array, the I/O filter may need to store more data than the actual number of bytes returned to the application. In other words, it may behave like a pretty bad compressor.

Given the difficulty of fixing this, I guess that the best approach on my side would be to rewrite the HDF5-UDF I/O filter as a VFD or, easier, to simply fail the creation of the dataset if the bytecode size exceeds the data size by more than 256x. As for the HDF5 library, it would be nice if it could fail gracefully rather than crashing, perhaps by returning an error on H5Dwrite() or H5Dclose().
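
A sketch of what such a creation-time check could look like on my side (check_payload_ratio, bytecode_len and dataset_len are placeholder names, not actual HDF5-UDF code):

#include <stdio.h>
#include <stddef.h>

// Refuse to create the dataset when the stored bytecode would expand the
// chunk beyond the ~256x growth the chunk metadata is guaranteed to encode.
static int check_payload_ratio(size_t bytecode_len, size_t dataset_len)
{
    if (bytecode_len > 256 * dataset_len) {
        fprintf(stderr,
                "bytecode is %zu bytes but the dataset holds only %zu bytes "
                "(more than 256x larger); refusing to create the dataset\n",
                bytecode_len, dataset_len);
        return -1;
    }
    return 0;
}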

fortnern commented 3 months ago

I think the plan is to modify the file format to always encode chunk length using 64 bits in the next major release.