Blosc / c-blosc

A blocking, shuffling and loss-less compression library that can be faster than `memcpy()`.
https://www.blosc.org

spam on stderr: Output buffer size should be larger than 16 bytes #293

Closed kalvdans closed 4 years ago

kalvdans commented 4 years ago

Writing small arrays in HDF5 with Blosc as the filter seems to work fine, but an error message is printed on stderr.

$ h5import /dev/zero -d 1 -o /tmp/foo.hdf5 && h5repack -f UD=32001,0,0,0,0,4,1,2 /tmp/foo{,2}.hdf5
Output buffer size should be larger than 16 bytes
Output buffer size should be larger than 16 bytes
$ h5dump /tmp/foo2.hdf5 
HDF5 "/tmp/foo2.hdf5" {
GROUP "/" {
   DATASET "dataset0" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): 0
      }
   }
}
}

version info:

$ dpkg -l | grep blosc
ii  hdf5-blosc-plugin                                           1.8-0ubuntu1                                amd64        HDF5 Blosc plugin
ii  libblosc1                                                   1.17.1+ds1-1                                amd64        high performance meta-compressor optimized for binary data
estan commented 4 years ago

I can reproduce this issue with c-blosc from the master branch using the following steps:

git clone https://github.com/Blosc/hdf5-blosc.git
cd hdf5-blosc
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/tmp/inst -DPLUGIN_INSTALL_PATH=/tmp/inst/lib ..
make -j5 install   # Will clone and install c-blosc from master branch into blosc/
export LD_LIBRARY_PATH=/tmp/inst/lib:$PWD/blosc/lib
export HDF5_PLUGIN_PATH=/tmp/inst/lib
h5import /dev/zero -d 1 -o /tmp/foo.hdf5 && h5repack -f UD=32001,0,0,0,0,4,1,2 /tmp/foo{,2}.hdf5

This was with the following HDF5 on Ubuntu 20.04:

estan@edison:~$ dpkg -l libhdf5-103
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name              Version                 Architecture Description
+++-=================-=======================-============-==================================================================
ii  libhdf5-103:amd64 1.10.4+repack-11ubuntu1 amd64        Hierarchical Data Format 5 (HDF5) - runtime files - serial version
estan@edison:~$
FrancescAlted commented 4 years ago

Yes. Here HDF5 is trying to compress buffers that are 4 bytes long, and C-Blosc clearly cannot do that (the header alone takes 16 bytes). My guess is that blosc_compress() is returning a negative value indicating that the buffer cannot be compressed, and then HDF5 just copies the data as-is. Now, I suppose that a library like C-Blosc should be quiet in this regard. Hmmm, maybe these kinds of informational messages could be issued only when an environment variable (e.g. BLOSC_DEBUG) is set. Thoughts?

kalvdans commented 4 years ago

The documentation for blosc_compress() states

A negative return value means that an internal error happened. This should never happen.

so if hdf5-blosc-plugin is relying on that to probe for the minimum size, it is doing something wrong. It should use BLOSC_MAX_OVERHEAD instead. Let us just verify that the bug is within hdf5-blosc-plugin, and then we can close this issue without removing the printout.

FrancescAlted commented 4 years ago

Another (more likely) possibility is that blosc_compress() is returning a zero, which means:

If `src` buffer cannot be compressed into `destsize`, the return
  value is zero and you should discard the contents of the `dest`
  buffer.

In this case, both C-Blosc and hdf5-blosc-plugin are working correctly, and we only need to get rid of the warning (or, preferably, make it optional via a BLOSC_DEBUG environment variable).

estan commented 4 years ago

I confirmed with

diff --git a/src/blosc_filter.c b/src/blosc_filter.c
index bfd8c3e..fab141b 100644
--- a/src/blosc_filter.c
+++ b/src/blosc_filter.c
@@ -218,6 +218,7 @@ size_t blosc_filter(unsigned flags, size_t cd_nelmts,
     blosc_set_compressor(compname);
     status = blosc_compress(clevel, doshuffle, typesize, nbytes,
                             *buf, outbuf, nbytes);
+    printf("status = %d\n", status);
     if (status < 0) {
       PUSH_ERR("blosc_filter", H5E_CALLBACK, "Blosc compression error");
       goto failed;

that blosc_compress is returning -1:

estan@edison:~/hdf5-blosc/build$ h5import /dev/zero -d 1 -o /tmp/foo.hdf5 && h5repack -f UD=32001,0,0,0,0,4,1,2 /tmp/foo{,2}.hdf5
Output buffer size should be larger than 16 bytes
status = -1
Output buffer size should be larger than 16 bytes
status = -1
estan@edison:~/hdf5-blosc/build$

I'm surprised we're not getting an error printed from the HDF5 library, since the filter code handles this case using the following:

#define PUSH_ERR(func, minor, str, ...) H5Epush(H5E_DEFAULT, __FILE__, func, __LINE__, H5E_ERR_CLS, H5E_PLINE, minor, str, ##__VA_ARGS__)
    .
    .
    .
    if (status < 0) {
      PUSH_ERR("blosc_filter", H5E_CALLBACK, "Blosc compression error");
      goto failed;
    }

So it is pushing an error message about this on the HDF5 error stack.

In any case, since the filter failed and reported the failure to HDF5 by returning 0, HDF5 proceeded to save the dataset unfiltered:

estan@edison:~/hdf5-blosc/build$ h5dump -H -p /tmp/foo.hdf5 
HDF5 "/tmp/foo.hdf5" {
GROUP "/" {
   DATASET "dataset0" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      STORAGE_LAYOUT {
         CONTIGUOUS
         SIZE 4
         OFFSET 2048
      }
      FILTERS {
         NONE
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  H5D_FILL_VALUE_DEFAULT
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_LATE
      }
   }
}
}
estan@edison:~/hdf5-blosc/build$

(Note the FILTERS { NONE })

FrancescAlted commented 4 years ago

If C-Blosc is returning a -1, then this is a bug; it should be a 0 (cannot compress instead of an internal error). This should be fixed. And the hdf5-blosc plugin should treat the 0 value differently (i.e. not try to push an error into HDF5 itself).

It would be interesting to find out why HDF5 does not throw an error.

kalvdans commented 4 years ago

If C-Blosc is returning a -1, then this is a bug; it should be a 0 (cannot compress instead of an internal error).

I agree with this analysis. Don't bother with an environment variable; just delete the print statement. The code can be stepped through in a debugger if one is curious why it returns 0. I will do a pull request.

estan commented 4 years ago

It would be interesting to find out why HDF5 does not throw an error.

It seems the h5repack tool does not configure printing of the HDF5 error stack unless the undocumented option -E is given (see here and here).

If I add that option, I get the HDF5 error trace:

estan@edison:~$ h5import /dev/zero -d 1 -o /tmp/foo.hdf5 && h5repack -E -f UD=32001,0,0,0,0,4,1,2 /tmp/foo{,2}.hdf5
Output buffer size should be larger than 16 bytes
Output buffer size should be larger than 16 bytes
HDF5-DIAG: Error detected in HDF5 (1.10.4) thread 140479245659200:
  #000: ../../../src/H5D.c line 338 in H5Dclose(): can't decrement count on dataset ID
    major: Dataset
    minor: Unable to decrement reference count
  #001: ../../../src/H5I.c line 1364 in H5I_dec_app_ref_always_close(): can't decrement ID ref count
    major: Object atom
    minor: Unable to decrement reference count
.
.
.
  #014: ../../../src/H5Z.c line 1367 in H5Z_pipeline(): filter returned failure
    major: Data filters
    minor: Write failed
  #015: src/blosc_filter.c line 228 in blosc_filter(): Blosc compression error
    major: Data filters
    minor: Callback failed
H5tools-DIAG: Error detected in HDF5:tools (1.10.4) thread 140479245659200:
  #000: ../../../../../tools/src/h5repack/h5repack_copy.c line 354 in copy_objects(): do_copy_objects from </tmp/foo.hdf5> could not copy data to </tmp/foo2.hdf5>
    major: Failure in tools library
    minor: error in function
  #001: ../../../../../tools/src/h5repack/h5repack_copy.c line 1083 in do_copy_objects(): H5Dclose failed
    major: Failure in tools library
    minor: error in function
estan@edison:~$

And the exit status is 255, indicating failure:

estan@edison:~$ h5import /dev/zero -d 1 -o /tmp/foo.hdf5 && h5repack -f UD=32001,0,0,0,0,4,1,2 /tmp/foo{,2}.hdf5
Output buffer size should be larger than 16 bytes
Output buffer size should be larger than 16 bytes
estan@edison:~$ echo $? 
255
estan@edison:~$

If one extends the test case a bit, putting a second dataset /foo in the input file, one can see that this dataset is not present in the output file, since h5repack "gave up":

estan@edison:~$ h5import /dev/zero -d 1 -o /tmp/foo.hdf5 && h5import /dev/zero -d 1 -p /foo -o /tmp/foo.hdf5 && h5repack -f UD=32001,0,0,0,0,4,1,2 /tmp/foo{,2}.hdf5
Output buffer size should be larger than 16 bytes
Output buffer size should be larger than 16 bytes
estan@edison:~$ h5dump /tmp/foo2.hdf5 
HDF5 "/tmp/foo2.hdf5" {
GROUP "/" {
   DATASET "dataset0" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): 0
      }
   }
}
}
estan@edison:~$

So HDF5 is reporting a failure, and h5repack is handling it. It's just a little silent about it.

estan commented 4 years ago

So HDF5 is reporting a failure, and h5repack is handling it. It's just a little silent about it.

Arguably h5repack should handle this better and not create an output file when it fails, rather than leaving partial output behind.