Apollo3zehn / PureHDF

A pure .NET library that makes reading and writing of HDF5 files (groups, datasets, attributes, ...) very easy.
MIT License

Large opaque datasets produce overflow #78

Closed · Blackclaws closed 2 months ago

Blackclaws commented 2 months ago
@Apollo3zehn So with larger opaque datasets I get the following error from h5web:

HDF5-DIAG: Error detected in HDF5 (1.14.2) thread 0:
  #000: /__w/libhdf5-wasm/libhdf5-wasm/build/1.14.2/_deps/hdf5-src/src/H5D.c line 403 in H5Dopen2(): unable to synchronously open dataset
    major: Dataset
    minor: Can't open object
  #001: /__w/libhdf5-wasm/libhdf5-wasm/build/1.14.2/_deps/hdf5-src/src/H5D.c line 364 in H5D__open_api_common(): unable to open dataset
    major: Dataset
    minor: Can't open object
  #002: /__w/libhdf5-wasm/libhdf5-wasm/build/1.14.2/_deps/hdf5-src/src/H5VLcallback.c line 1980 in H5VL_dataset_open(): dataset open failed
    major: Virtual Object Layer
    minor: Can't open object
  #003: /__w/libhdf5-wasm/libhdf5-wasm/build/1.14.2/_deps/hdf5-src/src/H5VLcallback.c line 1947 in H5VL__dataset_open(): dataset open failed
    major: Virtual Object Layer
    minor: Can't open object
  #004: /__w/libhdf5-wasm/libhdf5-wasm/build/1.14.2/_deps/hdf5-src/src/H5VLnative_dataset.c line 321 in H5VL__native_dataset_open(): unable to open dataset
    major: Dataset
    minor: Can't open object
  #005: /__w/libhdf5-wasm/libhdf5-wasm/build/1.14.2/_deps/hdf5-src/src/H5Dint.c line 1429 in H5D__open_name(): can't open dataset
    major: Dataset
    minor: Unable to initialize object
  #006: /__w/libhdf5-wasm/libhdf5-wasm/build/1.14.2/_deps/hdf5-src/src/H5Dint.c line 1494 in H5D_open(): not found
    major: Dataset
    minor: Object not found
  #007: /__w/libhdf5-wasm/libhdf5-wasm/build/1.14.2/_deps/hdf5-src/src/H5Dint.c line 1756 in H5D__open_oid(): can't retrieve message
    major: Dataset
    minor: Can't get value
  #008: /__w/libhdf5-wasm/libhdf5-wasm/build/1.14.2/_deps/hdf5-src/src/H5Omessage.c line 432 in H5O_msg_read(): unable to read object header message
    major: Object header
    minor: Read failed
  #009: /__w/libhdf5-wasm/libhdf5-wasm/build/1.14.2/_deps/hdf5-src/src/H5Omessage.c line 487 in H5O_msg_read_oh(): unable to decode message
    major: Object header
    minor: Unable to decode value
  #010: /__w/libhdf5-wasm/libhdf5-wasm/build/1.14.2/_deps/hdf5-src/src/H5Oshared.h line 74 in H5O__fill_new_shared_decode(): unable to decode native message
    major: Object header
    minor: Unable to decode value
  #011: /__w/libhdf5-wasm/libhdf5-wasm/build/1.14.2/_deps/hdf5-src/src/H5Ofill.c line 291 in H5O__fill_new_decode(): ran off end of input buffer while decoding
    major: Object header
    minor: Address overflowed

I'm not sure whether the issue comes from the way the HDF5 file is encoded on PureHDF's side or from h5web.

I've tried 1 MB images here.

h5py also isn't happy with it:

>>> v = file["group"]["opaque"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/felix/.cache/pypoetry/virtualenvs/net8-0-xM1W9L25-py3.11/lib/python3.11/site-packages/h5py/_hl/group.py", line 357, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 241, in h5py.h5o.open
KeyError: 'Unable to synchronously open object (ran off end of input buffer while decoding)'

Originally posted by @Blackclaws in https://github.com/Apollo3zehn/PureHDF/issues/76#issuecomment-2077207473

Apollo3zehn commented 2 months ago

v1.0.0-beta.14 should fix both issues. This one was caused by the fact that PureHDF, as a workaround for an HDF5 issue, always wrote the fill value into the file. That is not a problem for standard data types like double, where the fill value is only 8 bytes, but for an opaque data type of unlimited size the fill value can become as large as the data itself, effectively doubling the file size. I did not find it in the source code or in the spec, but I guess there is some limit on the maximum size of a fill value, which is why it worked up to a certain image size.
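(For anyone who wants to check what a writer put into a dataset's fill value message, h5py's low-level API exposes it. A minimal sketch, assuming a file readable by HDF5; "test.h5" is a placeholder and the group/dataset names follow the traceback above:)

    import h5py
    from h5py import h5d

    # Inspect the fill value status on the dataset creation property list.
    with h5py.File("test.h5", "r") as f:
        dcpl = f["group"]["opaque"].id.get_create_plist()
        status = dcpl.fill_value_defined()
        labels = {
            h5d.FILL_VALUE_UNDEFINED: "undefined",
            h5d.FILL_VALUE_DEFAULT: "default",
            h5d.FILL_VALUE_USER_DEFINED: "user-defined",
        }
        print("fill value:", labels[status])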

The other issue was caused by improper use of an internal cache. The base type of an opaque type is a byte array, and for all types the type information (e.g. how to serialize the data to the file) is cached. Opaque data and plain byte[] data must be treated differently, but both used the same cache entry, so opaque data was serialized with the wrong type information whenever the cache entry already existed, which is the case in your example because of that extra attribute.
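(In generic terms the pitfall looks like this; a Python sketch for illustration, not PureHDF's actual implementation, with all names invented:)

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class TypeInfo:
        datatype_class: str        # e.g. "array" for plain byte[] vs. "opaque"
        tag: str | None = None

    _cache: dict[object, TypeInfo] = {}

    def get_type_info_buggy(element_type: type, is_opaque: bool) -> TypeInfo:
        # BUG: the key ignores is_opaque, so whichever variant is requested
        # first gets cached and is then returned for both kinds of data.
        if element_type not in _cache:
            _cache[element_type] = (
                TypeInfo("opaque", tag="opaque-tag") if is_opaque else TypeInfo("array")
            )
        return _cache[element_type]

    def get_type_info_fixed(element_type: type, is_opaque: bool) -> TypeInfo:
        # FIX: make the usage part of the cache key so the two entries stay apart.
        key = (element_type, is_opaque)
        if key not in _cache:
            _cache[key] = (
                TypeInfo("opaque", tag="opaque-tag") if is_opaque else TypeInfo("array")
            )
        return _cache[key]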

I hope it works better now!

Blackclaws commented 2 months ago

Definitely works much better now :) Thanks a lot for the quick fix. I can now use 20 MB+ files as opaque datasets without any error.
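(A quick read-back check from Python, with placeholder file and object names matching the earlier traceback:)

    import h5py

    # Read the opaque dataset back and report its size; opaque data arrives
    # in h5py as numpy void data (dtype kind "V").
    with h5py.File("test.h5", "r") as f:
        ds = f["group"]["opaque"]
        raw = ds[()].tobytes()
        print(len(raw), "bytes, dtype kind:", ds.dtype.kind)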