HDFGroup / h5pyd

h5py distributed - Python client library for HDF Rest API

HSDS returns 'filter "H5Z_FLETCHER_DEFLATE" not recognized' when hsload copies datasets with the FLETCHER32 filter #128

Closed trmt closed 1 year ago

trmt commented 1 year ago

Is that a valid filter class (https://github.com/HDFGroup/h5pyd/blob/1bd5cf9ce4a8053ecd30e224604bcefc0e567f72/h5pyd/_hl/filters.py#L137), or should it be H5Z_FILTER_FLETCHER32?

jreadey commented 1 year ago

Thanks for reporting this!

This fixes the class name issue: https://github.com/HDFGroup/h5pyd/pull/129.

But if I run something like:

```python
import h5pyd

f = h5pyd.File("/home/john/fletch32.h5", mode="w")
dset = f.create_dataset("dset", (100, 100), dtype="i4", compression="gzip", fletcher32=True)
```

I still get an error from HSDS:


```
ERROR:root:POST error: 400
ERROR:root:POST error - status_code: 400, reason: filter {'class': 'H5Z_FILTER_FLETCHER32', 'id': 3, 'name': 'fletcher32'} is not supported
Traceback (most recent call last):
  File "/Users/john/projects/h5pyd/make_fletcher32.py", line 4, in <module>
    dset = f.create_dataset("dset", (100, 100), dtype="i4", compression="gzip", fletcher32=True)
  File "/Users/john/projects/h5pyd/h5pyd/_hl/group.py", line 361, in create_dataset
    dsid = dataset.make_new_dset(self, shape=shape, dtype=dtype, data=data, **kwds)
  File "/Users/john/projects/h5pyd/h5pyd/_hl/dataset.py", line 294, in make_new_dset
    rsp = parent.POST(req, body=body)
  File "/Users/john/projects/h5pyd/h5pyd/_hl/base.py", line 1041, in POST
    raise IOError(rsp.reason)
OSError: filter {'class': 'H5Z_FILTER_FLETCHER32', 'id': 3, 'name': 'fletcher32'} is not supported
```

Currently, other than compression filters, only the shuffle filter is supported in HSDS.
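Given that HSDS rejects unsupported filters server-side with a 400, a client could fail fast before sending the POST. Here's a minimal sketch of such a pre-check; the `SUPPORTED_FILTERS` set and `check_filter_kwargs` helper are hypothetical, not part of h5pyd, and the supported set just reflects the comment above (compression filters plus shuffle):

```python
# Hypothetical pre-check for filter kwargs HSDS can honor.
# The set below is an assumption based on the discussion, not HSDS source.
SUPPORTED_FILTERS = {"gzip", "lzf", "shuffle"}

def check_filter_kwargs(**kwds):
    """Collect requested filter names and raise on ones HSDS would reject."""
    requested = set()
    if kwds.get("compression"):
        requested.add(str(kwds["compression"]))
    if kwds.get("shuffle"):
        requested.add("shuffle")
    if kwds.get("fletcher32"):
        requested.add("fletcher32")
    unsupported = requested - SUPPORTED_FILTERS
    if unsupported:
        raise ValueError(f"filters not supported by HSDS: {sorted(unsupported)}")
    return requested
```

With this in place, `check_filter_kwargs(compression="gzip", fletcher32=True)` would raise locally instead of surfacing as a 400 from the server.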

Could you provide some background on your use case?

When using AWS S3 or Azure Blob storage, I don't think data corruption should normally be an issue. In these systems each chunk gets replicated across different drives, and they use checksums internally (ETags). Using fletcher32 when running HSDS with POSIX storage might be beneficial, but even there I think most modern filesystems (ext4, XFS) use checksums internally.
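For context on what protection is being discussed: fletcher32 attaches a per-chunk checksum that is verified on read. A minimal sketch of the standard Fletcher-32 algorithm over 16-bit little-endian words is below; note HDF5's implementation has its own handling of odd-length buffers, so this is illustrative rather than byte-for-byte identical to the library's filter:

```python
def fletcher32(data: bytes) -> int:
    """Standard Fletcher-32 over 16-bit little-endian words.

    Odd-length input is zero-padded; both running sums are reduced mod 65535.
    """
    if len(data) % 2:
        data += b"\x00"  # pad to an even number of bytes
    sum1 = sum2 = 0
    for i in range(0, len(data), 2):
        word = data[i] | (data[i + 1] << 8)  # assemble little-endian word
        sum1 = (sum1 + word) % 65535
        sum2 = (sum2 + sum1) % 65535
    return (sum2 << 16) | sum1

# Standard test vector: fletcher32(b"abcde") == 0xF04FC729
```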

For compatibility, we could have HSDS accept the filter option and just ignore it when reading and writing data. Not sure if that's acceptable or not.
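The accept-and-ignore idea could amount to partitioning a dataset's filter list at creation time into filters HSDS actually applies and filters it merely records. A rough sketch, using the filter-dict shape from the error message above (the `partition_filters` helper and the `APPLIED_CLASSES` set are hypothetical, not HSDS code):

```python
# Filter classes HSDS would actually apply to chunk data; an assumption
# based on the discussion (deflate compression plus shuffle).
APPLIED_CLASSES = {"H5Z_FILTER_DEFLATE", "H5Z_FILTER_SHUFFLE"}

def partition_filters(filters):
    """Split a filter list into (applied, ignored) instead of rejecting."""
    applied, ignored = [], []
    for f in filters:
        (applied if f["class"] in APPLIED_CLASSES else ignored).append(f)
    return applied, ignored
```

Under this scheme, `{'class': 'H5Z_FILTER_FLETCHER32', 'id': 3, 'name': 'fletcher32'}` would land in `ignored` but could still be kept in the dataset's metadata, so the request succeeds and the filter survives a round trip even though no checksum is computed.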

trmt commented 1 year ago

One area of research is the migration of existing data (which can contain datasets with various filters, including fletcher32) from a filesystem to the cloud. Since I use older versions of HSDS and h5pyd (Nov 2020) for several reasons, I can't reproduce that error. In my installation (after fixing that issue), HSDS saves the fletcher32 metadata when loading data with hsload, and drops it when downloading data with hsget. This behavior suits me so far.

jreadey commented 1 year ago

Ok - since this is working for you, I'll close this issue. The change is merged to master. If anyone needs an HSDS update to actually support fletcher32, please open an issue in the HSDS repo and we can discuss.