TRIQS / h5

A high-level C++ interface to the hdf5 library
https://triqs.github.io/h5

Adjustable compression level for groups #6

Closed by hmenke 3 years ago

hmenke commented 3 years ago

HDF5's deflate compression supports levels 0-9, but currently only level 1 is used for array data. Applying compression only to array data is reasonable, but sometimes I wish a higher level could be chosen, especially for quantities that are mostly zero. This PR makes the compression level adjustable on group creation and adds a few assertions to ensure that a valid value is chosen.
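For readers unfamiliar with the mechanism: deflate is attached per dataset via the dataset creation property list and requires a chunked layout. Below is a minimal sketch in terms of the raw HDF5 C API (the triqs/h5 wrapper's actual signatures are not shown in this thread, so `write_compressed` is a hypothetical helper):

```cpp
#include <hdf5.h>
#include <cassert>
#include <vector>

// Hypothetical helper illustrating what the patch does under the hood:
// the deflate filter is set on the dataset creation property list,
// which also has to request a chunked layout for any filter to apply.
void write_compressed(hid_t file, const char *name, int compression_level) {
  assert(compression_level >= 0 && compression_level <= 9); // valid deflate range

  std::vector<double> data(1 << 20, 0.0); // mostly-zero data compresses well
  hsize_t dims[1] = {data.size()};
  hid_t space = H5Screate_simple(1, dims, nullptr);

  hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
  hsize_t chunk[1] = {1024};
  H5Pset_chunk(dcpl, 1, chunk); // filters require chunking
  if (compression_level > 0) H5Pset_deflate(dcpl, compression_level);

  hid_t dset = H5Dcreate2(file, name, H5T_NATIVE_DOUBLE, space,
                          H5P_DEFAULT, dcpl, H5P_DEFAULT);
  H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data.data());

  H5Dclose(dset);
  H5Pclose(dcpl);
  H5Sclose(space);
}
```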

Unfortunately, the compression level does not yet round-trip through the file: when loading a dataset and inspecting the compression level, you get the level the group was created with (default: 1), not the one stored in the file. However, it seems that currently no filter information is read from the HDF5 file at all, so I left it at that.
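For reference, the HDF5 C API does expose the on-disk filter settings, so a round-trip would be possible in principle: for a deflate filter, the first client-data value holds the stored compression level. A sketch (again against the C API, not the triqs/h5 wrapper):

```cpp
#include <hdf5.h>

// Hypothetical helper: recover the deflate level actually stored with a
// dataset by walking the filter pipeline of its creation property list.
unsigned stored_deflate_level(hid_t dset) {
  hid_t dcpl     = H5Dget_create_plist(dset);
  int   nfilters = H5Pget_nfilters(dcpl);
  unsigned level = 0; // 0 if no deflate filter is present
  for (int i = 0; i < nfilters; ++i) {
    unsigned flags        = 0;
    size_t   cd_nelmts    = 1;
    unsigned cd_values[1] = {0};
    H5Z_filter_t filter = H5Pget_filter2(dcpl, static_cast<unsigned>(i), &flags,
                                         &cd_nelmts, cd_values, 0, nullptr, nullptr);
    if (filter == H5Z_FILTER_DEFLATE && cd_nelmts > 0) level = cd_values[0];
  }
  H5Pclose(dcpl);
  return level;
}
```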

https://support.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetDeflate

Wentzell commented 3 years ago

Dear @hmenke,

Do you have an example where the higher compression levels yield a substantial reduction of the archive size?

Our experience has been that the gains from the higher compression levels are marginal, while the cost of read/write operations increases substantially.

hmenke commented 3 years ago

Hm, you're right. I just tested this, and while I cannot confirm a huge performance hit, the compression gains are clearly diminishing (comparing level 1 vs. level 6).

In any case, this PR does not change the default compression level of 1, and other libraries such as h5py also expose the compression settings: https://docs.h5py.org/en/stable/high/dataset.html#filter-pipeline

In fact, this PR can also be viewed the other way around: it now allows disabling compression entirely, resulting in a considerable speedup (2.3x for my little test case).
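A rough sketch of how such a comparison could be set up, reusing the hypothetical `write_compressed` helper from the earlier sketch (level 0 skips the deflate filter entirely); the 2.3x figure is the measurement reported above, and results will vary with the data and machine:

```cpp
#include <hdf5.h>
#include <chrono>
#include <cstdio>

// From the earlier sketch; level 0 skips the deflate filter entirely.
void write_compressed(hid_t file, const char *name, int compression_level);

int main() {
  hid_t file = H5Fcreate("bench.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
  for (int level : {0, 1, 6}) {
    char name[16];
    std::snprintf(name, sizeof(name), "data_%d", level);
    auto t0 = std::chrono::steady_clock::now();
    write_compressed(file, name, level);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("level %d: %.3f ms\n", level,
                std::chrono::duration<double, std::milli>(t1 - t0).count());
  }
  H5Fclose(file);
}
```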

Wentzell commented 3 years ago

The goal of the Python layer is not to provide an exhaustive API like that of h5py, but rather a high-level API that allows for easy reading and writing of more complicated objects such as Green functions.

Do you have a particular use case where disabling the compression at the Python level would be helpful? In our experience, HDF5 read/write operations are usually not performance-critical, even for large objects.

hmenke commented 3 years ago

Looks like this is a lot less useful than I hoped it would be.