fcladera opened this issue 1 year ago
You can run

```
h5dump -pH data.h5 | grep -A2 FILTERS
```

to see the file's compression filters, although it won't show which datasets are left uncompressed.
@k-chaney ran a few benchmarks on our data (i7-1065G7 processor, reading from an NVMe SSD):
Based on these plots, the read time with gzip is 1.8x the read time with no compression, and 3.6x that of the fastest filter tested (blosc_lz4_shuffle). While gzip does reduce file size, we don't think this read-performance penalty is acceptable. We are tempted to switch to blosc_lz4_shuffle for v1.2.
We did test this in Julia using H5Zblosc, and Blosc is one of the registered filters: https://portal.hdfgroup.org/display/support/Filters#Filters-32001.
Would you have any extra feedback on this @klowrey?
If it were me, I would release everything with gzip (level 4) to save on bandwidth costs for distribution, and also release a Python script that converts the files to whatever compression (or no compression) a user wants.
You can't control how people will access the data or what their systems look like, but you can control how you distribute it, and since gzip is available in every HDF5 distribution (including the default system packages), it is about as universal as it gets.
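As a sketch, the conversion script could simply shell out to `h5repack` (shipped with the HDF5 tools), which rewrites every dataset with a new filter pipeline; the filenames below are placeholders:

```
# Recompress every dataset with gzip level 4 (for distribution)
h5repack -f GZIP=4 data.h5 data_gzip4.h5

# Strip compression entirely (for fastest local reads)
h5repack -f NONE data_gzip4.h5 data_raw.h5
```

Since gzip is built into every HDF5 build, the distributed gzip files are readable everywhere, and users who care about read speed can repack locally.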
Current scenario: only some datasets in the `data.h5` files are compressed, using LZF. For homogeneity, it would be good to compress all datasets with the same filter. We should benchmark all candidate compression filters (compression ratio, read and write speed) and pick the one with the best results.
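A benchmark harness could look something like the sketch below. It uses Python's stdlib codecs (zlib, bz2, lzma) on synthetic data purely as stand-ins; the real comparison would go through the HDF5 filter pipeline (gzip, LZF, Blosc, ...) on actual `data.h5` datasets:

```python
import bz2
import lzma
import time
import zlib

# Synthetic, mildly compressible payload standing in for a real dataset.
data = bytes(range(256)) * 4096  # 1 MiB

# Candidate codecs: (compress, decompress) pairs.
codecs = {
    "zlib-1": (lambda d: zlib.compress(d, 1), zlib.decompress),
    "zlib-9": (lambda d: zlib.compress(d, 9), zlib.decompress),
    "bz2":    (bz2.compress, bz2.decompress),
    "lzma":   (lzma.compress, lzma.decompress),
}

results = {}
for name, (comp, decomp) in codecs.items():
    t0 = time.perf_counter()
    packed = comp(data)
    t_comp = time.perf_counter() - t0

    t0 = time.perf_counter()
    unpacked = decomp(packed)
    t_decomp = time.perf_counter() - t0

    assert unpacked == data  # round trip must be lossless
    results[name] = (len(data) / len(packed), t_comp, t_decomp)
    print(f"{name:7s} ratio={results[name][0]:6.1f}x "
          f"compress={t_comp * 1e3:7.2f} ms "
          f"decompress={t_decomp * 1e3:6.2f} ms")
```

The decompress column is the one that matters most here, since the datasets are written once and read many times.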