daniilidis-group / m3ed

M3ED Dataset

Benchmark compression and compress all datasets #5

Open fcladera opened 1 year ago

fcladera commented 1 year ago

Current scenario: only some of the datasets in the data.h5 files are compressed (using LZF).

For homogeneity, it would be good to compress all the datasets with the same filter. We should benchmark the candidate compression filters (compression ratio and compression/decompression speed) and pick the one with the best trade-off.
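A minimal benchmarking sketch along these lines, using h5py (the `events` dataset name, the filter list, and the random payload are placeholders; a representative slice of a real data.h5 dataset should be used instead):

```python
import os
import time

import h5py
import numpy as np

# Placeholder payload; swap in a representative slice of a real data.h5 dataset,
# since random data compresses much worse than real sensor data.
data = np.random.randint(0, 2**16, size=(2000, 1280), dtype=np.uint16)

filters = {
    "none": {},
    "lzf": {"compression": "lzf"},
    "gzip4": {"compression": "gzip", "compression_opts": 4},
}

for name, kwargs in filters.items():
    path = f"bench_{name}.h5"

    t0 = time.perf_counter()
    with h5py.File(path, "w") as f:
        f.create_dataset("events", data=data, chunks=True, **kwargs)
    write_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    with h5py.File(path, "r") as f:
        _ = f["events"][...]
    read_s = time.perf_counter() - t0

    size_mb = os.path.getsize(path) / 1e6
    print(f"{name:6s} write {write_s:6.2f}s  read {read_s:6.2f}s  size {size_mb:8.1f} MB")
```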

klowrey commented 1 year ago

Run `h5dump -pH data.h5 | grep COMPRESSION` to see the file's compression methods, although it won't show which datasets haven't been compressed.
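A small h5py sketch that does the same check and also reports datasets stored with no filter at all (dataset names come from the file itself):

```python
import h5py

# Walk data.h5 and report each dataset's compression filter; uncompressed
# datasets show up as compression=None, which the h5dump output above skips.
with h5py.File("data.h5", "r") as f:
    def report(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f"{name}: compression={obj.compression}, opts={obj.compression_opts}")
    f.visititems(report)
```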

fcladera commented 1 year ago

@k-chaney ran a few benchmarks on our data (i7-1065G7 processor, reading from an NVMe SSD):

[Plots: read_time and file size for each compression filter]

Based on these plots, the read time with gzip is 1.8x the read time with no compression, and 3.6x that of the fastest filter (blosc_lz4_shuffle). While gzip does reduce the file size, we don't think the reduction justifies that hit to read performance. We are tempted to switch to blosc_lz4_shuffle for v1.2.

We did test this in Julia using H5Zblosc, and Blosc is one of the registered HDF5 filters: https://portal.hdfgroup.org/display/support/Filters#Filters-32001.
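For reference, a sketch of writing a dataset with that filter from Python via the hdf5plugin package (the dataset name and payload here are placeholders):

```python
import h5py
import hdf5plugin  # importing this registers the Blosc filter (HDF5 filter id 32001)
import numpy as np

data = np.zeros((1000, 1280), dtype=np.uint16)  # placeholder payload

with h5py.File("blosc_test.h5", "w") as f:
    f.create_dataset(
        "events",  # placeholder dataset name
        data=data,
        chunks=True,
        **hdf5plugin.Blosc(cname="lz4", clevel=5, shuffle=hdf5plugin.Blosc.SHUFFLE),
    )
```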

Would you have any extra feedback on this @klowrey?

klowrey commented 1 year ago

If this were me, I would just release everything with gzip (level 4) to save on bandwidth costs for distribution, and also release a Python script that converts the files to whatever compression (or no compression) a user wants.

You can't control how people will access the data or what their systems are like, but you can control how you distribute it, and since gzip is available in every HDF5 distribution (including the default system ones), it seems very universal.
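A rough sketch of such a conversion script using only h5py (the `recompress` name and the gzip-level-4 default are illustrative; `h5repack` from the HDF5 tools can do the same job from the command line):

```python
import sys

import h5py


def recompress(src_path, dst_path, compression="gzip", compression_opts=4):
    """Copy every group/dataset (and its attributes) into a new file with the given filter."""
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        for k, v in src.attrs.items():
            dst.attrs[k] = v

        def copy(name, obj):
            if isinstance(obj, h5py.Dataset):
                # Loads each dataset fully into memory; chunk-wise copying would be
                # needed for datasets larger than RAM.
                out = dst.create_dataset(name, data=obj[...], chunks=True,
                                         compression=compression,
                                         compression_opts=compression_opts)
            else:
                out = dst.require_group(name)
            for k, v in obj.attrs.items():
                out.attrs[k] = v

        src.visititems(copy)


if __name__ == "__main__":
    # e.g. python recompress.py data.h5 data_gzip4.h5
    recompress(sys.argv[1], sys.argv[2])
```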