clbarnes / jeiss-convert

Convert Jeiss .dat files to HDF5
MIT License

Compression and other filter options #1

Open · mkitti opened this issue 2 years ago

mkitti commented 2 years ago

The hdf5plugin package is available on pip and conda-forge: https://github.com/silx-kit/hdf5plugin http://www.silx.org/doc/hdf5plugin/latest/
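As a minimal sketch of what using it looks like (file and dataset names here are placeholders): importing hdf5plugin registers its filters with libhdf5, after which h5py can read datasets written with them.

```python
import hdf5plugin  # importing has the side effect of registering the bundled filters
import h5py

# a dataset compressed with e.g. Zstd or Blosc can now be read as usual;
# "some_file.h5" and "data" are placeholder names
with h5py.File("some_file.h5", "r") as f:
    arr = f["data"][:]
```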

It would be nice if one could specify an arbitrary filter as in h5repack: https://portal.hdfgroup.org/display/HDF5/h5repack

               FILT - is a string with the format: 

                 <list of objects>:<name of filter>=<filter parameters> 

                 <list of objects> is a comma separated list of object names, meaning apply 
                   compression only to those objects. If no names are specified, the filter 
                   is applied to all objects 
                 <name of filter> can be: 
                   GZIP, to apply the HDF5 GZIP filter (GZIP compression) 
                   SZIP, to apply the HDF5 SZIP filter (SZIP compression) 
                   SHUF, to apply the HDF5 shuffle filter 
                   FLET, to apply the HDF5 checksum filter 
                   NBIT, to apply the HDF5 NBIT filter (NBIT compression) 
                   SOFF, to apply the HDF5 Scale/Offset filter 
                   UD,   to apply a user defined filter 
                   NONE, to remove all filters 
                 <filter parameters> is optional filter parameter information 
                   GZIP=<deflation level> from 1-9 
                    SZIP=<pixels per block,coding> pixels per block is an even number in 
                        2-32 and coding method is either EC or NN 
                   SHUF (no parameter) 
                   FLET (no parameter) 
                   NBIT (no parameter) 
                   SOFF=<scale_factor,scale_type> scale_factor is an integer and scale_type 
                       is either IN or DS 
                   UD=<filter_number,filter_flag,cd_value_count,value_1[,value_2,...,value_N]> 
                       required values for filter_number,filter_flag,cd_value_count,value_1 
                       optional values for value_2 to value_N 
                   NONE (no parameter) 
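For example, invocations using that syntax might look something like this (file names are placeholders; 32015 is zstd's registered filter ID, so the UD line applies zstd at level 3 as a mandatory filter):

               h5repack -f GZIP=6 in.h5 out.h5
               h5repack -f SHUF -f GZIP=6 in.h5 out.h5
               h5repack -f UD=32015,0,1,3 in.h5 out.h5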

Also note the SOFF above, which is the scale-offset filter that I previously discussed.

Here filters can be specified by their registered filter numbers: https://portal.hdfgroup.org/display/support/Registered+Filter+Plugins

Beyond the hdf5plugin Python package, filters can also be dynamically loaded, as detailed here: https://portal.hdfgroup.org/display/HDF5/HDF5+Dynamically+Loaded+Filters

The HDF Group maintains a repository of plugins here: https://github.com/hdfGroup/hdf5_plugins

mkitti commented 2 years ago

For h5py support, see custom compression filters here: https://docs.h5py.org/en/stable/high/dataset.html#custom-compression-filters
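To sketch what that looks like in practice (shapes, chunk sizes, and file names are illustrative): hdf5plugin exposes each filter as a mapping that expands into the compression= and compression_opts= keyword arguments h5py expects.

```python
import h5py
import hdf5plugin
import numpy as np

data = np.random.randint(0, 2**14, size=(1024, 1024), dtype=np.uint16)

with h5py.File("example.h5", "w") as f:
    # Blosc wrapping zstd with byte-shuffle; the hdf5plugin object expands
    # into compression= and compression_opts= for create_dataset
    f.create_dataset(
        "data",
        data=data,
        chunks=(256, 256),
        **hdf5plugin.Blosc(cname="zstd", clevel=5, shuffle=hdf5plugin.Blosc.SHUFFLE),
    )
```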

clbarnes commented 2 years ago

Is there a large demand for other filters? In most datasets I've come across in the wild (e.g. N5), I've mainly seen gzip. As I understand it, these raw images compress so poorly anyway that compression itself isn't necessarily worth the hassle, let alone variations between compression algorithms. There is certainly a benefit to keeping things simple and widely supported. However, if there are significant gains in storage efficiency and performance from using e.g. blosc and different compressors, I'd be willing to consider it.

These HDF5 files are unlikely to be the final form of these data - practically any downstream use will require scaling, contrast correction, and alignment, at which point other forms of filter and compression could be applied. My goal here is to produce a widely-compatible first form of the data so that everyone can use Jeiss images without having to concern themselves with the .dat format.

mkitti commented 2 years ago

We have experimented with compression filters in the past: https://docs.google.com/presentation/d/1d1xH93uxTnUBlr5IrWOQjTljEvakwmgZu1kweWYMGAo/edit?usp=sharing

Basically, we can get better compression using bitshuffle/zstd, either directly or via Blosc. The file is about 70% of the original size and decompresses about 4x faster than gzip.

clbarnes commented 2 years ago

That does look like a significant gain - are these plugins easily available in standard channels alongside HDF5 libraries? Don't want to impede adoption by getting too experimental!

mkitti commented 2 years ago

For Python, the plugins are very easily installable: https://pypi.org/project/hdf5plugin/ https://anaconda.org/conda-forge/hdf5plugin

In general, The HDF Group also provides downloadable binaries for each release: https://www.hdfgroup.org/downloads/hdf5/

I'm currently working on improving access via Java: https://github.com/scijava/pom-scijava/issues/181 https://github.com/JaneliaSciComp/jhdf5/tree/mkitti/hdf5_libsh

I'm hoping to put together a plugin package for ImageJ / FIJI soon once I can update the base jhdf5 library.

mkitti commented 2 years ago

The main issue with Java is the currently distributed jhdf5 library in FIJI statically links the original HDF5 library: https://sissource.ethz.ch/sispub/jhdf5/-/tree/master/libs/native/jhdf5

The library only exports the JNI symbols and not the original HDF5 symbols, which some of the plugins need. The branch I posted above fixes this by splitting the library into two shared libraries: hdf5 and jhdf5 (with the JNI symbols).

Some plugins, such as ZSTD, do not actually call back into the HDF5 library. In this case, setting HDF5_PLUGIN_PATH to either the HDF Group plugins or the Python package may be sufficient.
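A minimal sketch of that approach from Python (the plugin directory and file names are placeholders; HDF5 reads the variable when it first needs a filter, so it should be set before that happens):

```python
import os

# point at a directory containing the compiled filter shared libraries
# (placeholder path; e.g. the directory shipped by hdf5plugin or the
# HDF Group plugin binaries)
os.environ["HDF5_PLUGIN_PATH"] = "/opt/hdf5/plugins"

import h5py  # imported after setting the variable, before any filter is loaded

with h5py.File("zstd_compressed.h5", "r") as f:  # placeholder file name
    data = f["data"][:]
```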

clbarnes commented 2 years ago

I've added h5py's built-in byteshuffle, scale-offset, and checksum options on the basis that they're probably pretty ubiquitous. I'd like to be cautious about the others: I want to avoid users getting an HDF5 file and finding they can't open it with standard tooling, and even hdf5plugin requires all openers of the file to have the package imported.
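Concretely, those built-in options look like this in h5py (a sketch; shapes and chunk sizes are illustrative):

```python
import h5py
import numpy as np

data = np.random.randint(0, 2**14, size=(1024, 1024), dtype=np.uint16)

with h5py.File("builtin_filters.h5", "w") as f:
    f.create_dataset(
        "data",
        data=data,
        chunks=(256, 256),
        shuffle=True,     # byte shuffle (H5Z_FILTER_SHUFFLE)
        scaleoffset=0,    # lossless scale-offset for integer data
        fletcher32=True,  # Fletcher32 checksum (H5Z_FILTER_FLETCHER32)
    )
```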

mkitti commented 2 years ago

These are the filters within the HDF5 code base itself:

Filter identifiers for the filters distributed with the HDF5 Library are as follows:

| Filter identifier | Description |
| --- | --- |
| H5Z_FILTER_DEFLATE | The gzip compression, or deflation, filter |
| H5Z_FILTER_SZIP | The SZIP compression filter |
| H5Z_FILTER_NBIT | The N-bit compression filter |
| H5Z_FILTER_SCALEOFFSET | The scale-offset compression filter |
| H5Z_FILTER_SHUFFLE | The shuffle algorithm filter |
| H5Z_FILTER_FLETCHER32 | The Fletcher32 checksum, or error checking, filter |

https://portal.hdfgroup.org/display/HDF5/Filters

The main one that might be disabled is SZIP due to patent issues.

clbarnes commented 2 years ago

Got it, so even lzf isn't a given.

I've done some very loose benchmarking (one single-channel image, one run per configuration, writing to memory) and came up with this:

| rel_write_time | rel_read_time | rel_size | write_time (s) | read_time (s) | size (B) | filters |
| --- | --- | --- | --- | --- | --- | --- |
| 1.04 | 0.91 | 1.00 | 2.10 | 0.19 | 527883320 | (none) |
| 11.23 | 12.95 | 0.79 | 22.61 | 2.72 | 418421427 | gzip |
| 2.12 | 5.38 | 1.00 | 4.28 | 1.13 | 525460561 | lzf |
| 2.19 | 8.48 | 0.78 | 4.41 | 1.78 | 413065378 | scaleoffset |
| 7.39 | 15.05 | 0.77 | 14.87 | 3.16 | 403841668 | scaleoffset+gzip |
| 3.02 | 8.66 | 0.78 | 6.08 | 1.82 | 412653629 | scaleoffset+lzf |
| 1.13 | 1.69 | 1.00 | 2.28 | 0.35 | 527883320 | byteshuffle |
| 6.90 | 6.80 | 0.70 | 13.90 | 1.43 | 366895195 | byteshuffle+gzip |
| 2.09 | 4.43 | 0.82 | 4.20 | 0.93 | 434552392 | byteshuffle+lzf |
| 2.26 | 9.07 | 0.78 | 4.56 | 1.90 | 413065402 | byteshuffle+scaleoffset |
| 7.69 | 15.62 | 0.77 | 15.49 | 3.28 | 404077271 | byteshuffle+scaleoffset+gzip |
| 3.02 | 9.25 | 0.78 | 6.08 | 1.94 | 412653798 | byteshuffle+scaleoffset+lzf |
| 1.11 | 1.59 | 1.00 | 2.23 | 0.33 | 527883320 | bitshuffle |
| 1.17 | 1.80 | 0.74 | 2.36 | 0.38 | 390987495 | bitshuffle+lz4 |
| 1.03 | 1.01 | 1.00 | 2.07 | 0.21 | 527200382 | lz4 |
| 1.29 | 2.85 | 0.79 | 2.60 | 0.60 | 416709700 | zstd |
| 1.03 | 0.94 | 1.00 | 2.07 | 0.20 | 527883320 | blosc+blosclz+0sh |
| 1.57 | 2.61 | 0.87 | 3.15 | 0.55 | 458438347 | blosc+blosclz+Bsh |
| 1.07 | 0.97 | 1.00 | 2.16 | 0.20 | 527883320 | blosc+blosclz+bsh |
| 1.07 | 0.94 | 1.00 | 2.16 | 0.20 | 527120922 | blosc+lz4+0sh |
| 1.31 | 1.87 | 0.85 | 2.63 | 0.39 | 450420924 | blosc+lz4+Bsh |
| 1.10 | 0.94 | 1.00 | 2.22 | 0.20 | 527120922 | blosc+lz4+bsh |
| 4.66 | 1.23 | 1.00 | 9.38 | 0.26 | 525304316 | blosc+lz4hc+0sh |
| 7.13 | 1.59 | 0.76 | 14.37 | 0.33 | 403198524 | blosc+lz4hc+Bsh |
| 4.69 | 1.39 | 1.00 | 9.44 | 0.29 | 525304316 | blosc+lz4hc+bsh |
| 11.86 | 14.17 | 0.79 | 23.88 | 2.97 | 418710050 | blosc+zlib+0sh |
| 11.28 | 5.65 | 0.68 | 22.72 | 1.18 | 361219997 | blosc+zlib+Bsh |
| 11.47 | 13.81 | 0.79 | 23.11 | 2.90 | 418710050 | blosc+zlib+bsh |
| 3.97 | 3.80 | 0.79 | 7.99 | 0.80 | 416265416 | blosc+zstd+0sh |
| 9.90 | 2.88 | 0.70 | 19.94 | 0.61 | 368839149 | blosc+zstd+Bsh |
| 3.88 | 3.56 | 0.79 | 7.82 | 0.75 | 416265416 | blosc+zstd+bsh |

Some of it doesn't seem to make much sense (e.g. not seeing any significant size decrease for some compressors), but it does look like blosc+zstd+byteshuffle is a good combination, at least for size and read speed. bitshuffle+lz4 is nearly as good while being quite a lot faster.
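For reference, a rough sketch of how a benchmark like this can be run with h5py (not the exact script used; the random array is a stand-in for the real image, reads may hit the chunk cache, and configurations here are just a subset, so treat numbers as indicative only):

```python
import time

import h5py
import numpy as np

# stand-in for the real single-channel image; random data won't compress like EM data
data = np.random.randint(0, 2**14, size=(4096, 4096), dtype=np.uint16)

configs = {
    "raw": {},
    "gzip": {"compression": "gzip"},
    "byteshuffle+gzip": {"shuffle": True, "compression": "gzip"},
}

for name, kwargs in configs.items():
    # driver="core" with backing_store=False keeps the file in memory
    with h5py.File(f"{name}.h5", "w", driver="core", backing_store=False) as f:
        t0 = time.perf_counter()
        ds = f.create_dataset("data", data=data, chunks=(512, 512), **kwargs)
        write_time = time.perf_counter() - t0

        t0 = time.perf_counter()
        _ = ds[:]
        read_time = time.perf_counter() - t0

        size = ds.id.get_storage_size()
        print(f"{name}: write {write_time:.2f}s, read {read_time:.2f}s, {size} B")
```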

mkitti commented 2 years ago

Some of these are not compressors at all; the shuffles just permute the data.
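For illustration, a tiny numpy sketch of what byte shuffle does to little-endian uint16 values (the permutation groups like bytes together so a downstream compressor sees longer runs):

```python
import numpy as np

a = np.array([258, 259, 260], dtype=np.uint16)  # bytes (LE): 02 01, 03 01, 04 01

# byte shuffle: gather all low bytes, then all high bytes
shuffled = a.view(np.uint8).reshape(-1, 2).T.tobytes()

print(a.tobytes().hex(" "))  # 02 01 03 01 04 01
print(shuffled.hex(" "))     # 02 03 04 01 01 01  <- long runs compress better
```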

mkitti commented 2 years ago

lz4 basically does really fast run-length encoding or similar. I found it can be very sensitive to the distribution of the data.

mkitti commented 2 years ago

I just heard that Mathworks is thinking about bundling some plugins with MATLAB.

https://www.mathworks.com/help/matlab/import_export/read-and-write-hdf5-datasets-using-dynamically-loaded-filters.html

clbarnes commented 2 years ago

Yeah, I know that some filters shouldn't be expected to compress, but there are a few blosc+compressor combinations with various shuffles that don't manage even 1% compression, which surprised me.

mkitti commented 2 years ago

By the way, what are Bsh and bsh? I'm assuming they are the different shuffles, but I am not clear which is which. For scale-offset, what were the scale and offset?

clbarnes commented 2 years ago

0sh = no shuffling, bsh = bit shuffling, Bsh = Byte shuffling.

For scale-offset, I used 0 where enabled, so HDF5 figures out the parameters on a per-chunk basis for lossless compression, as documented here: https://docs.h5py.org/en/stable/high/dataset.html#dataset-scaleoffset
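A quick sketch of that behaviour (shapes and chunk sizes are illustrative; the assertion holds because scaleoffset=0 on integer data is lossless):

```python
import h5py
import numpy as np

data = np.random.randint(0, 2**14, size=(512, 512), dtype=np.uint16)

with h5py.File("soff.h5", "w", driver="core", backing_store=False) as f:
    # scaleoffset=0 on integers: HDF5 chooses the minimum number of bits
    # needed per chunk, so the round trip is exact
    ds = f.create_dataset("data", data=data, chunks=(128, 128), scaleoffset=0)
    assert np.array_equal(ds[:], data)
```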

mkitti commented 2 years ago

That's what I had thought. I'm surprised that byte shuffle results in smaller files than bit shuffle. In my experience, bit shuffle tends to beat byte shuffle in terms of compressed size, so now I'm trying to imagine a scenario in which the converse could be true.