For h5py support, see custom compression filters here: https://docs.h5py.org/en/stable/high/dataset.html#custom-compression-filters
Is there a large demand for other filters? In most of the datasets I've come across in the wild (e.g. N5), I've mainly seen gzip. As I understand it, these raw images compress so poorly anyway that compression itself isn't necessarily worth the hassle, let alone variations between compression algorithms. There is certainly a benefit to keeping things simple and widely supported. However, if there are significant gains in storage efficiency and performance from using e.g. Blosc and different compressors, I'd be willing to support them.
These HDF5 files are unlikely to be the final form of these data - practically any downstream use will require scaling, contrast correction, and alignment, at which point other forms of filter and compression could be applied. My goal here is to produce a widely-compatible first form of the data so that everyone can use Jeiss images without having to concern themselves with the .dat format.
We have experimented with compression filters in the past: https://docs.google.com/presentation/d/1d1xH93uxTnUBlr5IrWOQjTljEvakwmgZu1kweWYMGAo/edit?usp=sharing
Basically, we can get better compression using bitshuffle / zstd, either directly or via Blosc. The file is about 70% of the original size and decompresses about 4x faster than gzip.
That does look like a significant gain - are these plugins easily available in standard channels alongside HDF5 libraries? Don't want to impede adoption by getting too experimental!
For Python, the plugins are very easily installable: https://pypi.org/project/hdf5plugin/ https://anaconda.org/conda-forge/hdf5plugin
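For illustration, here is a minimal sketch of what using those plugins from h5py looks like; the file name, dataset names, and chunk shape are made up, and the keyword arguments follow the hdf5plugin documentation linked above:

```python
import h5py
import hdf5plugin
import numpy as np

data = np.random.randint(0, 2**16, size=(1024, 1024), dtype=np.uint16)

with h5py.File("example.h5", "w") as f:
    # Blosc with zstd and bit-shuffling, similar to the combination discussed above.
    f.create_dataset(
        "blosc_zstd",
        data=data,
        chunks=(256, 256),
        **hdf5plugin.Blosc(cname="zstd", clevel=5, shuffle=hdf5plugin.Blosc.BITSHUFFLE),
    )
    # Standalone bitshuffle filter, which applies its built-in LZ4 stage by default.
    f.create_dataset(
        "bitshuffle_lz4",
        data=data,
        chunks=(256, 256),
        **hdf5plugin.Bitshuffle(),
    )
```

Any reader of the file also needs the plugins available (e.g. by importing hdf5plugin), as noted further down.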
In general, The HDF Group also provides downloadable binaries for each release: https://www.hdfgroup.org/downloads/hdf5/
I'm currently working on improving access via Java: https://github.com/scijava/pom-scijava/issues/181 https://github.com/JaneliaSciComp/jhdf5/tree/mkitti/hdf5_libsh
I'm hoping to put together a plugin package for ImageJ / FIJI soon once I can update the base jhdf5 library.
The main issue with Java is that the currently distributed jhdf5 library in FIJI statically links the original HDF5 library: https://sissource.ethz.ch/sispub/jhdf5/-/tree/master/libs/native/jhdf5
The library only exports JNI symbols and not the original HDF5 symbols, which some of the plugins need. The branch I posted above fixes this by splitting the library into two shared libraries: hdf5 and jhdf5 (with the JNI symbols).
Some plugins, such as ZSTD, do not actually call back into the HDF5 library. In this case, setting HDF5_PLUGIN_PATH to either the HDF Group plugins or the Python package may be sufficient.
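As a rough sketch of that approach from Python (the plugin directory shown is hypothetical, and the environment variable is best set before h5py, and thus libhdf5, is loaded):

```python
import os

# Hypothetical plugin directory; point this at wherever the HDF Group or
# hdf5plugin filter binaries are installed.
os.environ["HDF5_PLUGIN_PATH"] = "/opt/hdf5/lib/plugin"

import h5py  # imported after the environment variable is set

with h5py.File("compressed.h5", "r") as f:
    data = f["raw"][:]  # the dynamically loaded filter decodes transparently
```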
I've added h5py's built-in byteshuffle, scale-offset, and checksum options on the basis that they're probably pretty ubiquitous. I'd like to be cautious about the others: I want to avoid users getting an HDF5 file and finding they can't open it with standard tooling, and even hdf5plugin requires all openers of the file to have the package imported.
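For reference, those built-in options map onto standard h5py `create_dataset` keywords; a minimal sketch with a hypothetical dataset and chunk shape:

```python
import h5py
import numpy as np

data = np.random.randint(0, 2**16, size=(1024, 1024), dtype=np.uint16)

with h5py.File("builtin_filters.h5", "w") as f:
    f.create_dataset(
        "image",
        data=data,
        chunks=(256, 256),
        shuffle=True,     # byte shuffle (built-in SHUFFLE filter)
        scaleoffset=0,    # lossless scale-offset for integer data
        fletcher32=True,  # checksum filter
    )
```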
These are the filters within the HDF5 code base itself; the filter identifiers for the filters distributed with the HDF5 library are listed here:
https://portal.hdfgroup.org/display/HDF5/Filters
The main one that might be disabled is SZIP due to patent issues.
Got it, so even lzf isn't a given.
I've done some very loose benchmarking (one single-channel image, one run per configuration, writing to memory) and came up with this:
| rel_write_time | rel_read_time | rel_size | write_time (s) | read_time (s) | size (B) | filters |
|---------------:|--------------:|---------:|---------------:|--------------:|---------:|---------|
| 1.04 | 0.91 | 1.00 | 2.10 | 0.19 | 527883320 | (none) |
| 11.23 | 12.95 | 0.79 | 22.61 | 2.72 | 418421427 | gzip |
| 2.12 | 5.38 | 1.00 | 4.28 | 1.13 | 525460561 | lzf |
| 2.19 | 8.48 | 0.78 | 4.41 | 1.78 | 413065378 | scaleoffset |
| 7.39 | 15.05 | 0.77 | 14.87 | 3.16 | 403841668 | scaleoffset+gzip |
| 3.02 | 8.66 | 0.78 | 6.08 | 1.82 | 412653629 | scaleoffset+lzf |
| 1.13 | 1.69 | 1.00 | 2.28 | 0.35 | 527883320 | byteshuffle |
| 6.90 | 6.80 | 0.70 | 13.90 | 1.43 | 366895195 | byteshuffle+gzip |
| 2.09 | 4.43 | 0.82 | 4.20 | 0.93 | 434552392 | byteshuffle+lzf |
| 2.26 | 9.07 | 0.78 | 4.56 | 1.90 | 413065402 | byteshuffle+scaleoffset |
| 7.69 | 15.62 | 0.77 | 15.49 | 3.28 | 404077271 | byteshuffle+scaleoffset+gzip |
| 3.02 | 9.25 | 0.78 | 6.08 | 1.94 | 412653798 | byteshuffle+scaleoffset+lzf |
| 1.11 | 1.59 | 1.00 | 2.23 | 0.33 | 527883320 | bitshuffle |
| 1.17 | 1.80 | 0.74 | 2.36 | 0.38 | 390987495 | bitshuffle+lz4 |
| 1.03 | 1.01 | 1.00 | 2.07 | 0.21 | 527200382 | lz4 |
| 1.29 | 2.85 | 0.79 | 2.60 | 0.60 | 416709700 | zstd |
| 1.03 | 0.94 | 1.00 | 2.07 | 0.20 | 527883320 | blosc+blosclz+0sh |
| 1.57 | 2.61 | 0.87 | 3.15 | 0.55 | 458438347 | blosc+blosclz+Bsh |
| 1.07 | 0.97 | 1.00 | 2.16 | 0.20 | 527883320 | blosc+blosclz+bsh |
| 1.07 | 0.94 | 1.00 | 2.16 | 0.20 | 527120922 | blosc+lz4+0sh |
| 1.31 | 1.87 | 0.85 | 2.63 | 0.39 | 450420924 | blosc+lz4+Bsh |
| 1.10 | 0.94 | 1.00 | 2.22 | 0.20 | 527120922 | blosc+lz4+bsh |
| 4.66 | 1.23 | 1.00 | 9.38 | 0.26 | 525304316 | blosc+lz4hc+0sh |
| 7.13 | 1.59 | 0.76 | 14.37 | 0.33 | 403198524 | blosc+lz4hc+Bsh |
| 4.69 | 1.39 | 1.00 | 9.44 | 0.29 | 525304316 | blosc+lz4hc+bsh |
| 11.86 | 14.17 | 0.79 | 23.88 | 2.97 | 418710050 | blosc+zlib+0sh |
| 11.28 | 5.65 | 0.68 | 22.72 | 1.18 | 361219997 | blosc+zlib+Bsh |
| 11.47 | 13.81 | 0.79 | 23.11 | 2.90 | 418710050 | blosc+zlib+bsh |
| 3.97 | 3.80 | 0.79 | 7.99 | 0.80 | 416265416 | blosc+zstd+0sh |
| 9.90 | 2.88 | 0.70 | 19.94 | 0.61 | 368839149 | blosc+zstd+Bsh |
| 3.88 | 3.56 | 0.79 | 7.82 | 0.75 | 416265416 | blosc+zstd+bsh |
Some of it doesn't seem to make much sense (e.g. not seeing any significant size decrease for some compressors), but it does look like blosc+zstd+byteshuffle is a good combination, for size and reading at least. bitshuffle+lz4 is nearly as good while being quite a lot faster.
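For context, a rough sketch of the kind of harness described above (in-memory h5py file, one timed write and read per filter configuration); the image size, chunk shape, and dataset name are placeholders, not the actual benchmark code:

```python
import time

import h5py
import numpy as np

# Placeholder single-channel image.
image = np.random.randint(0, 2**16, size=(16384, 16384), dtype=np.uint16)

def benchmark(name, **filter_kwargs):
    # Back the file with memory only, so disk speed doesn't dominate the timings.
    with h5py.File(name, "w", driver="core", backing_store=False) as f:
        start = time.perf_counter()
        ds = f.create_dataset("image", data=image, chunks=(512, 512), **filter_kwargs)
        f.flush()
        write_time = time.perf_counter() - start

        start = time.perf_counter()
        _ = ds[:]
        read_time = time.perf_counter() - start

        size = ds.id.get_storage_size()  # stored (compressed) size in bytes
    return write_time, read_time, size

# e.g. benchmark("gzip.h5", compression="gzip")
#      benchmark("shuffle_gzip.h5", shuffle=True, compression="gzip")
```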
Some of these are not compressors at all; the shuffles just permute the data.
lz4 basically does really fast run-length encoding or similar. I found it can be very sensitive to the distribution of the data.
I just heard that Mathworks is thinking about bundling some plugins with MATLAB.
Yeah, I know that some filters shouldn't be expected to compress, but there are a few blosc+compressor combinations with various shuffles without even 1% compression, which surprised me.
By the way, what are `Bsh` and `bsh`? I'm assuming they are the different shuffles, but I'm not clear which is which. For scale-offset, what were the scale and offset?
`0sh` = no shuffling, `bsh` = bit shuffling, `Bsh` = byte shuffling.
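To make the distinction concrete, here is a small NumPy illustration of what byte shuffling does to a little-endian 16-bit array (bit shuffling is the same idea, applied at the bit level); this is just an illustration of the permutation, not how the HDF5 filter is invoked:

```python
import numpy as np

arr = np.array([0x0102, 0x0304, 0x0506], dtype="<u2")
raw = arr.view(np.uint8)          # bytes on disk: 02 01 04 03 06 05

# Byte shuffle: gather byte 0 of every element, then byte 1 of every element.
byte_shuffled = raw.reshape(-1, arr.itemsize).T.ravel()
print(byte_shuffled.tobytes())    # b'\x02\x04\x06\x01\x03\x05'
```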
For scale-offset, I passed 0 where it was enabled, so HDF5 figures out the parameters on a per-chunk basis for lossless compression; this is documented here: https://docs.h5py.org/en/stable/high/dataset.html#dataset-scaleoffset
That's what I had thought. I'm surprised that byte shuffle results in smaller files than bit shuffle. In my experience, bit shuffle tends to beat byte shuffle in terms of compressed size, so now I'm trying to imagine a scenario in which the converse could be true.
The hdf5plugin package in pip and conda-forge can be found below: https://github.com/silx-kit/hdf5plugin http://www.silx.org/doc/hdf5plugin/latest/
It would be nice if one could specify an arbitrary filter, as in `h5repack`: https://portal.hdfgroup.org/display/HDF5/h5repack

Also note the `SOFF` option above, which is the scale-offset filter that I previously discussed. Here, filters can also be specified by their registered filter numbers: https://portal.hdfgroup.org/display/support/Registered+Filter+Plugins
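In the meantime, a rough Python sketch of repacking an existing dataset with a different, plugin-provided filter via h5py and hdf5plugin; the file paths, dataset name, and fallback chunk shape are placeholders:

```python
import h5py
import hdf5plugin

# Hypothetical input/output paths and dataset name; attributes and other
# datasets are not copied in this sketch.
with h5py.File("input.h5", "r") as src, h5py.File("repacked.h5", "w") as dst:
    data = src["image"][:]
    dst.create_dataset(
        "image",
        data=data,
        chunks=src["image"].chunks or (256, 256),
        **hdf5plugin.Zstd(clevel=5),
    )
```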
Beyond the `hdf5plugin` Python package, filters can also be loaded dynamically, as detailed here: https://portal.hdfgroup.org/display/HDF5/HDF5+Dynamically+Loaded+Filters

The HDF Group maintains a repository of plugins here: https://github.com/hdfGroup/hdf5_plugins