JuliaIO / HDF5.jl

Save and load data in the HDF5 file format from Julia
https://juliaio.github.io/HDF5.jl
MIT License
380 stars, 138 forks

Support szip (freely) #1132

Closed: PallHaraldsson closed this issue 7 months ago

PallHaraldsson commented 7 months ago

I see szip in the code, but I'm not sure whether non-proprietary code is used to compress and decompress. Please close this issue if free code is already in use; otherwise, I found a free drop-in replacement here:

https://gitlab.dkrz.de/k202009/

The algorithm was patented, but the patents have likely expired since then, given that I found that free code. This project links to a page that mentions non-commercial use only, which implies that the code currently used is not fully free/open source:

https://support.hdfgroup.org/doc_resource/SZIP/

EDIT:

https://www.hdfgroup.org/2017/05/hdf5-data-compression-demystified-2-performance-tuning/

The HDF5 Library comes with two predefined compression methods, GNUzip (Gzip) and Szip and has the capability of using third-party compression methods as well.

The "third-party" link leads to a file-not-found page, but I'm curious which other methods may be supported by the underlying library or by this package, e.g. zstd? And is Szip definitely supported with free code?

I see now it's zstd plus likely at least these (any more of interest?):

H5Zblosc = "c8ec2601-a99c-407f-b158-e79c03c2f5f7"
H5Zbzip2 = "094576f2-1e46-4c84-8e32-c46c042eaaa2"
H5Zlz4 = "eb20ec05-5464-47b5-ba41-098e3c1068a3"
H5Zzstd

zstd is a good standard, or at least a fast one, and Szip had the best compression, at least at the time, but perhaps no longer? Is some other method now considered best for scientific data, i.e. for size and/or speed, and if so, which?

mkitti commented 7 months ago

SZIP should be installed by default and enabled.

julia> using HDF5

julia> HDF5.Filters.isencoderenabled(HDF5.API.H5Z_FILTER_SZIP)
true

julia> HDF5.API.h5z_filter_avail(HDF5.API.H5Z_FILTER_SZIP)
true
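
The two checks above can be combined into a small script that reports the same information for more than one built-in filter. This is only a sketch: `filter_status` and `BUILTIN_FILTERS` are hypothetical names introduced for illustration, and the constant `HDF5.API.H5Z_FILTER_DEFLATE` is assumed to be exposed alongside `H5Z_FILTER_SZIP`. The output depends on how the local HDF5 library was built.

```julia
using HDF5

# Built-in filter IDs to check. Only H5Z_FILTER_SZIP appears in the session
# above; the DEFLATE constant is an assumption of this sketch.
const BUILTIN_FILTERS = [
    "deflate" => HDF5.API.H5Z_FILTER_DEFLATE,
    "szip"    => HDF5.API.H5Z_FILTER_SZIP,
]

# Hypothetical helper: report whether a filter is registered with the
# library and whether its encoder (compression side) is enabled.
function filter_status(filter_id)
    available = HDF5.API.h5z_filter_avail(filter_id)
    # Only ask about the encoder when the filter is registered at all.
    encoder = available && HDF5.Filters.isencoderenabled(filter_id)
    return (; available, encoder)
end

for (name, id) in BUILTIN_FILTERS
    status = filter_status(id)
    println(name, ": available=", status.available, ", encoder=", status.encoder)
end
```

If `szip` is reported as available but without an encoder, files compressed with SZIP can still be read but new datasets cannot be written with that filter.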

mkitti commented 7 months ago

HDF5_jll is one of two packages that depend on libaec_jll:

https://juliahub.com/ui/Packages/General/libaec_jll

mkitti commented 7 months ago

libaec_jll is built from the following free source:

https://github.com/JuliaPackaging/Yggdrasil/blob/8037b4b7169aa4436f286a6d2e6f6e2fbe63ce79/L/libaec/build_tarballs.jl#L9-L11

mkitti commented 7 months ago

For good measure, SZIP should be disambiguated from SZ: https://github.com/szcompressor/SZ

PallHaraldsson commented 7 months ago

Good to see libaec_jll has HDF5_jll as a dependent, and thus HDF5.jl. That's what I wanted to see, and I had actually looked at:

https://juliahub.com/ui/Packages/General/HDF5_jll

and libaec_jll is not listed there as a dependency, or I would not have opened this issue. I realize that the page shows cached information that is likely updated rarely, if ever. I've noticed missing packages there before. I suppose libaec_jll was added later, perhaps even recently.

I think I'll close the issue. Regarding SZ, if you're saying we should support it, then yes, provided it's widely used, so that such files can be read. Or perhaps we should support only the later variant linked from there, which seems very intriguing:

Note: SZ3 has been released here. SZ3 has much higher compression ratios than SZ2 in many cases, with comparable throughput (suffering slightly degraded throughput though). Details can be found in our ICDE21 paper.

SZ3: Kai Zhao, Sheng Di, Maxim Dmitriev, Thierry-Laurent D. Tonellot, Zizhong Chen, and Franck Cappello. "Optimizing Error-Bounded Lossy Compression for Scientific Data by Dynamic Spline Interpolation", Proceedings of the 37th IEEE International Conference on Data Engineering (ICDE 21), Chania, Crete, Greece, Apr 19 - 22, 2021.

SZauto: Kai Zhao, Sheng Di, Xin Liang, Sihuan Li, Dingwen Tao, Zizhong Chen, and Franck Cappello. "Significantly Improving Lossy Compression for HPC Datasets with Second-Order Prediction and Parameter Optimization", Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing (HPDC 20), Stockholm, Sweden, 2020. (code: https://github.com/szcompressor/SZauto/)