Unidata / netcdf4-python

netcdf4-python: python/numpy interface to the netCDF C library
http://unidata.github.io/netcdf4-python
MIT License

blosc compressor #759

Open milos-korenciak opened 6 years ago

milos-korenciak commented 6 years ago

Could the blosc compressor be added as well, at least in the way PyTables supports it?

jswhit commented 6 years ago

The ability to add custom compression filters to netcdf-c has just been added to the c library, so this is now technically possible (although it still needs to be added to the python interface). See https://www.unidata.ucar.edu/software/netcdf/docs/md__Users_wfisher_Desktop_gitprojects_netcdf-c_docs_filters.html

jswhit commented 6 years ago

I'm curious why you want this feature though - the files created with a custom compression filter will not be portable. Other netcdf clients won't be able to read the file.

crusaderky commented 6 years ago

I have the same need. I plan to use the LZF compressor which is standard in the HDF5 library.

Could we have an authoritative direction on what the Python API should look like after the change? I need to replicate (quite urgently) the same API in the h5netcdf legacy API in order to expose it through xarray - see https://github.com/pydata/xarray/issues/1536.

Suggestion: in Dataset.createVariable, add compression and compression_opts parameters, exactly as in h5py. zlib=True, complevel=9 would then become deprecated aliases for compression="gzip", compression_opts=9.
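A minimal sketch of the argument mapping this suggestion implies. The function name and behavior here are my own invention for illustration, not anything in netcdf4-python:

```python
import warnings

def normalize_compression(compression=None, compression_opts=None,
                          zlib=False, complevel=4):
    """Map the legacy zlib/complevel keywords onto the proposed
    h5py-style (compression, compression_opts) pair."""
    if zlib:
        warnings.warn("zlib=True is deprecated; use compression='gzip'",
                      DeprecationWarning)
        if compression is None:
            compression = "gzip"
        if compression_opts is None:
            compression_opts = complevel
    return compression, compression_opts
```

With this shape, old calls like createVariable(..., zlib=True, complevel=9) would keep working while new code passes compression="gzip", compression_opts=9 directly.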

crusaderky commented 6 years ago

@DennisHeimbigner ping

milos-korenciak commented 6 years ago

Maybe it will not be portable at first, but blosc is simply superior to the current compression, and that should definitely be noted in the docs. In any case, if you have tons of boxes of INTERNAL data archived in .nc (and ALL your scripts use netcdf4-python), it is very attractive to gain speed and a better compression ratio simply by setting a different compressor in the config. (We cannot use zlib's maximum compression level because of the time it takes; with blosc we could raise the compression level and get a better time-size ratio. It's economics.) I am pretty sure netcdf-c will follow soon.
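The time-size tradeoff mentioned above is easy to measure. A small sketch using only stdlib zlib (the codec behind the existing zlib=True option), comparing compressed size and wall time across levels:

```python
import time
import zlib

def level_tradeoff(data: bytes, levels=(1, 6, 9)):
    """Return {level: (compressed_size, seconds)} for each zlib level,
    to quantify the time-size tradeoff on a given byte buffer."""
    results = {}
    for lvl in levels:
        t0 = time.perf_counter()
        compressed = zlib.compress(data, lvl)
        results[lvl] = (len(compressed), time.perf_counter() - t0)
    return results
```

Running this on a representative chunk of your own data is the quickest way to see whether a faster codec like blosc would pay off for a given archive.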

dopplershift commented 6 years ago

netCDF4-python cannot get ahead of the C library. Also, creating non-portable files isn't just a problem for the individual user; it can create support problems for the software projects involved. I'm not saying the libraries shouldn't work toward supporting better compression, but it's not as simple as just hacking something into netCDF4-python, IMO.

DennisHeimbigner commented 6 years ago

As an aside, we deliberately used filter ids rather than names because there is no standard naming for compression filters (i.e. your keyword args are your own invention). The id, however, is standardized: see https://support.hdfgroup.org/services/filters.html
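For concreteness, the ids of the filters discussed in this thread, as registered with the HDF Group (the id, not a name, is what identifies a filter on disk; the dict itself is just for illustration):

```python
# Filter ids from the public HDF Group filter registry.
HDF5_FILTER_IDS = {
    "bzip2": 307,
    "lzf": 32000,
    "blosc": 32001,
}
```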

jswhit commented 6 years ago

Seems like it would be pretty straightforward to add support in netcdf4-python for dynamically loaded filters installed in /usr/local/hdf5/plugins, using the new interfaces in netcdf 4.6.0. However, users will still have to figure out how to install the filters in the hdf5 plugin directory. I see that this looks pretty simple for Blosc (https://github.com/Blosc/hdf5-blosc) and LZF (https://github.com/h5py/h5py/tree/master/lzf), but I imagine this will be out of reach for most users.
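A user can at least check what is installed without touching the netCDF API. A stdlib-only sketch that lists shared libraries in the plugin search path (HDF5_PLUGIN_PATH is the real override variable; the default directory below is the one mentioned above and may differ per platform):

```python
import os

def hdf5_plugin_dirs():
    # HDF5_PLUGIN_PATH, if set, overrides the built-in default.
    path = os.environ.get("HDF5_PLUGIN_PATH", "/usr/local/hdf5/plugins")
    return path.split(os.pathsep)

def installed_filters(dirs=None):
    """Return shared-library filenames found in the plugin directories."""
    libs = []
    for d in (dirs if dirs is not None else hdf5_plugin_dirs()):
        if os.path.isdir(d):
            libs.extend(f for f in sorted(os.listdir(d))
                        if f.endswith((".so", ".dylib", ".dll")))
    return libs
```

An empty result means the filter plugins still need to be built and installed, which is the step most users will stumble on.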

DennisHeimbigner commented 6 years ago

Agreed, it is not trivial. I provide the bzip2 example in netcdf-c/examples/C/hdf5plugins, and I know that some people at NCAR were able to adapt it for a different compressor, fpzip. The key point is that the compression code must provide a block-oriented compress and decompress API. If .so files built for one Linux were usable across various versions of Linux, it would be possible to publish prebuilt .so files for various filters. Otherwise, they have to be built from source.
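To illustrate the "block oriented" requirement: each HDF5 chunk is (de)compressed independently, so the codec must expose a whole-buffer-in, whole-buffer-out pair rather than a streaming interface. A minimal Python sketch of that shape using stdlib bz2 (mirroring the bzip2 example filter, not the plugin's actual C signature):

```python
import bz2

def bzip2_compress_block(chunk: bytes, level: int = 9) -> bytes:
    # One call per chunk, no state carried between chunks -- the
    # contract an HDF5 filter plugin must satisfy.
    return bz2.compress(chunk, level)

def bzip2_decompress_block(chunk: bytes) -> bytes:
    return bz2.decompress(chunk)
```

Codecs designed around long-lived streaming state need a wrapper that resets per chunk before they can be packaged as an HDF5 filter.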

jswhit commented 6 years ago

Looks like https://github.com/nexusformat/HDF5-External-Filter-Plugins provides a relatively simple way to install bzip2, lz4 and blosc.