conda-forge / netcdf4-feedstock

A conda-smithy repository for netcdf4.

Questions on compression #154

Open zklaus opened 1 year ago

zklaus commented 1 year ago

Comment:

I wanted to play around with the new compression options in netCDF. For those to whom this means anything, the background is that I would like to write suggestions/requirements for chunking, quantization, and compression into the next Data Request for CMIP7.

I expected to be able to do most of that with netcdf4 alone, but I found some surprises.

I wrote this little program to do some tests. It creates some random data, chunks it somewhat reasonably, and stores it raw as well as quantized and compressed with different compression methods. Running it in an environment created with `mamba create -n nc-comp-test-2 humanfriendly netCDF4 pandas`, only zlib and szip compression are available. I was notably surprised by the absence of zstd and bzip2 compression. I could make those available by installing the ccr package, but I was under the impression that at least zstd should be available with netcdf4 alone?
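For concreteness, here is a minimal sketch along those lines (not the actual script; the shape, chunk sizes, and compression levels are illustrative, and it assumes netCDF4 >= 1.6 for the `compression` keyword):

```python
# Minimal sketch of the benchmark described above (illustrative, not the
# original script). Assumes netCDF4 >= 1.6 for the `compression` keyword
# and libnetcdf >= 4.9 for `significant_digits` quantization.
import os
import time

import numpy as np
import netCDF4

data = np.random.random((10, 1000, 1000)).astype("f4")  # ~40 MB of float32

for compression, complevel in [(None, 0), ("zlib", 1),
                               ("zstd", 12), ("blosc_zstd", 4)]:
    fname = f"test_{compression}.nc"
    start = time.perf_counter()
    with netCDF4.Dataset(fname, "w") as nc:
        nc.createDimension("t", data.shape[0])
        nc.createDimension("y", data.shape[1])
        nc.createDimension("x", data.shape[2])
        var = nc.createVariable(
            "data", "f4", ("t", "y", "x"),
            compression=compression,
            complevel=complevel,
            chunksizes=(1, data.shape[1], data.shape[2]),  # one chunk per step
            significant_digits=4,  # lossy quantization before compression
        )
        var[:] = data
    elapsed = time.perf_counter() - start
    size_mb = os.path.getsize(fname) / 1e6
    print(f"{compression}: {size_mb:.2f} MB in {elapsed:.3f} s")
```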

I also tried the two variants blosc_zstd and blosc_zlib, which both ran with no exception, but didn't produce any compression at all. Here are some results from running the script:

| Compression | Level | File size | Time (s) | Compression ratio |
|-------------|-------|-----------|----------|-------------------|
| None        | 0     | 40.02 MB  | 0.933495 | 1.000000          |
| zlib        | 1     | 16.6 MB   | 1.450678 | 2.410651          |
| szip        | 4     | 18.09 MB  | 1.112253 | 2.212583          |
| zstd        | -4    | NaN       | 0.007567 | NaN               |
| zstd        | 12    | NaN       | 0.004192 | NaN               |
| blosc_zstd  | 4     | 40.02 MB  | 0.878917 | 1.000000          |
| blosc_zlib  | 4     | 40.02 MB  | 0.871195 | 1.000000          |

With ccr:

| Compression | Level | File size | Time (s) | Compression ratio |
|-------------|-------|-----------|----------|-------------------|
| None        | 0     | 40.02 MB  | 0.934911 | 1.000000          |
| zlib        | 1     | 16.6 MB   | 1.444798 | 2.410664          |
| szip        | 4     | 18.09 MB  | 1.182662 | 2.212521          |
| zstd        | -4    | 33.68 MB  | 1.006403 | 1.188157          |
| zstd        | 12    | 19.96 MB  | 2.793166 | 2.005012          |
| blosc_zstd  | 4     | 40.02 MB  | 0.936026 | 1.000000          |
| blosc_zlib  | 4     | 40.02 MB  | 0.934330 | 1.000000          |

So overall, my questions are:

- Shouldn't zstd (and bzip2) compression be available with netcdf4 alone, given that libnetcdf claims support for them?
- Why do the blosc_zstd and blosc_zlib variants run without an exception yet produce no compression at all?

PS: Of course, actual performance will depend on the nature of the data, but I'd like to make sure I understand how things should work technically.

ocefpaf commented 1 year ago

@zklaus I never used anything besides zlib, so I'm not an expert here. My guess is that maybe we should add those as hdf5 dependencies for netcdf-c to pick them up? Not sure. However, I'd prefer if we could get an error for an unavailable compression option rather than silent no-compression. This may be an upstream issue, though.

zklaus commented 1 year ago

Thing is, zstd and bz2 are dependencies of libnetcdf (see here) and support is claimed in the included libnetcdf.settings file.

ocefpaf commented 1 year ago

> Thing is, zstd and bz2 are dependencies of libnetcdf (see here) and support is claimed in the included libnetcdf.settings file.

Yep. That is why I think they should be in hdf5, at least for the netcdf4 format; maybe you can use them to compress netcdf-classic. Again, not sure, I did not test this. Just speculating. Let's ping an expert here (@dopplershift) for help.

dopplershift commented 1 year ago

I'm foggy on whether this needs support from HDF5. @WardF @DennisHeimbigner can you shed some light?

WardF commented 1 year ago

Zstandard requires libzstandard (libzstd?) to be installed on the system. There should be a bundled bz2 implementation to fall back on when one is not present. Let me dig into this.

WardF commented 1 year ago

Ah, I think I understand the question better now. Give me a few to get in front of a keyboard, instead of the GitHub app on my phone.

DennisHeimbigner commented 1 year ago

We do provide an internal implementation of bzip2, primarily for testing purposes. Assuming it is not too large, we could also do that for zstd, or we could replace bzip2 with zstd for testing. I suppose that creates a slippery slope (e.g. do we also add blosc?).

dopplershift commented 1 year ago

@DennisHeimbigner I think that's possible, but don't lose focus on the actual problem here: netcdf-c was compiled with support for bz2 and zstd from systems packages, libnetcdf.settings agrees, but the zstd support didn't work.

DennisHeimbigner commented 1 year ago

Then the issue would appear to be that our zstd detector in configure.ac is not working correctly. [Ward: did we not have a similar problem before where the presence of headers was not sufficient to assume the presence of the library, or something like that?]

WardF commented 1 year ago

@DennisHeimbigner We did, but in that instance, the issue was that the library was not being detected even though it installs. In this case, it appears that the library is detected, but the results are unexpected.

WardF commented 1 year ago

I'm poking around, but don't see the libnetcdf.settings file attached to this issue. Would it be possible for somebody to point me towards it?

DennisHeimbigner commented 1 year ago

BTW, does this problem occur when using CMake or when using Automake?

zklaus commented 1 year ago

I'm sorry, it seems we are in quite unfortunate relative time zones.

Conda-forge builds libnetcdf with CMake (see here).

I did a bit more poking myself and came away with the impression that we need not only libnetcdf proper and the various compression libraries, but also the corresponding plugins. They are part of libnetcdf, but the conda-forge build does not install them at the moment. I have opened conda-forge/libnetcdf-feedstock#172 to change that, and with the build from there (downloaded via the artifacts and installed as a local package) things work as expected.

If it's true that we need the plugins, then the problem is the detection in the CMake file and the somewhat bumpy workflow overall: the HDF5 library has a default plugin dir but also an environment variable, and the relationship between the two is not so clear. I did not figure out a super elegant way to detect the default dir, opting in the end to extract it from the H5pubconf.h file.
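For reference, that extraction can be done along these lines (the macro name H5_DEFAULT_PLUGINDIR is what I found in my copy of H5pubconf.h; treat both it and the header path as assumptions):

```python
# Sketch: recover HDF5's compiled-in default plugin directory from
# H5pubconf.h. Macro name and header location are assumptions based on
# one particular HDF5 install.
import os
import re

header = os.path.join(os.environ.get("CONDA_PREFIX", "/usr/local"),
                      "include", "H5pubconf.h")
with open(header) as f:
    match = re.search(r'#define\s+H5_DEFAULT_PLUGINDIR\s+"([^"]*)"', f.read())
print(match.group(1) if match else "macro not found")
```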

So to cut a long ramble short: Is it correct that we need the plugins? If so, let's discuss in the libnetcdf feedstock how exactly we want to install them.

For the blosc ones, it would probably be good to get an error similar to the other ones instead of a silent failure to compress; if you agree with that, we should open an issue upstream.

DennisHeimbigner commented 1 year ago

Our CMakeLists.txt file uses a module in cmake/modules to locate a number of libraries, including zstd. The code for those modules was taken from the web. It is distinctly possible that the zstd module is deficient. If someone can find a better module, or provide fixes for the current one, please let us know.

WardF commented 1 year ago

To reference back to Dennis' earlier comment, it may also be that libzstd-dev needs to be installed. In my (frustrating, frustrated) experience, some systems package the necessary header files in libzstd, while others require libzstd-dev.

ocefpaf commented 1 year ago

In theory conda-forge should have all of those. We don't usually split packages like that.

WardF commented 1 year ago

Glad to hear that; I'm splitting my attention between this and some reported s390x issues (amongst other things), so I haven't dug into the provided logs yet, but I wanted to mention it and walk back my earlier statement, since that is not the issue. Glad we can eliminate it as a possibility here!

dopplershift commented 1 year ago

@zklaus I think it was glossed over in the responses, but I'm pretty sure you're correct that the problem is that the plugins aren't being installed. There are a variety of issues on the Unidata netcdf-c repository about the plugin directory, though the problem here sounds exactly like Unidata/netcdf-c#2294.

zklaus commented 1 year ago

Thanks, @dopplershift. It does sound similar, though it seems to deal more with the autotools build and maybe the filter isn't even built there?

For me, there are three issues:

Silent non-compression on blosc_*

I now think this is because blosc was not a dependency of the libnetcdf build. Consequently, the filter was not built at all and HAVE_BLOSC is false, leading us to https://github.com/Unidata/netcdf-c/blob/43abd699e19db24e27e1e800086ff8142f3b07ad/libdispatch/dfilter.c#L502.

On the other hand, zstd was named as a dependency and detected by the CMake build, which built the corresponding filter and set HAVE_ZSTD to true, but did not install the filter. Consequently, we end up at https://github.com/Unidata/netcdf-c/blob/43abd699e19db24e27e1e800086ff8142f3b07ad/libdispatch/dfilter.c#L444 instead.

Personally, I would prefer to get an error if a compression option is requested that was not built into the library, but at least I understand what happened.
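As an aside, netCDF4-python appears to expose module-level flags mirroring these compile-time settings (the attribute names are my reading of the 1.6 sources, so treat them as an assumption). Note they cannot tell you whether the runtime plugin is actually installed, which is exactly the zstd trap above:

```python
# Hedged check of which filters libnetcdf was *built* with, via flags that
# netCDF4-python (>= 1.6, if I read the sources right) sets at import time.
# A True value does not guarantee the runtime plugin is installed.
import netCDF4

for flag in ("__has_zstandard_support__", "__has_bzip2_support__",
             "__has_blosc_support__", "__has_szip_support__"):
    print(flag, getattr(netCDF4, flag, "not exposed by this version"))
```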

CMake build quirks

The detection of dependencies (zstd, blosc, ...) seems to work correctly. The logic around the plugin path https://github.com/Unidata/netcdf-c/blob/43abd699e19db24e27e1e800086ff8142f3b07ad/CMakeLists.txt#L1155-L1215 does not. I also think the default should be to install the plugins. Apart from possible clarification in the documentation on which CMake parameters should be used, better detection of the default install path might be good. As I mentioned, I think the best available option is to extract it from H5pubconf.h, though that may be something to take up with the HDF5 guys.

Strategy for plugin installation

Perhaps it would be better not to install the plugins into the HDF5 default directory. It seems quite likely that people will want to have conflicting plugins in there, so IMHO netCDF should make sure that it gets versions of the plugins compatible with its own build by installing them to, say, $PREFIX/lib/netcdf-plugins/{hdf5,nczarr,} and prepending that path using the H5PL interface.
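From the user side, the effect would be something like the sketch below; the directory layout is the hypothetical one suggested above, and I use the HDF5_PLUGIN_PATH environment variable as a stand-in for the H5PL calls:

```python
# Sketch: prepend a (hypothetical) netCDF plugin directory to HDF5's search
# path before HDF5 initializes, so its filters win over conflicting ones.
import os

plugin_dir = os.path.join(os.environ.get("CONDA_PREFIX", "/usr/local"),
                          "lib", "netcdf-plugins", "hdf5")
existing = os.environ.get("HDF5_PLUGIN_PATH")
os.environ["HDF5_PLUGIN_PATH"] = (
    plugin_dir if not existing else plugin_dir + os.pathsep + existing
)

import netCDF4  # import only after setting the path so HDF5 sees it
```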

edwardhartnett commented 2 weeks ago

Hopefully @zklaus you got your zstd problems resolved and were able to do your benchmarks; I would be very interested in the results. (In my testing, I try to use real data rather than generated random numbers for compression. Random numbers, unless constrained in some way, will be all over the map and not very compressible. That doesn't match real science data, where, for example, a 4D field of atmospheric pressure will generally have values close to their neighbors, which is much more suitable for compression.)

I've just taken another swing at the plugin install situation in the CMake and Autotools builds. The fact remains that you need to specify the plugin_dir configure/CMake option at configure time, and also set it at run time in HDF5_PLUGIN_PATH. I have added documentation about this, which will be part of the 4.9.3 release.

Let me know if you have further troubles with this after the upcoming 4.9.3 release.

One open question at the moment is whether netcdf should remember this choice and notify HDF5 where your netCDF plugins are, without the need to set HDF5_PLUGIN_PATH...

zklaus commented 2 weeks ago

Thanks for getting back to me, @edwardhartnett. I no longer work in climate science (but rather more directly with conda-forge, conda, and other packaging-related things at Quansight), so I am afraid I won't have much more to contribute here.

I do agree with you that real-world data is often far from random, and I have preferred actual data in my experiments as well.