LLNL / H5Z-ZFP

A registered ZFP compression plugin for HDF5

Retrieving fixed accuracy parameters from ZFP encoded HDF5 datasets #105

Open leighorf opened 1 year ago

leighorf commented 1 year ago

Hello,

I have gone to great pains to carry fixed accuracy parameter metadata with all of my conversions of data that use ZFP. I often operate on ZFP compressed data and compress the results, and I want to make sure my final accuracy parameters are OK given the original accuracy parameters.

However, it occurs to me that at least for a saved ZFP encoded HDF5 dataset, it should be possible to open an HDF5 file with ZFP compressed data and retrieve the original floating point representation of the accuracy parameter for each dataset (I know it is possible to do this with the zfp library). It is not evident how to do this with the H5Z-ZFP interface, but that is what I desire: the ability to retrieve the ZFP fixed accuracy parameter of an H5Z-ZFP compressed HDF5 dataset.

markcmiller86 commented 1 year ago

This is a very reasonable request, @leighorf. That information is encoded, without loss, in the dataset's creation cd_values as the ZFP stream's header.

I think it's probably best to add a function to the library interface to H5Z-ZFP for this. It requires a combination of HDF5 and ZFP library calls.

In lieu of such a function, given an existing dataset id of dsid, I think it is possible to do something like...

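hid_t cpid = H5Dget_create_plist(dsid);
unsigned int flags;
size_t nelemts = 10;
unsigned cd_vals[10];
// trailing arguments (namelen, name, filter_config) are elided here
H5Pget_filter_by_id2(cpid, H5Z_FILTER_ZFP, &flags, &nelemts, cd_vals, ...);

// cd_vals contains, starting at entry index 1, the ZFP stream header. So, now, open that as a bitstream...
// (the buffer handed to stream_open excludes entry 0, hence nelemts - 1)
bitstream *dummy_bstr = stream_open(&cd_vals[1], (nelemts - 1) * sizeof(cd_vals[0]));
zfp_stream *dummy_zstr = zfp_stream_open(dummy_bstr);

// read the header so the stream picks up the stored parameters
zfp_field *dummy_fld = zfp_field_alloc();
zfp_read_header(dummy_zstr, dummy_fld, ZFP_HEADER_FULL);
uint dim = zfp_field_dimensionality(dummy_fld);

// now, query stream for info you seek...
zfp_mode zm = zfp_stream_compression_mode(dummy_zstr);
double rate = zfp_stream_rate(dummy_zstr, dim);
double accuracy = zfp_stream_accuracy(dummy_zstr);
uint precision = zfp_stream_precision(dummy_zstr);

// release everything that was opened or allocated above
zfp_field_free(dummy_fld);
zfp_stream_close(dummy_zstr);
stream_close(dummy_bstr);
H5Pclose(cpid);

markcmiller86 commented 1 year ago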

@brtnfld and @leighorf I am about 1/2 way through having this completed. Maybe a little more than that.

I just realized, however, I don't fully understand all the context(s) in which retrieving ZFP encoding params would be needed. Here are some of the ways I am thinking...

lindstro commented 1 year ago

I would suggest essentially duplicating the current zfp API for querying these parameters. It's probably not a good idea for the H5Z_zfp functions to do this in a slightly different way.

Another possibility is to piggyback on the zfp_config struct available as of zfp 1.0.0. Unfortunately, functions are currently missing for querying a config struct. This will be added to the next release.

markcmiller86 commented 1 year ago

You mean for querying an already compressed dataset?

I think if callers want to use the ZFP library interface, then all we should provide is a means to obtain a zfp_stream* object to use in those calls and they can just use them. In fact, that might be a better way to go since they have to link to ZFP either way to get that information.

Related to this, I just realized yesterday that H5Z-ZFP mode integers don't map 1:1 to ZFP's mode enums. For example, in H5Z-ZFP, a mode of 3 is accuracy mode whereas in ZFP it's 4.
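The mismatch can be made explicit with a small translation table. The sketch below mirrors both sets of mode values as local constants so that it stands alone. In real code they would come from the H5Z-ZFP and zfp headers, and the values here are an assumption that should be checked against the installed versions. `demo_h5z_to_zfp_mode` is a hypothetical helper, not part of either library.

```c
/* Mode integers as used by H5Z-ZFP (assumed to match its H5Z_ZFP_MODE_* macros). */
enum demo_h5z_mode {
    DEMO_H5Z_RATE = 1,
    DEMO_H5Z_PRECISION = 2,
    DEMO_H5Z_ACCURACY = 3,
    DEMO_H5Z_EXPERT = 4,
    DEMO_H5Z_REVERSIBLE = 5
};

/* Mode values as used by ZFP (assumed to match its zfp_mode enum). */
enum demo_zfp_mode {
    DEMO_ZFP_NULL = 0,
    DEMO_ZFP_EXPERT = 1,
    DEMO_ZFP_FIXED_RATE = 2,
    DEMO_ZFP_FIXED_PRECISION = 3,
    DEMO_ZFP_FIXED_ACCURACY = 4,
    DEMO_ZFP_REVERSIBLE = 5
};

/* Hypothetical helper: translate an H5Z-ZFP mode integer to the ZFP value.
 * Returns DEMO_ZFP_NULL for anything unrecognized. */
int demo_h5z_to_zfp_mode(int h5z_mode)
{
    switch (h5z_mode) {
    case DEMO_H5Z_RATE:       return DEMO_ZFP_FIXED_RATE;
    case DEMO_H5Z_PRECISION:  return DEMO_ZFP_FIXED_PRECISION;
    case DEMO_H5Z_ACCURACY:   return DEMO_ZFP_FIXED_ACCURACY;
    case DEMO_H5Z_EXPERT:     return DEMO_ZFP_EXPERT;
    case DEMO_H5Z_REVERSIBLE: return DEMO_ZFP_REVERSIBLE;
    default:                  return DEMO_ZFP_NULL;
    }
}
```

Keeping the translation in one function means a caller never compares an H5Z-ZFP mode integer directly against a zfp_mode value.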

lindstro commented 1 year ago

> You mean for querying an already compressed dataset?

Well, yes, but more generally getting a zfp_config struct from a zfp_stream. The C++ compressed-array class API allows you to set the compression parameters of a zfp_stream by passing a zfp_config, e.g., via const_array::set_config(const zfp_config &config), but the high-level C API currently lacks functions for setting/getting zfp_stream parameters via zfp_config.

> I think if callers want to use the ZFP library interface, then all we should provide is a means to obtain a zfp_stream* object to use in those calls and they can just use them. In fact, that might be a better way to go since they have to link to ZFP either way to get that information.

True, that might be a more general approach. I don't know if there are any cases where you manipulate a zfp_stream but H5Z-ZFP ignores those changes, which might lead to unexpected behavior. The execution policy is one such setting. We should discuss how we want to support that and other zfp_stream settings going forward.

> Related to this, I just realized yesterday that H5Z-ZFP mode integers don't map 1:1 to ZFP's mode enums. For example, in H5Z-ZFP, a mode of 3 is accuracy mode whereas in ZFP it's 4.

I don't think there's much we can do about that now without breaking things.

markcmiller86 commented 1 year ago

> It's probably not a good idea for the H5Z_zfp functions to do this in a slightly different way.

In this comment, were you basically speaking to how I proposed to handle the return values for error or n/a cases? If so, I agree.

lindstro commented 1 year ago

> It's probably not a good idea for the H5Z_zfp functions to do this in a slightly different way.

> In this comment, were you basically speaking to how I proposed to handle the return values for error or n/a cases? If so, I agree.

Right. The zfp library already has those same functions (with different names, of course), so it would make sense for H5Z-ZFP to just wrap those and use the same parameters and return values.

markcmiller86 commented 1 year ago

@leighorf I finally have a prototype implementation for this on branch feat-mcm86-04mar23-retrieve-zfp-params and wonder if you could take a look.

You can see an example of how it works for a dataset already written to a file here:

https://github.com/LLNL/H5Z-ZFP/blob/4ccd6aa406bbd2bb5f342de5de8016f3569ac8fb/test/test_read.c#L165-L212

If the caller knows nothing, it must first query for mode and then based on that, query for remaining params. If you know mode, you can avoid having to query twice. It is an error to query for zfp parameters that do not match the mode. So, if mode is accuracy but precision is queried, that will generate an error.
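The two-step protocol described above can be illustrated with a toy model. This is only a sketch of the calling pattern: `struct demo_plist` and the `demo_get_*` functions are hypothetical stand-ins, and the actual H5Pget_zfp_XXX() functions on the branch may have different names, signatures and error conventions.

```c
#include <stddef.h>

/* Toy mode values; hypothetical, chosen only for this sketch. */
enum demo_mode {
    DEMO_MODE_RATE = 1,
    DEMO_MODE_PRECISION = 2,
    DEMO_MODE_ACCURACY = 3
};

/* Toy stand-in for the state a creation property list would carry. */
struct demo_plist {
    int mode;
    double rate;
    unsigned precision;
    double accuracy;
};

/* Step 1: ask which mode is in effect. Returns 0 on success, -1 on error. */
int demo_get_mode(const struct demo_plist *pl, int *mode)
{
    if (pl == NULL || mode == NULL)
        return -1;
    *mode = pl->mode;
    return 0;
}

/* Step 2: ask for the parameter that belongs to that mode.
 * Asking for a parameter of a different mode is an error (-1). */
int demo_get_accuracy(const struct demo_plist *pl, double *accuracy)
{
    if (pl == NULL || accuracy == NULL || pl->mode != DEMO_MODE_ACCURACY)
        return -1;
    *accuracy = pl->accuracy;
    return 0;
}

int demo_get_precision(const struct demo_plist *pl, unsigned *precision)
{
    if (pl == NULL || precision == NULL || pl->mode != DEMO_MODE_PRECISION)
        return -1;
    *precision = pl->precision;
    return 0;
}
```

A caller that already knows the mode can skip step 1 and go straight to the matching getter, checking the return value to catch a mismatch.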

The caller is responsible for obtaining the desired dataset's creation property list id and passing that to H5Pget_zfp_XXX().

The implementation will handle any case: the property list is using bona fide HDF5 properties, the property list is using generic properties before the dataset has ever been written, or the dataset has been written.

markcmiller86 commented 1 year ago

@brtnfld I am just pinging you on this issue in case you wanted to have a look at the new functions I am working towards to retrieve ZFP compression parameters from a dataset's creation property list...

https://github.com/LLNL/H5Z-ZFP/blob/5baa4b9202dbe485ad3dc06037f6dd57db785bfd/src/H5Zzfp_props.c#L135-L382

lindstro commented 1 year ago

@markcmiller86 Just to make sure I understand how this is supposed to work, since the caller presumably does not already know what mode is, you should call H5Pget_zfp to first query the mode and then make a second call where you supply corresponding pointers to compression parameters?

As an alternative, zfp 1.0.0 supports zfp_config, which would allow you to make a single call to get all this information. zfp_config is not available pre 1.0.0, but it might be nice to have H5Pget_zfp_config() as an alternative way of querying the mode and parameters when H5Z-ZFP is built with zfp 1.0.0.
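To show what the single-call alternative buys the caller, here is a sketch built around a struct that mirrors the general shape of zfp_config (a mode tag plus a union of per-mode arguments). The struct is a local, simplified mirror written for illustration, not the zfp definition, and `demo_describe_config` is a hypothetical consumer; H5Pget_zfp_config() itself is only proposed in this thread.

```c
#include <stdio.h>

/* Simplified local mirror of a mode-tagged config (assumption: the real
 * zfp_config also carries expert-mode arguments, omitted here). */
enum demo_cfg_mode {
    DEMO_CFG_RATE,
    DEMO_CFG_PRECISION,
    DEMO_CFG_ACCURACY
};

struct demo_config {
    enum demo_cfg_mode mode;
    union {
        double rate;        /* valid when mode == DEMO_CFG_RATE */
        unsigned precision; /* valid when mode == DEMO_CFG_PRECISION */
        double tolerance;   /* valid when mode == DEMO_CFG_ACCURACY */
    } arg;
};

/* Format the one parameter that is valid for cfg->mode into buf.
 * Returns the snprintf result, or -1 for an unknown mode. */
int demo_describe_config(const struct demo_config *cfg, char *buf, size_t len)
{
    switch (cfg->mode) {
    case DEMO_CFG_RATE:
        return snprintf(buf, len, "rate=%g", cfg->arg.rate);
    case DEMO_CFG_PRECISION:
        return snprintf(buf, len, "precision=%u", cfg->arg.precision);
    case DEMO_CFG_ACCURACY:
        return snprintf(buf, len, "tolerance=%g", cfg->arg.tolerance);
    }
    return -1;
}
```

Because mode and parameter travel together, the caller makes one query and then switches on the tag, with no second call and no possibility of asking for a parameter that does not match the mode.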

lindstro commented 5 days ago

In addition to querying compression parameter settings through the library, it would be nice to have a command-line tool that decodes cd_values, i.e., that performs the inverse of what print_h5repack_farg does.
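A minimal sketch of the decoding half of such a tool follows. It handles only the cd_values a user would pass on the h5repack command line, not the cd_values stored in a written file (those hold the ZFP stream header and need the zfp library to parse). The layout is an assumption: entry 0 holds the H5Z-ZFP mode (1 rate, 2 precision, 3 accuracy), entry 2 holds the precision, and for rate and accuracy entries 2 and 3 hold the low and high 32-bit words of a double written on a little-endian machine. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decoded result; only the field matching `mode` is meaningful. */
struct demo_decoded {
    unsigned mode;      /* 1 = rate, 2 = precision, 3 = accuracy (assumed) */
    double rate;
    unsigned precision;
    double accuracy;
};

/* Rebuild a double from two 32-bit words, low word first
 * (assumes the producer was little-endian). */
static double demo_words_to_double(unsigned lo, unsigned hi)
{
    uint64_t bits = ((uint64_t)hi << 32) | (uint64_t)lo;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Hypothetical decoder for user-supplied cd_values.
 * Returns 0 on success, -1 on short input or an unsupported mode. */
int demo_decode_cd_values(const unsigned *cd, size_t n, struct demo_decoded *out)
{
    assert(cd != NULL && out != NULL);
    memset(out, 0, sizeof *out);
    if (n < 1)
        return -1;
    out->mode = cd[0];
    switch (cd[0]) {
    case 1: /* rate: double in entries 2..3 */
        if (n < 4) return -1;
        out->rate = demo_words_to_double(cd[2], cd[3]);
        return 0;
    case 2: /* precision: integer in entry 2 */
        if (n < 3) return -1;
        out->precision = cd[2];
        return 0;
    case 3: /* accuracy: double in entries 2..3 */
        if (n < 4) return -1;
        out->accuracy = demo_words_to_double(cd[2], cd[3]);
        return 0;
    default:
        return -1;
    }
}
```

For example, the values 3, 0, 3539053052, 1062232653 decode to accuracy mode with a tolerance of 0.001, since those two words are the low and high halves of the IEEE-754 double 0.001.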