LLNL / H5Z-ZFP

A registered ZFP compression plugin for HDF5

The purpose of the set_local function #142

Open shaomeng opened 6 months ago

shaomeng commented 6 months ago

As a fellow HDF5 plugin developer, I'd like a little more information on what the set_local function does, and what purposes it serves. I'm confused because anyone invoking the plugin is already specifying compression parameters during the H5Pset_filter() call, so it seems to me that the set_local() function doesn't add any extra value.

I do notice that this page says that the cd_values[] passed in during H5Pset_filter() are modified. But the HDF5 documentation specifies that the set_local() function receives a private copy of the dataset creation property list and modifies that copy. What effect does the modification have, then, if it's applied to a private copy?

I appreciate any discussion on this topic!

markcmiller86 commented 6 months ago

I believe the set_local method is used only during an H5Dcreate() call, as an optional opportunity to set up whatever the filter may need depending on things like the dataset's datatype class and/or size. For example, you can see what the HDF5 library itself does in set_local for the SZIP filter...

https://github.com/HDFGroup/hdf5/blob/develop/src/H5Zszip.c#L114-L238

or in the NBIT filter

https://github.com/HDFGroup/hdf5/blob/develop/src/H5Znbit.c#L749-L904

In H5Z-ZFP, we convert the parameters specified by the user (either via generic cd_values or via the properties interface) to the ZFP stream header, and it is the ZFP stream header that gets stored as part of the dataset header. We store it once per dataset rather than once per chunk because, in the initial versions of the filter, we (or maybe it was just me...I don't think @lindstro cared too much) were worried that the stream header could dominate HDF5 chunk overheads. But that concern turns out to be unrealistic because nobody tends to run with really tiny chunks (and they shouldn't either, due to the impact on I/O performance). So, in future versions of the filter, we may wind up just storing a separate ZFP stream header for each chunk.

So, set_local() can be a no-op and for HDF5's built-in deflate filter, it is...

https://github.com/HDFGroup/hdf5/blob/develop/src/H5Zdeflate.c#L40-L42
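On the "private copy" question, a toy model may help. The sketch below does not use the HDF5 API at all: `toy_dcpl`, `toy_dataset`, `toy_set_local` and `toy_create_dataset` are hypothetical names invented for illustration, and the rule "append the element size" is just an assumed example of a dataset-dependent parameter. It only shows the general pattern: dataset creation copies the caller's property list, lets the callback edit that copy, and keeps the copy with the dataset, so the edits persist even though the caller's list is untouched.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CD 8

/* Hypothetical stand-in for the filter entry in a dataset creation property list. */
typedef struct {
    size_t   cd_nelmts;
    unsigned cd_values[TOY_MAX_CD];
} toy_dcpl;

/* Hypothetical dataset: it owns the copy of the property list that gets stored. */
typedef struct {
    toy_dcpl dcpl;
    size_t   elem_size;
} toy_dataset;

typedef int (*toy_set_local_fn)(toy_dcpl *private_copy, size_t elem_size);

/* Example callback: record a dataset-dependent value (here, the element size)
 * so that decoding does not depend on the user supplying it. */
static int toy_set_local(toy_dcpl *private_copy, size_t elem_size)
{
    if (private_copy->cd_nelmts >= TOY_MAX_CD)
        return -1;
    private_copy->cd_values[private_copy->cd_nelmts++] = (unsigned)elem_size;
    return 0;
}

/* Models dataset creation: copy the user's list, then let the filter adjust the copy. */
static int toy_create_dataset(toy_dataset *ds, const toy_dcpl *user_dcpl,
                              size_t elem_size, toy_set_local_fn set_local)
{
    assert(ds != NULL && user_dcpl != NULL);
    ds->dcpl = *user_dcpl;      /* private copy, stored with the dataset */
    ds->elem_size = elem_size;
    if (set_local == NULL)      /* a filter with no set_local leaves the copy as-is */
        return 0;
    return set_local(&ds->dcpl, elem_size);
}
```

In a real filter the callback would presumably read and rewrite the parameters through calls such as H5Pget_filter_by_id() and H5Pmodify_filter() on the dcpl_id it is handed, rather than touching a struct directly.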

lindstro commented 6 months ago

> we (or maybe it was just me...I don't think @lindstro cared too much) were worried that the stream header could dominate HDF5 chunk overheads.

I don't want to derail the discussion, but when H5Z-ZFP was first developed some 10 years ago, I was indeed concerned about keeping the zfp header short in case we wanted to store it per HDF5 chunk (anticipating small chunks). So we devised an efficient way of encoding (common) compression parameters (compression mode + rate/precision/accuracy) and array metadata (dimensions, scalar type) in a single 64-bit word. Even if this turned out not to be important for H5Z-ZFP in the end, I've always envisioned other applications where you want to spatially adapt compression settings (e.g., to keep high accuracy only around features of interest), and our compact metadata encoding allows you to do that.
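For readers unfamiliar with this kind of encoding, here is a rough sketch of packing compression parameters and array metadata into one 64-bit word. The bit layout (2 bits scalar type, 2 bits rank, three 16-bit dimensions, 12 bits for the mode/parameter code) and the names `toy_params`, `toy_pack` and `toy_unpack` are assumptions made up for this example; they are not taken from the zfp source, so use zfp's own header routines for the real format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical parameter set; field widths are chosen only so that they sum to 64 bits. */
typedef struct {
    unsigned scalar_type;  /* 0..3 */
    unsigned rank;         /* 1..3 */
    unsigned dims[3];      /* each 1..65536, stored as dim - 1 */
    unsigned mode;         /* 12-bit code for mode + rate/precision/accuracy */
} toy_params;

/* Pack: bits 0-1 type, 2-3 rank, 4-51 three 16-bit dims, 52-63 mode. */
static uint64_t toy_pack(const toy_params *p)
{
    assert(p->scalar_type < 4 && p->rank >= 1 && p->rank <= 3 && p->mode < 4096);
    uint64_t w = (uint64_t)p->scalar_type | ((uint64_t)p->rank << 2);
    for (unsigned i = 0; i < 3; i++) {
        unsigned d = (i < p->rank) ? p->dims[i] : 1;  /* unused dims stored as 1 */
        assert(d >= 1 && d <= 65536);
        w |= (uint64_t)(d - 1) << (4 + 16 * i);
    }
    w |= (uint64_t)p->mode << 52;
    return w;
}

/* Unpack: exact inverse of toy_pack. */
static toy_params toy_unpack(uint64_t w)
{
    toy_params p;
    p.scalar_type = (unsigned)(w & 0x3);
    p.rank = (unsigned)((w >> 2) & 0x3);
    for (unsigned i = 0; i < 3; i++)
        p.dims[i] = (unsigned)((w >> (4 + 16 * i)) & 0xFFFF) + 1;
    p.mode = (unsigned)((w >> 52) & 0xFFF);
    return p;
}
```

Because everything fits in 8 bytes, a word like this could be attached to each chunk or region, which is what makes spatially varying settings cheap to describe.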

markcmiller86 commented 6 months ago

> I don't think @lindstro cared too much...

Sorry about that wording. What I was trying to say is that you probably already had figured out that the HDF5 chunks would have to have been mighty small before the ZFP stream header would become an issue...that isn't something I actually sat down to calculate until after I had already coded that aspect of the filter.
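For anyone who wants to redo that calculation, here is a back-of-the-envelope sketch. It assumes an 8-byte header (the single 64-bit word mentioned earlier in the thread) and a fixed compressed rate in bits per value; the function name `header_overhead_fraction` and both assumptions are mine for illustration, not something the filter provides.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of a stored chunk taken up by a per-chunk header, assuming
 * fixed-rate compression at bits_per_value bits for each array element. */
static double header_overhead_fraction(size_t header_bytes, size_t chunk_elems,
                                       double bits_per_value)
{
    assert(chunk_elems > 0 && bits_per_value > 0.0);
    double payload_bytes = (double)chunk_elems * bits_per_value / 8.0;
    return (double)header_bytes / ((double)header_bytes + payload_bytes);
}
```

Under these assumptions, a 4x4x4 chunk (64 values) at 8 bits per value gives 8 / (8 + 64), roughly 11%, while a 64x64x64 chunk (262144 values) at the same rate gives 8 / (8 + 262144), roughly 0.003%.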

shaomeng commented 6 months ago

I didn't realize that cd_values[] are stored per dataset rather than per chunk, and I now better understand how they are used in the case of H5Z-ZFP. I really appreciate the discussion!