fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/

FillValue issues with kerchunk: 1e37 => 9.99999993e+36 #177

Closed rsignell-usgs closed 2 years ago

rsignell-usgs commented 2 years ago

In my original NetCDF4 file (HDF5 under the hood), the _FillValue attribute is set to 1e+37 for 32-bit floats, and when the file is loaded into xarray with decode_cf=True, data values of 1e37 are correctly interpreted as fill values and set to NaN.

In the kerchunk JSON, however, _FillValue is set to 9.99999993e+36, and when loaded into xarray with decode_cf=True, the fill values are not interpreted as such; they show up as 9.99999993e+36 (not 1e37 and not NaN).

Pretty confused about what is going on here!

Here is a Reproducible Notebook

cc @peterm790

martindurant commented 2 years ago

My hunch was that this is due to writing the value as text in the zarr attributes. But that raises the question of how this is normally handled outside of kerchunk, and why decode_cf matters.

martindurant commented 2 years ago

@rsignell-usgs , do you mind using xarray's to_zarr on that dataset to see what .zarray/.zattrs it creates? You can set the output chunk store (as opposed to the metadata store) to be a dict or even a custom dict that doesn't store anything (see below) to avoid writing things.

class NoOpDict(dict):
    """A dict that silently discards writes, so no chunk data is ever stored."""
    def __setitem__(self, *_):
        pass

metadata_store = {}       # collects .zarray/.zattrs for inspection
chunk_store = NoOpDict()  # discards the actual chunk bytes
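
For reference, a minimal sketch of wiring those two stores into to_zarr (assuming ds is the dataset opened from the original NetCDF file; the filename is illustrative):

import xarray as xr

ds = xr.open_dataset("coawst_us_20220101_01.nc")  # hypothetical local copy

# metadata goes into the dict; chunk writes are silently discarded
ds.to_zarr(store=metadata_store, chunk_store=chunk_store, consolidated=False)

# inspect what xarray wrote for the temp variable
print(metadata_store["temp/.zarray"].decode())
print(metadata_store["temp/.zattrs"].decode())
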
keewis commented 2 years ago

this may be #105: if you look at the NetCDF file with h5py, you can see that _FillValue is array([1.e+37], dtype=float32) while the fillvalue property has a value of 0.0

rsignell-usgs commented 2 years ago

@keewis, what is "fillvalue"? I've never heard of that attribute being used for fill values. That must be an HDF5-specific thing?

@martindurant , when I write Zarr, it's even more confusing. The fill_value is set to 9.999999e+36, but when read back into xarray with decode_cf=False, the value is 1e37. ????

See cells [9] and [11] here: https://nbviewer.org/gist/2400089529be8bc3924aacfd3b2485e6

keewis commented 2 years ago

That must be an HDF5-specific thing?

yes, exactly. You can examine it with

In [1]: import fsspec
   ...: import h5py
   ...: 
   ...: fs = fsspec.filesystem(
   ...:     "s3", anon=True, client_kwargs={"endpoint_url": "https://mghp.osn.xsede.org"}
   ...: )
   ...: url = "s3://rsignellbucket1/COAWST/coawst_us_20220101_01.nc"
   ...: 
   ...: f = h5py.File(fs.open(url))
   ...: f["temp"].fillvalue
Out[1]: 0.0

For float values it is used to set zarr's fill_value, which AFAIU the zarr engine will use to replace _FillValue.

You might have to synchronise fill_value with _FillValue in the references (e.g. by passing the hacky function from https://github.com/fsspec/kerchunk/issues/105#issuecomment-982834456 to MultiZarrToZarr's preprocess). For example:

In [1]: import fsspec
   ...: 
   ...: fs = fsspec.filesystem(
   ...:     "s3", anon=True, client_kwargs={"endpoint_url": "https://mghp.osn.xsede.org"}
   ...: )
   ...: url = "s3://rsignellbucket1/COAWST/coawst_us_20220101_01.nc"
   ...: 
   ...: import kerchunk.hdf
   ...: 
   ...: hdf = kerchunk.hdf.SingleHdf5ToZarr(fs.open(url), url=url)
   ...: references = hdf.translate()

In [2]: import xarray as xr
   ...: 
   ...: r_opts = dict(anon=True, client_kwargs={"endpoint_url": "https://mghp.osn.xsede.org"})

In [3]: fs = fsspec.filesystem(
   ...:     "reference", fo=references, remote_protocol="s3", remote_options=r_opts
   ...: )
   ...: m = fs.get_mapper("")
   ...: 
   ...: ds = xr.open_dataset(
   ...:     m,
   ...:     engine="zarr",
   ...:     chunks={},
   ...:     backend_kwargs=dict(consolidated=False),
   ...:     decode_cf=False,
   ...: )

In [4]: ds.temp.attrs
Out[4]: 
{'_FillValue': 0.0,
 'coordinates': 'lon_rho lat_rho s_rho ocean_time',
 'field': 'temperature, scalar, series',
 'grid': 'grid',
 'location': 'face',
 'long_name': 'potential temperature',
 'time': 'ocean_time',
 'units': 'Celsius'}

In [5]: ds.temp[0, 0, 0, 0].values
Out[5]: array(1.e+37, dtype=float32)

In [6]: decoded = xr.decode_cf(ds)
   ...: decoded.temp[0, 0, 0, 0].values
Out[6]: array(1.e+37, dtype=float32)

In [7]: import ujson
   ...: import itertools
   ...: 
   ...: 
   ...: def correct_fill_values(data):
   ...:     def fix_variable(values):
   ...:         zattrs = values[".zattrs"]
   ...: 
   ...:         if "_FillValue" not in zattrs:
   ...:             return values
   ...: 
   ...:         _FillValue = zattrs["_FillValue"]
   ...:         if values[".zarray"]["fill_value"] != _FillValue:
   ...:             values[".zarray"]["fill_value"] = _FillValue
   ...: 
   ...:         return values
   ...: 
   ...:     refs = data["refs"]
   ...:     prepared = (
   ...:         (tuple(key.split("/")), value) for key, value in refs.items() if "/" in key
   ...:     )
   ...:     filtered = (
   ...:         (key, ujson.loads(value))
   ...:         for key, value in prepared
   ...:         if key[1] in (".zattrs", ".zarray")
   ...:     )
   ...:     key = lambda i: i[0][0]
   ...:     grouped = (
   ...:         (name, {n[1]: v for n, v in group})
   ...:         for name, group in itertools.groupby(sorted(filtered, key=key), key=key)
   ...:     )
   ...:     fixed = ((name, fix_variable(var)) for name, var in grouped)
   ...:     flattened = {
   ...:         f"{name}/{item}": ujson.dumps(data, indent=4)
   ...:         for name, var in fixed
   ...:         for item, data in var.items()
   ...:     }
   ...:     data["refs"] = dict(sorted((refs | flattened).items()))
   ...:     return data

In [8]: corrected_refs = correct_fill_values(references)

In [9]: fs = fsspec.filesystem(
   ...:     "reference", fo=corrected_refs, remote_protocol="s3", remote_options=r_opts
   ...: )
   ...: m = fs.get_mapper("")
   ...: 
   ...: ds = xr.open_dataset(
   ...:     m,
   ...:     engine="zarr",
   ...:     chunks={},
   ...:     backend_kwargs=dict(consolidated=False),
   ...:     decode_cf=False,
   ...: )
   ...: display(ds.temp.attrs, ds.temp[0, 0, 0, 0].values)
{'_FillValue': 1e+37,
 'coordinates': 'lon_rho lat_rho s_rho ocean_time',
 'field': 'temperature, scalar, series',
 'grid': 'grid',
 'location': 'face',
 'long_name': 'potential temperature',
 'time': 'ocean_time',
 'units': 'Celsius'}
array(1.e+37, dtype=float32)

In [10]: decoded = xr.decode_cf(ds)
    ...: decoded.temp[0, 0, 0, 0].values
Out[10]: array(nan, dtype=float32)
keewis commented 2 years ago

The fill_value is set to 9.999999e+36, but when read back into xarray with decode_cf=False, the value is 1e37

That might be a precision / formatting issue:

In [33]: np.float32(9.999999933815813e36)
Out[33]: 1e+37

In [34]: np.float64(9.999999933815813e36)
Out[34]: 9.999999933815813e+36
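
Going the other way shows where the JSON value comes from: widening the nearest float32 back to a 64-bit Python float reproduces the long decimal (a quick sketch):

In [35]: import numpy as np

In [36]: float(np.float32(1e37))  # the float32 closest to 1e37, viewed as a 64-bit float
Out[36]: 9.999999933815813e+36
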
peterm790 commented 2 years ago

I have set up a reproducible notebook here manipulating some fill_values: https://gist.github.com/peterm790/6e6ddb2df0f095056837d21c531bbe57

It seems that only fill_value in .zarray is considered by xarray when opening as zarr from a kerchunk reference file and _FillValue in .zattrs is not considered.

The fill_value in @rsignell-usgs's example is 0 (which is incorrect), so I am not sure if manually setting it to 1e37 would correctly render NaN, but it might be worth investigating. This example didn't reproduce the 1e37 => 9.99999993e+36 conversion despite setting dtype = np.float32.
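
One quick way to test that hypothesis would be to patch a single .zarray entry in the references dict before opening it (a hypothetical sketch; references is the dict returned by SingleHdf5ToZarr.translate(), as in the session above):

import ujson

# load, patch, and re-serialize the .zarray metadata for one variable
zarray = ujson.loads(references["refs"]["temp/.zarray"])
zarray["fill_value"] = 1e37  # force the array-level fill value
references["refs"]["temp/.zarray"] = ujson.dumps(zarray)
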

keewis commented 2 years ago

just for reference, the code at the end of https://github.com/fsspec/kerchunk/issues/177#issuecomment-1156746064 successfully changes the references to work as expected (i.e. you get nan values instead of the fill value with decode_cf=True)

martindurant commented 2 years ago

@keewis , you mean the correct_fill_values function?

keewis commented 2 years ago

yes, although the implementation feels a bit clunky

rsignell-usgs commented 2 years ago

@rabernat, I hate to drag you into everything, but I bet you will immediately know what to do here, as you've been involved with xarray, the CF conventions, and the xarray zarr writer.

The whole issue is explained in this Reproducible Notebook (no credentials needed, as it uses OSN!)

rabernat commented 2 years ago

Rather than using string formatting to assess equality, I would be looking at raw bytes, e.g.

actual_value = ds.temp[0,0,0,0].data
actual_value_bytes = actual_value.tobytes()
assert actual_value_bytes == b'\xc2\xbd\xf0|'
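
For context, those expected bytes are just the IEEE 754 float32 encoding of 1e37 in native (little-endian on most machines) byte order:

import numpy as np

# float32 encoding of 1e37, native byte order
assert np.float32(1e37).tobytes() == b'\xc2\xbd\xf0|'
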

I dug into this and noticed that the kerchunk dataset is converting the fill value to a float64 type, while the original (working) zarr on disk is converting it to a float32 (same as original netCDF).

Digging into this reveals some pretty weird things about how netCDF, Zarr, and Xarray handle fill values. Note, for example, that the on-disk zarr does not actually keep the fill value in the attributes at all:

! cat foo.zarr/temp/.zattrs
{
    "_ARRAY_DIMENSIONS": [
        "ocean_time",
        "s_rho",
        "eta_rho",
        "xi_rho"
    ],
    "coordinates": "lon_rho lat_rho s_rho ocean_time",
    "field": "temperature, scalar, series",
    "grid": "grid",
    "location": "face",
    "long_name": "potential temperature",
    "time": "ocean_time",
    "units": "Celsius"
}

Compare this to both the Kerchunk Zarr and the original NetCDF. Instead, the fill value lives in the Zarr array metadata (.zarray):

! cat foo.zarr/temp/.zarray
{
    "chunks": [
        1,
        4,
        84,
        448
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": "<f4",
    "fill_value": 9.999999933815813e+36,
    "filters": null,
    "order": "C",
    "shape": [
        2,
        16,
        336,
        896
    ],
    "zarr_format": 2
}

Thus a key problem here is that there are redundant / overlapping mechanisms for specifying fill values. When we implemented the Zarr backend, we decided to lift the Zarr array fill_value into the dataset attributes when opening a new zarr array: https://github.com/pydata/xarray/blob/d7931f9014a26e712ff5f30c4082cf0261f045d3/xarray/backends/zarr.py#L453-L456 (You can also see how writing is handled here.)
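
Paraphrased, the open-time logic at that link does roughly this (a sketch, not the literal xarray source):

def lift_fill_value(zarr_array, attributes):
    # promote the Zarr array-level fill_value into the
    # CF-style _FillValue attribute when opening a store
    if zarr_array.fill_value is not None:
        attributes["_FillValue"] = zarr_array.fill_value
    return attributes
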

I am no longer sure that this was the right choice. The zarr fill_value is defined as:

fill_value: A scalar value providing the default value to use for uninitialized portions of the array, or null if no fill_value is to be used.

It is only ever used by zarr if you request data from a chunk that has not been initialized. It is NOT a mask. This is not quite the same as the NUG _FillValue:

The _FillValue attribute specifies the fill value used to pre-fill disk space allocated to the variable. Such pre-fill occurs unless nofill mode is set using nc_set_fill(). The fill value is returned when reading values that were never written. If _FillValue is defined then it should be scalar and of the same type as the variable. If the variable is packed using scale_factor and add_offset attributes (see below), the _FillValue attribute should have the data type of the packed data.

It is not necessary to define your own _FillValue attribute for a variable if the default fill value for the type of the variable is adequate. However, use of the default fill value for data type byte is not recommended. Note that if you change the value of this attribute, the changed value applies only to subsequent writes; previously written data are not changed.

Generic applications often need to write a value to represent undefined or missing values. The fill value provides an appropriate value for this purpose because it is normally outside the valid range and therefore treated as missing when read by generic applications. It is legal (but not recommended) for the fill value to be within the valid range.

The kerchunk Zarr, meanwhile, has the default fill_value of 0.0 in .zarray:

print(fs.cat('temp/.zarray').decode("utf-8"))
{
    "chunks": [
        1,
        8,
        168,
        448
    ],
    "compressor": {
        "id": "zlib",
        "level": 1
    },
    "dtype": "<f4",
    "fill_value": 0.0,
    "filters": [
        {
            "elementsize": 4,
            "id": "shuffle"
        }
    ],
    "order": "C",
    "shape": [
        12,
        16,
        336,
        896
    ],
    "zarr_format": 2
}

Okay, so why does any of this matter? Because whether we use the zarr array's internal fill_value or the attribute _FillValue seems to affect how these are decoded from JSON to binary data. Observe:

z_array = zarr.open("foo.zarr")["temp"]
on_disk_zarr_fill_value = z_array.fill_value
assert z_array[0, 0, 0, 0] == on_disk_zarr_fill_value
on_disk_zarr_fill_value_bytes = on_disk_zarr_fill_value.tobytes()
print(type(on_disk_zarr_fill_value), on_disk_zarr_fill_value_bytes)
# -> <class 'numpy.float32'> b'\xc2\xbd\xf0|'

but

kerchunk_z_array = zarr.open(m)["temp"]
kerchunk_attrs_fill_value = kerchunk_z_array.attrs["_FillValue"]
print(type(kerchunk_attrs_fill_value))  # -> float

However, I am mystified by why this still works:

assert kerchunk_z_array[0, 0, 0, 0] == kerchunk_attrs_fill_value

... but cf-decoding does not.

I am currently looking into the xarray coding code to understand what is going wrong.

In the meantime, one thing you might try is to modify the kerchunk dataset to use the zarr Array fill_value property, instead of the CF-style attribute.

rabernat commented 2 years ago

Ok something even more obvious.

dsk = xr.open_dataset(m, engine="zarr", chunks={}, 
                     backend_kwargs=dict(consolidated=False), decode_cf=False)
dsk.temp._FillValue  # -> 0.0

So the zarr array fill_value is actually overwriting the fill value already present in the attributes when Xarray opens the store! Manually overriding does the trick:

dsk.temp.attrs['_FillValue'] = kerchunk_attrs_fill_value
xr.decode_cf(dsk).temp[0, 0, 0, 0].values  # -> NaN

This may be an xarray bug. The problem is ultimately the duplicate and inconsistent fill_values.

rabernat commented 2 years ago

This is the specific place where the overwriting happens: https://github.com/pydata/xarray/blob/d7931f9014a26e712ff5f30c4082cf0261f045d3/xarray/backends/zarr.py#L455-L456

So there are two simultaneous paths forward:

- make kerchunk set the .zarray fill_value to match the _FillValue attribute in the references
- make Xarray stop silently overwriting an existing _FillValue attribute when opening Zarr

rsignell-usgs commented 2 years ago

@rabernat you are awesome! Thanks for showing us the path(s)!

keewis commented 2 years ago

Regardless of that, kerchunk should strive to not produce inconsistent data (different values of the array fill_value and the attribute _FillValue).

To be fair, the original file was already inconsistent (the HDF5 fill_value property has a different value from the _FillValue attribute – that's where the 0.0 comes from), so all kerchunk did was translate that to zarr.

That said, I agree that xarray should do something when it finds inconsistent fill values, in any file format.

rsignell-usgs commented 2 years ago

@keewis , you said the original NetCDF4 file had fill_value set, but I did:

 h5dump -H coawst_us_20220101_13.nc > h5dump.txt

and searching the resulting file for fill_value comes up empty.

The temp variable looks like this:

   DATASET "temp" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 12, 16, 336, 896 ) / ( H5S_UNLIMITED, 16, 336, 896 ) }
      ATTRIBUTE "DIMENSION_LIST" {
         DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
         DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      }
      ATTRIBUTE "_FillValue" {
         DATATYPE  H5T_IEEE_F32LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      }
      ATTRIBUTE "coordinates" {
         DATATYPE  H5T_STRING {
            STRSIZE 32;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "field" {
         DATATYPE  H5T_STRING {
            STRSIZE 27;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "grid" {
         DATATYPE  H5T_STRING {
            STRSIZE 4;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "location" {
         DATATYPE  H5T_STRING {
            STRSIZE 4;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "long_name" {
         DATATYPE  H5T_STRING {
            STRSIZE 21;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "time" {
         DATATYPE  H5T_STRING {
            STRSIZE 10;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      ATTRIBUTE "units" {
         DATATYPE  H5T_STRING {
            STRSIZE 7;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
   }

You saw the fillvalue property when you read the dataset with h5py, right?

keewis commented 2 years ago

yes, exactly. I'm not sure where that gets the value from, though.

rsignell-usgs commented 2 years ago

So it sounds like h5py also handles the fill values inconsistently!

If the HDF5 model uses fill_value to denote fill values, it seems it should be assigned the _FillValue value when the HDF5 file contains only that attribute.

@ajelenak, do you have insight here?

ajelenak commented 2 years ago

Fill value is a property of an HDF5 dataset (same as a netCDF variable or Zarr array) that is assigned at its creation. If it is not set explicitly, the HDF5 library assigns the default value, which for all numerical datatypes is zero, I think. That should be shown with the h5dump -p ... command. This is not an HDF5 attribute. h5py just uses what the library says is the fill value of that dataset.

_FillValue is a netCDF attribute (same as HDF5/Zarr attribute) and has no relation to the HDF5 dataset fill value. So you can, and often do, end up with these two values being different in netCDF-4 files. AFAIK, the netCDF library does not sync up these two values, likely because an attribute's value can be modified later but the HDF5 dataset fill value is immutable.
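
The distinction is easy to see with h5py (a sketch on a hypothetical local copy of the file):

import h5py

with h5py.File("coawst_us_20220101_01.nc", "r") as f:
    dset = f["temp"]
    # dataset creation property: set at creation, immutable, defaults to 0
    print(dset.fillvalue)            # 0.0
    # ordinary (mutable) HDF5 attribute written by the netCDF library
    print(dset.attrs["_FillValue"])  # [1.e+37]
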

That part of the kerchunk code is still the original I wrote and it does the right thing in my opinion: translates HDF5 dataset fill value to Zarr array fill value.

How to interpret the Zarr array fill value vs. its _FillValue Zarr attribute is up to the upstream software, like xarray, to decide. When xarray is in CF interpretation mode, it should use the _FillValue attribute value and disregard the Zarr array's fill value. Dealing with numerical rounding errors of tricky fill values like this one is another issue.

rsignell-usgs commented 2 years ago

@ajelenak thanks for this explanation!

So if xarray sets the _FillValue attribute when writing Zarr, and reads the _FillValue attribute when reading Zarr, we should be fine, right?

ajelenak commented 2 years ago
with xarray.netcdf_and_cf_context as ds:
    answer = yes
rabernat commented 2 years ago

So if xarray sets the _FillValue attribute when writing Zarr, and reads the _FillValue attribute when reading Zarr, we should be fine, right?

This is explicitly how we coded it when we implemented the zarr backend. The array-internal fill_value property (analogous to the HDF5 fill_value) is lifted into the _FillValue attribute when decoding from Zarr. And the reverse happens at write time.

I think the problem is that Kerchunk does not use the same code path as Xarray when creating a new Zarr store, so it does not have this logic of setting .fill_value on the Zarr array itself based on the _FillValue attribute. So the default .fill_value==0 is being used. My recommendation is to try to align the Kerchunk Zarr writer more closely with Xarray and, if possible, leverage Xarray's encoding / decoding.

Another problem is that Xarray just overwrites any existing _FillValue attribute without checking whether one is already there (code here). Raising a warning or error would be better. That would have helped track down this bug more easily.
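
A guarded version of that lifting step might look something like this (a sketch of the suggested behaviour, not actual xarray code):

import warnings

def lift_fill_value_checked(zarr_array, attributes):
    # warn instead of silently clobbering an existing, different _FillValue
    fv = zarr_array.fill_value
    if fv is not None:
        existing = attributes.get("_FillValue")
        if existing is not None and existing != fv:
            warnings.warn(
                f"_FillValue {existing!r} disagrees with Zarr fill_value {fv!r}"
            )
        attributes["_FillValue"] = fv
    return attributes
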

martindurant commented 2 years ago

OK, so kerchunk is happy to fill in the _FillValue attribute. I anticipate that in any tool that is not xarray, or with cf decoding off, this will simply be ignored. I'll make that change, we can rerun the fail case and see if all is good. Next week... :)

rsignell-usgs commented 2 years ago

@martindurant , I believe what we want is to make the Kerchunk .zarray fill_value match the _FillValue attribute for each variable.

martindurant commented 2 years ago

Can anyone suggest a simple test case for #181 ?

kthyng commented 11 months ago

This may or may not be directly related to this issue but it was what I found when I googled my problem so I'll add a comment here.

I found that if a scalar value being kerchunked from a file is 0.0 and the fill value is 0.0, it comes into the dataset as NaN. As in:

out["hc/.zarray"]

returns

'{"chunks":[],"compressor":null,"dtype":"<f8","fill_value":0.0,"filters":null,"order":"C","shape":[],"zarr_format":2}'

and also I know from the original model output that the value of hc is 0.0. It has to be exactly this combination from what I can tell — other scalar values with fill_value of 0.0 weren't changed when examined in the resulting dataset. I was able to fix this by postprocessing my kerchunking while running MultiZarrToZarr like so:

fs2 = fsspec.filesystem('')  # local file system to save final jsons to

fname_overall_kerchunk = "kerchunk.json"
json_list = fs2.glob(f"{output_dir_single_files}/1999_000?.json") # combine single json files

def fix_fill_values(out):
    """Fix problem when fill_value and the scalar value both equal 0.0.

    If the fill value and the scalar value are both 0, NaN is filled instead. This fixes that.
    """

    for k in list(out):
        if isinstance(out[k], str) and '"fill_value":0.0' in out[k]:
            out[k] = out[k].replace('"fill_value":0.0', '"fill_value":"NaN"')
    return out

def postprocess(out):
    out = fix_fill_values(out)
    return out

mzz = MultiZarrToZarr(
    json_list,
    concat_dims=["ocean_time"],
    identical_dims= ['lat_rho', 'lon_rho', "lon_psi", "lat_psi",
                    "lat_u", "lon_u", "lat_v", "lon_v", 
                    "Akk_bak","Akp_bak","Akt_bak","Akv_bak","Cs_r","Cs_w",
                    "FSobc_in","FSobc_out","Falpha","Fbeta","Fgamma","Lm2CLM",
                    "Lm3CLM", "LnudgeM2CLM", "LnudgeM3CLM", "LnudgeTCLM",
                    "LsshCLM", "LtracerCLM", "LtracerSrc", "LuvSrc",
                    "LwSrc", "M2nudg", "M2obc_in", "M2obc_out", "M3nudg",
                    "M3obc_in", "M3obc_out", "Tcline", "Tnudg","Tobc_in", "Tobc_out",
                    "Vstretching", "Vtransform", "Znudg", "Zob", "Zos", "angle",
                    "dstart", "dt", "dtfast", "el", "f", "gamma2", "grid", "h",
                    "hc", "mask_psi", "mask_rho", "mask_u", "mask_v", "nHIS", "nRST",
                    "nSTA", "ndefHIS",  "ndtfast", "ntimes", "pm", "pn", "rdrg", 
                    "rdrg2", "rho0", "spherical", "theta_b", "theta_s", "xl",
                    ],
    coo_map = {"ocean_time": "cf:ocean_time",
               },
    postprocess=postprocess,
)

This could also be fixed in preprocessing but it seemed like it would be faster to do postprocessing one time per variable.

cc @martindurant this is the issue I mentioned earlier today.

martindurant commented 11 months ago

That pre/post processing function could be included as an example in kerchunk itself alongside the current drop() one. I'm not sure if there's a way we should be doing this at the scan stage, given the discussion above about fill_value and _FillValue. A legitimate "0 means NaN" seems unlikely even for ints, where I would expect maybe -1, 9999, or max_int for this purpose.

This was netCDF/HDF data? I wonder if there's a flag to say "no missing values" for a whole array.

kthyng commented 11 months ago

netCDF yes.

martindurant commented 11 months ago

You could put a breakpoint() in SingleHdf5ToZarr._translator() to see if there's anything about the particular h5obj that says it has absolutely no missing values (cc @ajelenak )

ajelenak commented 11 months ago

Yes, debugging can help. Alternatively, the output of the command h5dump -d /path/to/ncvar -p -A the_file.nc should provide all the relevant storage information.