rsignell-usgs closed this issue 2 years ago
My hunch was that this is due to writing the value as text in the attributes of zarr. But that raises the question of how this is normally handled outside of kerchunk, and why decode_cf matters.
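For context, here is a minimal sketch of why decode_cf matters (synthetic data, not the COAWST file): the _FillValue attribute is only honoured during CF decoding, so without it the sentinel values stay in the array.

```python
import numpy as np
import xarray as xr

# Synthetic example: _FillValue in attrs is only applied by CF decoding.
raw = xr.Dataset({"temp": ("x", np.array([1e37, 21.5], dtype="float32"))})
raw["temp"].attrs["_FillValue"] = np.float32(1e37)

decoded = xr.decode_cf(raw)
print(decoded["temp"].values)  # the 1e37 entry becomes nan
```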
@rsignell-usgs , do you mind using xarray's to_zarr on that dataset to see what .zarray/.zattrs it creates? You can set the output chunk store (as opposed to the metadata store) to be a dict or even a custom dict that doesn't store anything (see below) to avoid writing things.
class NoOpDict(dict):
    def __setitem__(self, *_):
        pass

metadata_store = {}
chunk_store = NoOpDict()
this may be #105: if you look at the NetCDF file with h5py, you can see that _FillValue is array([1.e+37], dtype=float32) while the fillvalue property has a value of 0.0
@keewis, what is "fillvalue"? I've never heard of that attribute being used for fill values. That must be an HDF5-specific thing?
@martindurant, when I write Zarr, it's even more confusing. The fill_value is set to 9.999999e+36, but when read back into xarray with decode_cf=False, the value is 1e37. ????
See cells [9] and [11] here: https://nbviewer.org/gist/2400089529be8bc3924aacfd3b2485e6
That must be an HDF5-specific thing?
yes, exactly. You can examine it with
In [1]: import fsspec
...: import h5py
...:
...: fs = fsspec.filesystem(
...: "s3", anon=True, client_kwargs={"endpoint_url": "https://mghp.osn.xsede.org"}
...: )
...: url = "s3://rsignellbucket1/COAWST/coawst_us_20220101_01.nc"
...:
...: f = h5py.File(fs.open(url))
...: f["temp"].fillvalue
Out[1]: 0.0
For float values it is used to set zarr's fill_value attribute, which AFAIU the zarr engine will use to replace _FillValue.
You might have to synchronise fill_value with _FillValue in the references (e.g. by passing the hacky function from https://github.com/fsspec/kerchunk/issues/105#issuecomment-982834456 to MultiZarrToZarr's preprocess). For example:
In [1]: import fsspec
...:
...: fs = fsspec.filesystem(
...: "s3", anon=True, client_kwargs={"endpoint_url": "https://mghp.osn.xsede.org"}
...: )
...: url = "s3://rsignellbucket1/COAWST/coawst_us_20220101_01.nc"
...:
...: import kerchunk.hdf
...:
...: hdf = kerchunk.hdf.SingleHdf5ToZarr(fs.open(url), url=url)
...: references = hdf.translate()
In [2]: import xarray as xr
...:
...: r_opts = dict(anon=True, client_kwargs={"endpoint_url": "https://mghp.osn.xsede.org"})
In [3]: fs = fsspec.filesystem(
...: "reference", fo=references, remote_protocol="s3", remote_options=r_opts
...: )
...: m = fs.get_mapper("")
...:
...: ds = xr.open_dataset(
...: m,
...: engine="zarr",
...: chunks={},
...: backend_kwargs=dict(consolidated=False),
...: decode_cf=False,
...: )
In [4]: ds.temp.attrs
Out[4]:
{'_FillValue': 0.0,
'coordinates': 'lon_rho lat_rho s_rho ocean_time',
'field': 'temperature, scalar, series',
'grid': 'grid',
'location': 'face',
'long_name': 'potential temperature',
'time': 'ocean_time',
'units': 'Celsius'}
In [5]: ds.temp[0, 0, 0, 0].values
Out[5]: array(1.e+37, dtype=float32)
In [6]: decoded = xr.decode_cf(ds)
...: decoded.temp[0, 0, 0, 0].values
Out[6]: array(1.e+37, dtype=float32)
In [7]: import ujson
...: import itertools
...:
...:
...: def correct_fill_values(data):
...: def fix_variable(values):
...: zattrs = values[".zattrs"]
...:
...: if "_FillValue" not in zattrs:
...: return values
...:
...: _FillValue = zattrs["_FillValue"]
...: if values[".zarray"]["fill_value"] != _FillValue:
...: values[".zarray"]["fill_value"] = _FillValue
...:
...: return values
...:
...: refs = data["refs"]
...: prepared = (
...: (tuple(key.split("/")), value) for key, value in refs.items() if "/" in key
...: )
...: filtered = (
...: (key, ujson.loads(value))
...: for key, value in prepared
...: if key[1] in (".zattrs", ".zarray")
...: )
...: key = lambda i: i[0][0]
...: grouped = (
...: (name, {n[1]: v for n, v in group})
...: for name, group in itertools.groupby(sorted(filtered, key=key), key=key)
...: )
...: fixed = ((name, fix_variable(var)) for name, var in grouped)
...: flattened = {
...: f"{name}/{item}": ujson.dumps(data, indent=4)
...: for name, var in fixed
...: for item, data in var.items()
...: }
...: data["refs"] = dict(sorted((refs | flattened).items()))
...: return data
In [8]: corrected_refs = correct_fill_values(references)
In [9]: fs = fsspec.filesystem(
...: "reference", fo=corrected_refs, remote_protocol="s3", remote_options=r_opts
...: )
...: m = fs.get_mapper("")
...:
...: ds = xr.open_dataset(
...: m,
...: engine="zarr",
...: chunks={},
...: backend_kwargs=dict(consolidated=False),
...: decode_cf=False,
...: )
...: display(ds.temp.attrs, ds.temp[0, 0, 0, 0].values)
{'_FillValue': 1e+37,
'coordinates': 'lon_rho lat_rho s_rho ocean_time',
'field': 'temperature, scalar, series',
'grid': 'grid',
'location': 'face',
'long_name': 'potential temperature',
'time': 'ocean_time',
'units': 'Celsius'}
array(1.e+37, dtype=float32)
In [10]: decoded = xr.decode_cf(ds)
...: decoded.temp[0, 0, 0, 0].values
Out[10]: array(nan, dtype=float32)
The fill_value is set to 9.999999e+36, but when read back into xarray with decode_cf=False, the value is 1e37
That might be a precision / formatting issue:
In [33]: np.float32(9.999999933815813e36)
Out[33]: 1e+37
In [34]: np.float64(9.999999933815813e36)
Out[34]: 9.999999933815813e+36
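To spell out the precision point: 9.999999933815813e+36 is just the float64 repr of the closest float32 to 1e37, so casting back to float32 recovers exactly the same bits. A quick check using only numpy and the stdlib:

```python
import json
import numpy as np

original = np.float32(1e37)
as_float64 = float(original)  # prints as 9.999999933815813e+36

# A JSON round trip (what kerchunk references do) preserves the float64 exactly,
# so casting back to float32 is lossless:
round_tripped = np.float32(json.loads(json.dumps(as_float64)))

assert round_tripped == original
assert round_tripped.tobytes() == original.tobytes()  # bit-identical
```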
I have set up a reproducible notebook here manipulating some fill_values: https://gist.github.com/peterm790/6e6ddb2df0f095056837d21c531bbe57
It seems that only fill_value in .zarray is considered by xarray when opening as zarr from a kerchunk reference file, and _FillValue in .zattrs is not considered.
The fill_value in @rsignell-usgs's example is 0 (which is incorrect), so I am not sure if manually setting it to 1e37 would correctly render NaN, but it might be worth investigating. This example didn't reproduce the 1e37 => 9.99999993e+36 conversion despite setting dtype = np.float32.
just for reference, the code at the end of https://github.com/fsspec/kerchunk/issues/177#issuecomment-1156746064 successfully changes the references to work as expected (i.e. you get nan values instead of the fill value with decode_cf=True)
@keewis, you mean the correct_fill_values function?
yes, although the implementation feels a bit clunky
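For what it's worth, the same idea can be written more compactly. This is my own sketch, not kerchunk API: it assumes the zarr v2 reference layout (keys like temp/.zattrs and temp/.zarray holding JSON strings) and uses the stdlib json instead of ujson.

```python
import json

def sync_fill_values(references):
    """Copy each variable's _FillValue attribute into the matching
    .zarray's fill_value, in place, and return the reference dict."""
    refs = references["refs"]
    for key in list(refs):
        if not key.endswith("/.zattrs"):
            continue
        fill = json.loads(refs[key]).get("_FillValue")
        zarray_key = key.removesuffix(".zattrs") + ".zarray"
        if fill is None or zarray_key not in refs:
            continue
        zarray = json.loads(refs[zarray_key])
        if zarray.get("fill_value") != fill:
            zarray["fill_value"] = fill
            refs[zarray_key] = json.dumps(zarray)
    return references
```

Unlike the groupby version, this touches only the .zarray entries that actually disagree with their .zattrs, and leaves chunk references untouched.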
@rabernat, hate to drag you into everything, but I bet you will immediately know what to do here, as you've been involved with xarray, the CF conventions & the xarray zarr writer.
The whole issue is explained in this reproducible notebook (no credentials needed, as it uses OSN!)
Rather than using string formatting to assess equality, I would be looking at raw bytes, e.g.
actual_value = ds.temp[0,0,0,0].data
actual_value_bytes = actual_value.tobytes()
assert actual_value_bytes == b'\xc2\xbd\xf0|'
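If I'm reading those bytes right, both textual spellings collapse to the same four little-endian float32 bytes, which is exactly why the byte-level comparison is more robust than string formatting:

```python
import numpy as np

# Both the short literal and the long float64 repr denote the same float32:
assert np.float32(1e37).tobytes() == b'\xc2\xbd\xf0|'
assert np.float32(9.999999933815813e36).tobytes() == b'\xc2\xbd\xf0|'
```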
I dug into this and noticed that the kerchunk dataset is converting the fill value to a float64 type, while the original (working) zarr on disk is converting it to a float32 (same as original netCDF).
Digging into this reveals some pretty weird things about how netCDF, Zarr, and Xarray handle fill values. Note, for example, that the on-disk zarr does not actually keep the fill value in the attributes at all:
! cat foo.zarr/temp/.zattrs
{
"_ARRAY_DIMENSIONS": [
"ocean_time",
"s_rho",
"eta_rho",
"xi_rho"
],
"coordinates": "lon_rho lat_rho s_rho ocean_time",
"field": "temperature, scalar, series",
"grid": "grid",
"location": "face",
"long_name": "potential temperature",
"time": "ocean_time",
"units": "Celsius"
}
Compare this to both the Kerchunk Zarr and the original NetCDF. Instead, the fill value lives in the zarr array metadata (.zarray).
! cat foo.zarr/temp/.zarray
{
"chunks": [
1,
4,
84,
448
],
"compressor": {
"blocksize": 0,
"clevel": 5,
"cname": "lz4",
"id": "blosc",
"shuffle": 1
},
"dtype": "<f4",
"fill_value": 9.999999933815813e+36,
"filters": null,
"order": "C",
"shape": [
2,
16,
336,
896
],
"zarr_format": 2
}
Thus a key problem here is that there are redundant / overlapping mechanisms for specifying fill values. When we implemented the Zarr backend, we decided to lift the Zarr array fill_value into the dataset attributes when opening a new zarr array:
https://github.com/pydata/xarray/blob/d7931f9014a26e712ff5f30c4082cf0261f045d3/xarray/backends/zarr.py#L453-L456
(You can also see how writing is handled here.)
I am no longer sure that this was the right choice. The zarr spec defines fill_value as
fill_value: A scalar value providing the default value to use for uninitialized portions of the array, or null if no fill_value is to be used.
It is only ever used by zarr if you request data from a chunk that has not been initialized. It is NOT a mask. This is not quite the same as the NUG _FillValue:
The _FillValue attribute specifies the fill value used to pre-fill disk space allocated to the variable. Such pre-fill occurs unless nofill mode is set using nc_set_fill(). The fill value is returned when reading values that were never written. If _FillValue is defined then it should be scalar and of the same type as the variable. If the variable is packed using scale_factor and add_offset attributes (see below), the _FillValue attribute should have the data type of the packed data.
It is not necessary to define your own _FillValue attribute for a variable if the default fill value for the type of the variable is adequate. However, use of the default fill value for data type byte is not recommended. Note that if you change the value of this attribute, the changed value applies only to subsequent writes; previously written data are not changed.
Generic applications often need to write a value to represent undefined or missing values. The fill value provides an appropriate value for this purpose because it is normally outside the valid range and therefore treated as missing when read by generic applications. It is legal (but not recommended) for the fill value to be within the valid range.
The kerchunk Zarr now has the default fill_value of 0.0 in .zarray
print(fs.cat('temp/.zarray').decode("utf-8"))
{
"chunks": [
1,
8,
168,
448
],
"compressor": {
"id": "zlib",
"level": 1
},
"dtype": "<f4",
"fill_value": 0.0,
"filters": [
{
"elementsize": 4,
"id": "shuffle"
}
],
"order": "C",
"shape": [
12,
16,
336,
896
],
"zarr_format": 2
}
Okay, so why does any of this matter? Because whether we use the zarr array internal fill_value or the attribute _FillValue seems to affect how these are decoded from json to binary data. Observe
z_array = zarr.open("foo.zarr")["temp"]
on_disk_zarr_fill_value = z_array.fill_value
assert z_array[0, 0, 0, 0] == on_disk_zarr_fill_value
on_disk_zarr_fill_value_bytes = on_disk_zarr_fill_value.tobytes()
print(type(on_disk_zarr_fill_value), on_disk_zarr_fill_value_bytes)
# -> <class 'numpy.float32'> b'\xc2\xbd\xf0|'
but
kerchunk_z_array = zarr.open(m)['temp']
kerchunk_attrs_fill_value = kerchunk_z_array.attrs["_FillValue"]
print(type(kerchunk_attrs_fill_value))  # -> float
However, I am mystified by why this still works:
assert kerchunk_z_array[0, 0, 0, 0] == kerchunk_attrs_fill_value
... but cf-decoding does not.
I am currently looking into the xarray coding code to understand what is going wrong.
In the meantime, one thing you might try is to modify the kerchunk dataset to use the zarr Array fill_value property, instead of the CF-style attribute.
Ok something even more obvious.
dsk = xr.open_dataset(m, engine="zarr", chunks={},
backend_kwargs=dict(consolidated=False), decode_cf=False)
dsk.temp._FillValue # -> 0.0
So Xarray is actually overwriting the fill value already in the attributes with the zarr array fill_value! Manually overriding does the trick:
dsk.temp.attrs['_FillValue'] = kerchunk_attrs_fill_value
xr.decode_cf(dsk).temp[0, 0, 0, 0].values # -> NaN
This may be an xarray bug. The problem is ultimately the duplicate and inconsistent fill_values.
This is the specific place where the overwriting happens: https://github.com/pydata/xarray/blob/d7931f9014a26e712ff5f30c4082cf0261f045d3/xarray/backends/zarr.py#L455-L456
So there are two simultaneous paths forward: fix the xarray overwriting behavior, and keep the array .fill_value consistent with the attribute _FillValue in the references.
@rabernat you are awesome! Thanks for showing us the path(s)!
Regardless of that, kerchunk should strive to not produce inconsistent data (different values of array .fill_value and attribute _FillValue).
To be fair, the original file was already inconsistent (the HDF5 fillvalue property has a different value from the _FillValue attribute – that's where the 0.0 comes from), so all kerchunk did was translate that to zarr.
That said, I agree that xarray should do something when it finds inconsistent fill values, in any file format.
@keewis, you said the original NetCDF4 file had fill_value set, but I did:
h5dump -H coawst_us_20220101_13.nc > h5dump.txt
and searching the resulting file for fill_value comes up empty.
The temp variable looks like this:
DATASET "temp" {
DATATYPE H5T_IEEE_F32LE
DATASPACE SIMPLE { ( 12, 16, 336, 896 ) / ( H5S_UNLIMITED, 16, 336, 896 ) }
ATTRIBUTE "DIMENSION_LIST" {
DATATYPE H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
DATASPACE SIMPLE { ( 4 ) / ( 4 ) }
}
ATTRIBUTE "_FillValue" {
DATATYPE H5T_IEEE_F32LE
DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
}
ATTRIBUTE "coordinates" {
DATATYPE H5T_STRING {
STRSIZE 32;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
ATTRIBUTE "field" {
DATATYPE H5T_STRING {
STRSIZE 27;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
ATTRIBUTE "grid" {
DATATYPE H5T_STRING {
STRSIZE 4;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
ATTRIBUTE "location" {
DATATYPE H5T_STRING {
STRSIZE 4;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
ATTRIBUTE "long_name" {
DATATYPE H5T_STRING {
STRSIZE 21;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
ATTRIBUTE "time" {
DATATYPE H5T_STRING {
STRSIZE 10;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
ATTRIBUTE "units" {
DATATYPE H5T_STRING {
STRSIZE 7;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
}
}
You saw the fill_value attribute when you read the dataset with h5py, right?
yes, exactly. I'm not sure where that gets the value from, though.
So sounds like h5py also handles the fill values inconsistently!
If the HDF5 model uses fill_value to denote fill values, it seems it should be assigned the _FillValue value when the HDF5 file contains only that attribute.
@ajelenak, do you have insight here?
Fill value is a property of an HDF5 dataset (same as a netCDF variable or Zarr array) which is assigned at its creation. If it is not set explicitly, the HDF5 library assigns the default value, which for all numerical datatypes is zero, I think. That should be shown with the h5dump -p ... command. This is not an HDF5 attribute. H5py just uses what the library says is the fill value of that dataset.
_FillValue is a netCDF attribute (same as an HDF5/Zarr attribute) and has no relation to the HDF5 dataset fill value. So you can, and often do, end up with these two values being different in netCDF-4 files. AFAIK, the netCDF library does not sync up these two values, likely because an attribute's value can be modified later but the HDF5 dataset fill value is immutable.
That part of the kerchunk code is still the original I wrote and it does the right thing in my opinion: translates HDF5 dataset fill value to Zarr array fill value.
How to interpret the Zarr array fill value vs its _FillValue Zarr attribute is up to the upstream software, like xarray, to decide. In case xarray is in the CF interpretation mode, it should use the _FillValue attribute value and disregard that Zarr array's fill value. Dealing with numerical rounding errors of tricky fill values like this one here is another issue.
@ajelenak thanks for this explanation!
So if xarray sets the _FillValue attribute when writing Zarr, and reads the _FillValue attribute when reading Zarr, we should be fine, right?
with xarray.netcdf_and_cf_context as ds:
answer = yes
So if xarray sets the _FillValue attribute when writing Zarr, and reads the _FillValue attribute when reading Zarr, we should be fine, right?
This is explicitly how we coded it when we implemented the zarr backend. The array internal fill_value property (analogous to the HDF5 fill value) is lifted into the _FillValue attribute when decoding from Zarr. And the reverse happens at write time.
I think the problem is that Kerchunk does not use the same code path as Xarray when creating a new Zarr store, so it does not have this logic of setting .fill_value on the Zarr array itself based on the _FillValue attribute. So the default .fill_value == 0 is being used. My recommendation is to try to align the Kerchunk Zarr writer more closely with Xarray and, if possible, leverage Xarray's encoding / decoding.
Another problem is that Xarray just overwrites any existing _FillValue attribute without checking if it already exists (code here). Raising a warning or error would be better. That would have helped track down this bug more easily.
OK, so kerchunk is happy to fill in the _FillValue attribute. I anticipate that in any tool that is not xarray, or with cf decoding off, this will simply be ignored. I'll make that change, we can rerun the fail case and see if all is good. Next week... :)
@martindurant, I believe what we want is to make the Kerchunk .zarray fill_value match the _FillValue attribute for each variable.
Can anyone suggest a simple test case for #181 ?
This may or may not be directly related to this issue but it was what I found when I googled my problem so I'll add a comment here.
I found that if a scalar value being kerchunked from a file is 0.0 and the fill value is 0.0, it comes into the dataset as nan. As in:
out["hc/.zarray"]
returns
'{"chunks":[],"compressor":null,"dtype":"<f8","fill_value":0.0,"filters":null,"order":"C","shape":[],"zarr_format":2}'
and also I know from the original model output that the value of hc is 0.0. It has to be exactly this combination from what I can tell; other scalar values with a fill_value of 0.0 weren't changed when examined in the resulting dataset. I was able to fix this by postprocessing my kerchunking while running MultiZarrToZarr like so:
import fsspec
from kerchunk.combine import MultiZarrToZarr

fs2 = fsspec.filesystem('')  # local file system to save final jsons to
fname_overall_kerchunk = "kerchunk.json"
json_list = fs2.glob(f"{output_dir_single_files}/1999_000?.json")  # combine single json files

def fix_fill_values(out):
    """Fix problem when fill_value and scalar value both equal 0.0.

    If the fill value and the scalar value are both 0, nan is filled instead. This fixes that.
    """
    for k in list(out):
        if isinstance(out[k], str) and '"fill_value":0.0' in out[k]:
            out[k] = out[k].replace('"fill_value":0.0', '"fill_value":"NaN"')
    return out

def postprocess(out):
    out = fix_fill_values(out)
    return out
mzz = MultiZarrToZarr(
json_list,
concat_dims=["ocean_time"],
identical_dims= ['lat_rho', 'lon_rho', "lon_psi", "lat_psi",
"lat_u", "lon_u", "lat_v", "lon_v",
"Akk_bak","Akp_bak","Akt_bak","Akv_bak","Cs_r","Cs_w",
"FSobc_in","FSobc_out","Falpha","Fbeta","Fgamma","Lm2CLM",
"Lm3CLM", "LnudgeM2CLM", "LnudgeM3CLM", "LnudgeTCLM",
"LsshCLM", "LtracerCLM", "LtracerSrc", "LuvSrc",
"LwSrc", "M2nudg", "M2obc_in", "M2obc_out", "M3nudg",
"M3obc_in", "M3obc_out", "Tcline", "Tnudg","Tobc_in", "Tobc_out",
"Vstretching", "Vtransform", "Znudg", "Zob", "Zos", "angle",
"dstart", "dt", "dtfast", "el", "f", "gamma2", "grid", "h",
"hc", "mask_psi", "mask_rho", "mask_u", "mask_v", "nHIS", "nRST",
"nSTA", "ndefHIS", "ndtfast", "ntimes", "pm", "pn", "rdrg",
"rdrg2", "rho0", "spherical", "theta_b", "theta_s", "xl",
],
coo_map = {"ocean_time": "cf:ocean_time",
},
postprocess=postprocess,
)
This could also be fixed in preprocessing but it seemed like it would be faster to do postprocessing one time per variable.
cc @martindurant this is the issue I mentioned earlier today.
That pre/post processing function could be included as an example in kerchunk itself alongside the current drop() one. I'm not sure if there's a way we should be doing this at the scan stage, given the discussion above about fill_value and _FillValue. A legitimate "0 means NaN" seems unlikely even for ints, where I would expect maybe -1, 9999, or max_int for this purpose.
This was netCDF/HDF data? I wonder if there's a flag to say "no missing values" for a whole array.
netCDF yes.
You could put a breakpoint() in SingleHdf5ToZarr._translator() to see if there's anything about the particular h5obj that says it has absolutely no missing values
(cc @ajelenak )
Yes, debugging can help. Alternatively, the output of this command: h5dump -d /path/to/ncvar -p -A the_file.nc should provide all the relevant storage information.
In my original NetCDF4 file (HDF5 under the hood) the fill value _FillValue is set to 1.e+37 for 32-bit floats, and when loaded into xarray with decode_cf=True, the data values with 1e37 are correctly interpreted as fill values and set to NaN.
In the kerchunk JSON, the _FillValue is set to 9.99999993e+36 and when loaded into xarray with decode_cf=True, the data values are not interpreted as fill values, and show up as 9.99999993e+36 (not 1e37 and not NaN).
Pretty confused on what is going on here!
Here is a Reproducible Notebook
cc @peterm790