This might also be an issue in https://github.com/hdmf-dev/hdmf-zarr. I haven't tested that.
@rly and I discussed this further over Slack, and here's what we have decided for now.
In Zarr attributes, the strings 'NaN', 'Infinity', and '-Infinity' represent the float values nan, inf, and -inf when converted back to HDF5. This also applies when these strings are items in attribute arrays.
These special string values are not allowed in the attributes of the original HDF5 file; if they are present, an exception will be raised. This could pose a problem at some point, but we will cross that bridge when we get to it.
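A minimal sketch of the convention described above, not the actual implementation. The helper names `encode_attr` and `decode_attr` are hypothetical, and the sketch assumes attribute values have already been converted to plain Python scalars, lists, or dicts.

```python
import math

# Reserved strings that stand in for non-finite floats in Zarr attributes.
RESERVED = {"NaN", "Infinity", "-Infinity"}


def encode_attr(value):
    """Make an HDF5 attribute value JSON-safe (hypothetical helper)."""
    if isinstance(value, str):
        if value in RESERVED:
            # After encoding, such a string could not be told apart from
            # an encoded float, so it is rejected up front.
            raise ValueError(f"reserved string in attribute: {value!r}")
        return value
    if isinstance(value, float):
        if math.isnan(value):
            return "NaN"
        if math.isinf(value):
            return "Infinity" if value > 0 else "-Infinity"
        return value
    if isinstance(value, (list, tuple)):
        return [encode_attr(v) for v in value]
    if isinstance(value, dict):
        return {k: encode_attr(v) for k, v in value.items()}
    return value


def decode_attr(value):
    """Turn the reserved strings back into floats (hypothetical helper)."""
    if isinstance(value, str) and value in RESERVED:
        # float() accepts "NaN", "Infinity" and "-Infinity" directly.
        return float(value)
    if isinstance(value, list):
        return [decode_attr(v) for v in value]
    if isinstance(value, dict):
        return {k: decode_attr(v) for k, v in value.items()}
    return value


encoded = encode_attr({"threshold": float("inf"), "gains": [1.0, float("nan")]})
decoded = decode_attr(encoded)
```

Rejecting the reserved strings on the way in is what keeps the round trip unambiguous: every `"NaN"` found in the Zarr attributes is known to have started out as a float.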
On a related note, for now we have opted to keep the HDF5 -> Zarr conversion simple and not try to retain data types in attributes (float32, float64, int32, uint32, etc., or bytes vs. str). This information is lost in the conversion to JSON. Note that this only applies to attributes, not datasets: in datasets, the dtype is always retained. There could still be an issue with compound types, which are currently JSON-encoded, and we'll need to think more about this.
To add, the Zarr v2 and v3 specs allow attributes to be JSON objects, but do not specify how those JSON objects are structured. The Zarr team discussed potentially storing precise data types in attributes by using a JSON object with a type key (and maybe also a shape key), for example:
{
"dtype": "<f8",
"value": "NaN"
}
or
{
"dtype": "<f8",
"value": [1, "NaN"]
}
We could adopt this convention down the road, but we would have to maintain the mapping/spec.
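To illustrate what maintaining such a mapping might involve, here is a rough sketch of a reader for the object form shown above. It is not part of any Zarr spec or library; the function name `decode_typed_attr` is hypothetical, and it assumes the `dtype` is a NumPy-style typestring such as `<f8`.

```python
import math


def decode_typed_attr(obj):
    """Decode a {"dtype": ..., "value": ...} attribute object (hypothetical)."""
    dtype, value = obj["dtype"], obj["value"]
    # Skip the optional byte-order character; the next character is the kind,
    # e.g. "f" for floating point, "i" for signed integer, "U" for unicode.
    kind = dtype.lstrip("<>|=")[0]

    def convert(v):
        if isinstance(v, list):
            return [convert(x) for x in v]
        if kind == "f":
            # float() handles both numbers and "NaN"/"Infinity"/"-Infinity".
            return float(v)
        return v

    return convert(value)


scalar = decode_typed_attr({"dtype": "<f8", "value": "NaN"})
array = decode_typed_attr({"dtype": "<f8", "value": [1, "NaN"]})
text = decode_typed_attr({"dtype": "<U3", "value": "NaN"})
```

With the dtype stored next to the value, the string `"NaN"` is only turned into a float when the attribute is declared as a float type, so a genuine string attribute survives unchanged.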
From @magland:
[in our kerchunk-like reference file system JSON file]
Background
NaN, Infinity, and -Infinity are special IEEE 754 float values. NaN is commonly used in scientific data and supported in HDF5. NaN, Infinity, and -Infinity are not valid JSON, and some JSON parsers in some languages do not support these values.
By default, the Python `json` module dumps `float('nan')`, `float('Inf')`, and `float('-Inf')` to string/file as floats, but this produces technically invalid JSON. The `json` module can load those JSON values as their Python float values. There are flags to turn off this behavior in the native `json` module and other Python JSON parsers.

Attributes and dataset fill values in Zarr are stored as JSON.
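The default behavior and the strict alternatives can be seen with the standard library alone. This is a small demonstration; the callback name `reject_constant` is made up for the example.

```python
import json
import math

# Default behavior: non-finite floats are written as bare tokens,
# which is not valid JSON.
text = json.dumps([float("nan"), float("inf"), float("-inf")])
# text is now the string: [NaN, Infinity, -Infinity]

# The json module reads its own output back as floats.
values = json.loads(text)

# allow_nan=False makes the encoder refuse to emit these tokens.
try:
    json.dumps(float("nan"), allow_nan=False)
    strict_dump_failed = False
except ValueError:
    strict_dump_failed = True


# parse_constant is called with "NaN", "Infinity" or "-Infinity" on load,
# so it can be used to reject them.
def reject_constant(name):
    raise ValueError(f"non-standard JSON constant: {name}")


try:
    json.loads(text, parse_constant=reject_constant)
    strict_load_failed = False
except ValueError:
    strict_load_failed = True
```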
NaN attribute in Zarr
Zarr-python uses Python's `json` module with default nan settings and writes a NaN-valued attribute as a bare, unquoted `NaN` float literal. Zarr-python reads this back as a float, but it is not valid JSON. Even the GitHub JSON code colorizer complains that NaN is invalid JSON. Other Zarr APIs may not be able to parse `.zattrs` or read NaN as a float. It seems like NaN in attributes may be invalid according to the Zarr spec.

NaN fill value in Zarr
Zarr-python supports setting a dataset `fill_value` to `float('nan')`, `float('Inf')`, and `float('-Inf')`, and on write, it specially dumps those as the strings `"NaN"`, `"Infinity"`, and `"-Infinity"`. On read, it parses those strings as `np.nan`, `np.PINF`, and `np.NINF`. For example, the `.zarray` of a dataset with a NaN fill value contains `"fill_value": "NaN"`.

ref: https://zarr.readthedocs.io/en/stable/spec/v2.html#fill-value-encoding
I'm not sure how other Zarr APIs read and write NaN attributes and fill values.
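As a rough illustration of the fill-value encoding, the sketch below reads a hand-written `.zarray` document and uses the accompanying `dtype` to decide whether the string should become a float. It does not use Zarr-python; the helper `decode_fill_value` is hypothetical, and the shape and chunk values are made up.

```python
import json
import math

# A hand-written .zarray document for illustration only.
ZARRAY_TEXT = """
{
    "chunks": [10],
    "compressor": null,
    "dtype": "<f8",
    "fill_value": "NaN",
    "filters": null,
    "order": "C",
    "shape": [100],
    "zarr_format": 2
}
"""


def decode_fill_value(meta):
    """Decode fill_value using the array's dtype (hypothetical helper)."""
    fill, dtype = meta["fill_value"], meta["dtype"]
    # Skip the byte-order character to get the kind, e.g. "f" for float.
    kind = dtype.lstrip("<>|=")[0]
    if kind == "f" and fill in ("NaN", "Infinity", "-Infinity"):
        return float(fill)
    return fill


# The document is valid JSON because the NaN is stored as a string.
meta = json.loads(ZARRAY_TEXT)
fill = decode_fill_value(meta)
```

Because the dtype always accompanies the fill value, the string form is unambiguous here, which is exactly what plain attributes lack.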
What to do here
In NWB HDF5 files, we sometimes have NaN in attributes. We could encode those as strings, just like Zarr-python encodes a NaN fill value as a string. However, a dataset fill value always has an accompanying dtype, and Zarr attribute values do not have dtypes. They are arbitrary JSON object literals. So looking only at the data, we will not know whether an attribute value `["NaN", "NaN"]` should be interpreted as two `float('nan')` values or two `"NaN"` strings. Fortunately, in NWB, attributes are specified with a dtype in the schema, so if we look at the schema, we can parse this correctly.

I suggest that we encode Python `float('nan')`, `float('Inf')`, and `float('-Inf')` values as the strings `"NaN"`, `"Infinity"`, and `"-Infinity"` so that they are valid JSON. Zarr-python will read these special strings in attributes as strings, so we should note this for general users. But we can amend PyNWB to convert these strings to floats if the dtype for the attribute spec is float/numeric. Zarr-python will read these special strings in fill values correctly because we use the same string encoding.

We can define a custom encoder, `FloatJSONEncoder`, and pass it to `json.dump` to walk through all floating point values in the JSON and convert `float('nan')`, `float('Inf')`, and `float('-Inf')` values to the strings `"NaN"`, `"Infinity"`, and `"-Infinity"`.

See also: