NeurodataWithoutBorders / lindi

Linked Data Interface (LINDI) - cloud-friendly access to NWB data
BSD 3-Clause "New" or "Revised" License

Handling NaN in attributes and elsewhere #11

Closed: rly closed this issue 8 months ago

rly commented 8 months ago

From @magland:

NaN in attributes. kerchunk allows non-json-compliant .zattrs, but I don't think we should do that. It should be json-compliant, but we do need a way to represent NaN.

[in our kerchunk-like reference file system JSON file]

Background

NaN, Infinity, and -Infinity are special IEEE 754 float values. NaN is commonly used in scientific data and supported in HDF5. NaN, Infinity, and -Infinity are not valid JSON, and some JSON parsers in some languages do not support these values.

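By default, the Python json module writes float('nan'), float('Inf'), and float('-Inf') as the bare tokens NaN, Infinity, and -Infinity, which is technically invalid JSON. The json module also loads those tokens back as Python float values. Both behaviors can be turned off: json.dump and json.dumps accept allow_nan=False, which raises ValueError on such values, and json.load and json.loads accept a parse_constant hook that can reject the tokens. Some other Python JSON libraries offer similar options.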
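To make the default behavior concrete, here is a minimal sketch using only the standard library. The helper name reject_constant is made up for this example; allow_nan and parse_constant are the standard json module parameters.

```python
import json
import math

# Default behavior: non-finite floats are written as bare tokens,
# which strict JSON parsers reject.
text = json.dumps({"a": float("nan"), "b": float("inf"), "c": float("-inf")})
print(text)  # {"a": NaN, "b": Infinity, "c": -Infinity}

# The same module reads those tokens back as Python floats.
loaded = json.loads(text)
assert math.isnan(loaded["a"]) and loaded["b"] == math.inf

# Strict writing: allow_nan=False raises ValueError instead.
try:
    json.dumps({"a": float("nan")}, allow_nan=False)
    strict_dump_failed = False
except ValueError:
    strict_dump_failed = True

# Strict reading: parse_constant is called for the NaN, Infinity and
# -Infinity tokens, so a hook that raises turns them into errors.
def reject_constant(name):  # hypothetical helper for this sketch
    raise ValueError(f"non-standard JSON constant: {name}")

try:
    json.loads(text, parse_constant=reject_constant)
    strict_load_failed = False
except ValueError:
    strict_load_failed = True
```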

Attributes and dataset fill values in Zarr are stored as JSON.

NaN attribute in Zarr

Zarr-python uses Python's json module with default nan settings and writes a NaN-valued attribute as a float:

import numpy as np
import zarr

store = zarr.DirectoryStore("example-nan.zarr")  # zarr-python 2.x API
root = zarr.group(store=store, overwrite=True)
root.attrs["foo_attr1"] = np.nan  # written to .zattrs immediately
with open("example-nan.zarr/.zattrs") as f:
    print(f.read())

{
    "foo_attr1": NaN
}

Zarr-python reads this as a float, but this is not valid JSON. Even the GitHub json code colorizer complains that NaN is invalid JSON. Other zarr APIs may not be able to parse .zattrs or read NaN as a float. It seems like NaN in attributes may be invalid according to the Zarr spec.

NaN fill value in Zarr

Zarr-python supports setting a dataset fill_value to float('nan'), float('Inf'), and float('-Inf'), and on write, it specially dumps those as the strings "NaN", "Infinity", and "-Infinity". On read, it parses those strings as np.nan, np.PINF, and np.NINF (NaN, positive infinity, and negative infinity; the np.PINF and np.NINF aliases were removed in NumPy 2.0 in favor of np.inf and -np.inf). For example, a .zarray looks like:

{
    "chunks": [
        2
    ],
    "compressor": null,
    "dtype": "<f8",
    "fill_value": "NaN",
    "filters": null,
    "order": "C",
    "shape": [
        2
    ],
    "zarr_format": 2
}

ref: https://zarr.readthedocs.io/en/stable/spec/v2.html#fill-value-encoding

I'm not sure how other Zarr APIs read and write NaN attributes and fill values.

What to do here

In NWB HDF5 files, we sometimes have NaN in attributes. We could encode those as strings, just like Zarr-python encodes a NaN fill value as a string. However, a dataset fill value always has an accompanying dtype, whereas Zarr attribute values do not: they are arbitrary JSON values. So looking only at the data, we cannot tell whether an attribute value ["NaN", "NaN"] should be decoded as two float('nan') values or as two "NaN" strings. Fortunately, in NWB, attributes are specified with a dtype in the schema, so if we consult the schema, we can parse this correctly.
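The ambiguity can be illustrated with a small sketch. The function decode_attr and its is_float flag are hypothetical names for this example; in practice the flag would come from the dtype in the NWB schema, which is not modeled here.

```python
import math

SPECIAL_FLOATS = {"NaN": math.nan, "Infinity": math.inf, "-Infinity": -math.inf}

def decode_attr(value, is_float):
    """Decode a JSON attribute value, given whether the schema says it is float.

    Hypothetical helper: is_float stands in for a schema dtype lookup.
    """
    if isinstance(value, list):
        return [decode_attr(v, is_float) for v in value]
    if is_float and isinstance(value, str) and value in SPECIAL_FLOATS:
        return SPECIAL_FLOATS[value]
    return value

raw = ["NaN", "NaN"]                          # identical JSON in both cases
as_text = decode_attr(raw, is_float=False)    # stays two strings
as_float = decode_attr(raw, is_float=True)    # becomes two float NaNs
```

Without the schema, nothing in the JSON itself distinguishes the two readings.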

I suggest that we encode Python float('nan'), float('Inf'), and float('-Inf') values as the strings "NaN", "Infinity", and "-Infinity" so that they are valid JSON. Zarr-python will read these special strings in attributes as strings, so we should note this for general users. But we can amend PyNWB to convert these strings to floats if the dtype for the attribute spec is float/numeric. Zarr-python will read these special strings in fill values correctly because we use the same string encoding.

This code defines a custom encoder, FloatJSONEncoder, that we can pass to json.dump. It walks all floating-point values in the data being serialized and converts float('nan'), float('Inf'), and float('-Inf') values to the strings "NaN", "Infinity", and "-Infinity".
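For readers without the linked code at hand, the following is a rough sketch of how such an encoder could be written. It is not the linked FloatJSONEncoder; the class name NonFiniteFloatEncoder and the helper _replace_nonfinite are made up here. One detail worth knowing: JSONEncoder.default is never called for floats, so the sketch preprocesses the object tree in iterencode instead.

```python
import json
import math

def _replace_nonfinite(obj):
    """Return a copy of obj with non-finite floats replaced by strings."""
    if isinstance(obj, float):
        if math.isnan(obj):
            return "NaN"
        if math.isinf(obj):
            return "Infinity" if obj > 0 else "-Infinity"
        return obj
    if isinstance(obj, dict):
        return {k: _replace_nonfinite(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [_replace_nonfinite(v) for v in obj]
    return obj

class NonFiniteFloatEncoder(json.JSONEncoder):
    """Hypothetical encoder: emits "NaN", "Infinity", "-Infinity" as strings."""

    def __init__(self, *args, **kwargs):
        # Fail loudly if a non-finite float somehow slips through.
        kwargs["allow_nan"] = False
        super().__init__(*args, **kwargs)

    def iterencode(self, o, _one_shot=False):
        # Both json.dump and json.dumps end up here.
        return super().iterencode(_replace_nonfinite(o), _one_shot)

out = json.dumps({"x": float("nan"), "y": [1.5, float("-inf")]},
                 cls=NonFiniteFloatEncoder)
print(out)  # {"x": "NaN", "y": [1.5, "-Infinity"]}
```

The output is valid JSON, so any strict parser can load it; the special strings then need to be converted back to floats by whoever knows the attribute is numeric.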

See also:

rly commented 8 months ago

This might also be an issue in https://github.com/hdmf-dev/hdmf-zarr. I haven't tested that.

magland commented 8 months ago

@rly and I discussed this further over slack, and here's what we have decided for now.

In Zarr attributes, the strings 'NaN', 'Infinity', and '-Infinity' represent the float values nan, inf, and -inf when converted back to hdf5. This also applies when these strings are items in attribute arrays.

These special string values are not allowed in the attributes of the original hdf5 file, and if they are present, an exception will be raised. This could pose a problem at some point, but we will cross that bridge when we get to it.
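As a rough illustration of these two rules, here is a sketch of the round trip. The function names encode_attr and decode_attr and the choice of ValueError are assumptions for this example, not the actual lindi implementation.

```python
import math

RESERVED = {"NaN": math.nan, "Infinity": math.inf, "-Infinity": -math.inf}

def encode_attr(value):
    """hdf5 -> JSON: non-finite floats become reserved strings (hypothetical)."""
    if isinstance(value, (list, tuple)):
        return [encode_attr(v) for v in value]
    if isinstance(value, str):
        if value in RESERVED:
            # A real string "NaN" would be indistinguishable from a float NaN
            # after encoding, so the source attribute is rejected.
            raise ValueError(f"reserved string in attribute: {value!r}")
        return value
    if isinstance(value, float) and not math.isfinite(value):
        if math.isnan(value):
            return "NaN"
        return "Infinity" if value > 0 else "-Infinity"
    return value

def decode_attr(value):
    """JSON -> hdf5: reserved strings always become floats (hypothetical)."""
    if isinstance(value, list):
        return [decode_attr(v) for v in value]
    if isinstance(value, str) and value in RESERVED:
        return RESERVED[value]
    return value

encoded = encode_attr([1.0, float("nan"), float("-inf")])
decoded = decode_attr(encoded)
```

Because the reserved strings are rejected on the way in, decoding does not need a dtype to be unambiguous.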

On a related note, for now we have opted to keep the hdf5 -> zarr conversion simple and not try to do anything fancy to retain data types in attributes (float32, float64, int32, uint32, etc., or bytes vs str). This information is lost in the conversion to JSON. Note that this only applies to attributes, not datasets. In datasets, the dtype is always retained. There could still be an issue with compound types, which are currently JSON-encoded, and we'll need to think more about this.

rly commented 8 months ago

To add, the Zarr v2 and v3 specs allow attributes to be JSON objects but do not specify how those JSON objects are structured. The Zarr team discussed potentially storing precise data types in attributes by using a JSON object with a type key such as "dtype" (and maybe also a shape key), for example:

{
    "dtype": "<f8",
    "value": "NaN"
}

or

{
    "dtype": "<f8",
    "value": [1, "NaN"]
}

We could adopt this convention down the road, but we would have to maintain the mapping/spec.
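If we did adopt it, reading such an attribute might look roughly like the sketch below. The function decode_typed_attr is hypothetical, and the float check is a deliberate simplification that only looks at NumPy-style dtype strings such as "<f8"; a real implementation would more likely go through np.dtype.

```python
import math

_SPECIAL = {"NaN": math.nan, "Infinity": math.inf, "-Infinity": -math.inf}

def _is_float_dtype(dtype):
    # Simplification: strip the byte-order character and check the kind code.
    return dtype.lstrip("<>|=").startswith("f")

def decode_typed_attr(obj):
    """Decode {"dtype": ..., "value": ...} into Python values (hypothetical)."""
    dtype, value = obj["dtype"], obj["value"]
    if not _is_float_dtype(dtype):
        return value  # non-float dtypes are passed through unchanged here

    def convert(v):
        if isinstance(v, list):
            return [convert(x) for x in v]
        if isinstance(v, str):
            return _SPECIAL[v]  # KeyError for any other string
        return float(v)

    return convert(value)

result = decode_typed_attr({"dtype": "<f8", "value": [1, "NaN"]})
```

With the dtype stored next to the value, a string attribute whose text happens to be "NaN" no longer collides with a float NaN.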