arviz-devs / arviz

Exploratory analysis of Bayesian models with Python
https://python.arviz.org

Support for None in attrs for `to_netcdf` #2365

Open wd60622 opened 1 month ago

wd60622 commented 1 month ago

Tell us about it

I am writing an InferenceData to netCDF where some of the attrs values are None. This type is JSON-serializable with json.dumps, but idata.to_netcdf raises a TypeError from a validation check.

The workaround is to apply json.dumps to the None value before putting it into attrs. However, that value then has to be parsed back from JSON after az.from_netcdf, which is an additional step.

Example script:

import arviz as az
import json

# Dummy idata
idata = az.InferenceData()

# attrs containing a None value: to_netcdf rejects these
attrs_doesnt_work = {
    "none_type": None,
    "list_type": ["A", "B", "C"],
    "int_type": 1,
    "float_type": 1.0,
}
# workaround: JSON-encode the None value first
attrs_works = attrs_doesnt_work.copy()
attrs_works["none_type"] = json.dumps(attrs_works["none_type"])

# the whole dict is JSON-serializable as-is
json_string = json.dumps(attrs_doesnt_work)
print(json_string)

file = "check-serializable.json"  # written as netCDF despite the extension
idata.attrs = attrs_doesnt_work
try:
    idata.to_netcdf(file)
except TypeError as e:
    print(e)

# the JSON-encoded version writes fine
idata.attrs = attrs_works
idata.to_netcdf(file)

idata_loaded = az.from_netcdf(file)
print(idata_loaded.attrs)
# extra step: decode the value back to None after loading
idata_loaded.attrs["none_type"] = json.loads(idata_loaded.attrs["none_type"])

assert idata_loaded.attrs == attrs_doesnt_work

Thoughts on implementation

None values would be supported in the attrs of InferenceData. Note: the check is coming from xarray, so this might be a better issue there; however, I get a slightly different issue when going through xarray directly.

ahartikainen commented 1 month ago

What is the error message?

wd60622 commented 1 month ago

Here's the error message:

TypeError                                 Traceback (most recent call last)
File ~/GitHub/pymc-eco/pymc-marketing/idata.py:22
     20 file = "check-serializable.json"
     21 idata.attrs = attrs_doesnt_work
---> 22 idata.to_netcdf(file)
     23 # except TypeError as e:
     24 #     print(e)
     25 #
   (...)
     33 # assert idata_loaded.attrs == attrs_doesnt_work
     34 #

File ~/micromamba/envs/pymc-marketing-dev/lib/python3.10/site-packages/arviz/data/inference_data.py:486, in InferenceData.to_netcdf(self, filename, compress, groups, engine, base_group, overwrite_existing)
    483     mode = "w"  # overwrite first, then append
    485 if self._attrs:
--> 486     xr.Dataset(attrs=self._attrs).to_netcdf(
    487         filename, mode=mode, engine=engine, group=base_group
    488     )
    489     mode = "a"
    491 if self._groups_all:  # check's whether a group is present or not.

File ~/micromamba/envs/pymc-marketing-dev/lib/python3.10/site-packages/xarray/core/dataset.py:2327, in Dataset.to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   2324     encoding = {}
   2325 from xarray.backends.api import to_netcdf
-> 2327 return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
   2328     self,
   2329     path,
   2330     mode=mode,
   2331     format=format,
   2332     group=group,
   2333     engine=engine,
   2334     encoding=encoding,
   2335     unlimited_dims=unlimited_dims,
   2336     compute=compute,
   2337     multifile=False,
   2338     invalid_netcdf=invalid_netcdf,
   2339 )

File ~/micromamba/envs/pymc-marketing-dev/lib/python3.10/site-packages/xarray/backends/api.py:1290, in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
   1288 # validate Dataset keys, DataArray names, and attr keys/values
   1289 _validate_dataset_names(dataset)
-> 1290 _validate_attrs(dataset, invalid_netcdf=invalid_netcdf and engine == "h5netcdf")
   1292 try:
   1293     store_open = WRITEABLE_STORES[engine]

File ~/micromamba/envs/pymc-marketing-dev/lib/python3.10/site-packages/xarray/backends/api.py:201, in _validate_attrs(dataset, invalid_netcdf)
    199 # Check attrs on the dataset itself
    200 for k, v in dataset.attrs.items():
--> 201     check_attr(k, v, valid_types)
    203 # Check attrs on each variable within the dataset
    204 for variable in dataset.variables.values():

File ~/micromamba/envs/pymc-marketing-dev/lib/python3.10/site-packages/xarray/backends/api.py:193, in _validate_attrs.<locals>.check_attr(name, value, valid_types)
    187     raise TypeError(
    188         f"Invalid name for attr: {name!r} must be a string for "
    189         "serialization to netCDF files"
    190     )
    192 if not isinstance(value, valid_types):
--> 193     raise TypeError(
    194         f"Invalid value for attr {name!r}: {value!r}. For serialization to "
    195         "netCDF files, its value must be of one of the following types: "
    196         f"{', '.join([vtype.__name__ for vtype in valid_types])}"
    197     )

TypeError: Invalid value for attr 'none_type': None. For serialization to netCDF files, its value must be of one of the following types: str, Number, ndarray, number, list, tuple

Definitely closely related to how xarray stores its data.

OriolAbril commented 2 weeks ago

My main recommendation would be to use zarr instead. It is faster and supports None in attributes.
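
A minimal sketch of that route, mirroring the xr.Dataset(attrs=...) call from the traceback above (untested; assumes zarr is installed, and the store path is just an example):

import xarray as xr

attrs = {"none_type": None, "list_type": ["A", "B", "C"], "int_type": 1, "float_type": 1.0}

# zarr stores attributes as JSON, so None is written as null and read back as None
xr.Dataset(attrs=attrs).to_zarr("check-attrs.zarr", mode="w")
print(xr.open_zarr("check-attrs.zarr").attrs)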

As for the feature, if you check what json.dumps is actually doing, the two generated attrs dicts (the one that works and the one that doesn't) don't round-trip to the same thing:

In [7]: json.dumps(attrs_works)
Out[7]: '{"none_type": "null", "list_type": ["A", "B", "C"], "int_type": 1, "float_type": 1.0}'

In [8]: json.dumps(attrs_doesnt_work)
Out[8]: '{"none_type": null, "list_type": ["A", "B", "C"], "int_type": 1, "float_type": 1.0}'

So when you use json.loads on these dumped strings, the output for the "working version" has the string "null" instead of None. If we wanted to support that, we'd have to call json.dumps on idata.attrs directly, replace attrs with a single key holding that string before writing, then on reading call json.loads on that single-key attrs object if present (we could use __to_be_json_loaded__ as the key to indicate that) and replace attrs again with the actual dictionary.
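
A rough sketch of that idea (hypothetical helper names, not current arviz API; it only illustrates the wrap/unwrap step around writing and reading):

import json

JSON_KEY = "__to_be_json_loaded__"

def pack_attrs(attrs):
    # collapse the whole attrs dict into one JSON string under a sentinel key
    return {JSON_KEY: json.dumps(attrs)} if attrs else attrs

def unpack_attrs(attrs):
    # restore the original dict if the sentinel key is present
    return json.loads(attrs[JSON_KEY]) if JSON_KEY in attrs else attrs

# to_netcdf would write pack_attrs(idata.attrs),
# and from_netcdf would set idata.attrs = unpack_attrs(loaded_attrs)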

Given this is about InferenceData attributes, this should be implemented here, and I think we'd all be fine with a PR, but it might become confusing down the line because idata.posterior.attrs or idata.posterior.variable.attrs would behave differently than the top-level ones 🤔.

So again, I think the optimal solution is moving to zarr instead of adding patches on top of the netCDF-based machinery. I know netcdf/h5netcdf is currently a dependency of arviz, but the idea is for it to stop being one (and it is already only an optional dependency of arviz-base).

ahartikainen commented 2 weeks ago

We could have an option to "sanitize" attrs, or users could even write one for their own needs. That said, in the long run it is better to be aware of what goes into attrs and to have a "system" of good habits.

I personally keep everything as a JSON string there. That of course limits what I can save in there, but it really is a good way to keep attrs netCDF-compatible.
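
For example, a per-value version of that habit could look like this (hypothetical helper, using the valid types listed in the error message above):

import json
import numbers
import numpy as np

def sanitize_attrs(attrs):
    # JSON-encode anything that netCDF attrs cannot hold directly (e.g. None or nested dicts)
    safe_types = (str, numbers.Number, np.ndarray, list, tuple)
    return {
        key: value if isinstance(value, safe_types) else json.dumps(value)
        for key, value in attrs.items()
    }

# idata.attrs = sanitize_attrs(idata.attrs) before calling idata.to_netcdf(...)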