Open wd60622 opened 1 month ago
What is the error message?
Here's the error message:
TypeError                                 Traceback (most recent call last)
File ~/GitHub/pymc-eco/pymc-marketing/idata.py:22
     20 file = "check-serializable.json"
     21 idata.attrs = attrs_doesnt_work
---> 22 idata.to_netcdf(file)
     23 # except TypeError as e:
     24 #     print(e)
     25 #
    (...)
     33 # assert idata_loaded.attrs == attrs_doesnt_work
     34 #

File ~/micromamba/envs/pymc-marketing-dev/lib/python3.10/site-packages/arviz/data/inference_data.py:486, in InferenceData.to_netcdf(self, filename, compress, groups, engine, base_group, overwrite_existing)
    483     mode = "w"  # overwrite first, then append
    485 if self._attrs:
--> 486     xr.Dataset(attrs=self._attrs).to_netcdf(
    487         filename, mode=mode, engine=engine, group=base_group
    488     )
    489     mode = "a"
    491 if self._groups_all:  # check's whether a group is present or not.

File ~/micromamba/envs/pymc-marketing-dev/lib/python3.10/site-packages/xarray/core/dataset.py:2327, in Dataset.to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   2324     encoding = {}
   2325 from xarray.backends.api import to_netcdf
-> 2327 return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
   2328     self,
   2329     path,
   2330     mode=mode,
   2331     format=format,
   2332     group=group,
   2333     engine=engine,
   2334     encoding=encoding,
   2335     unlimited_dims=unlimited_dims,
   2336     compute=compute,
   2337     multifile=False,
   2338     invalid_netcdf=invalid_netcdf,
   2339 )

File ~/micromamba/envs/pymc-marketing-dev/lib/python3.10/site-packages/xarray/backends/api.py:1290, in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
   1288 # validate Dataset keys, DataArray names, and attr keys/values
   1289 _validate_dataset_names(dataset)
-> 1290 _validate_attrs(dataset, invalid_netcdf=invalid_netcdf and engine == "h5netcdf")
   1292 try:
   1293     store_open = WRITEABLE_STORES[engine]

File ~/micromamba/envs/pymc-marketing-dev/lib/python3.10/site-packages/xarray/backends/api.py:201, in _validate_attrs(dataset, invalid_netcdf)
    199 # Check attrs on the dataset itself
    200 for k, v in dataset.attrs.items():
--> 201     check_attr(k, v, valid_types)
    203 # Check attrs on each variable within the dataset
    204 for variable in dataset.variables.values():

File ~/micromamba/envs/pymc-marketing-dev/lib/python3.10/site-packages/xarray/backends/api.py:193, in _validate_attrs.<locals>.check_attr(name, value, valid_types)
    187     raise TypeError(
    188         f"Invalid name for attr: {name!r} must be a string for "
    189         "serialization to netCDF files"
    190     )
    192 if not isinstance(value, valid_types):
--> 193     raise TypeError(
    194         f"Invalid value for attr {name!r}: {value!r}. For serialization to "
    195         "netCDF files, its value must be of one of the following types: "
    196         f"{', '.join([vtype.__name__ for vtype in valid_types])}"
    197     )

TypeError: Invalid value for attr 'none_type': None. For serialization to netCDF files, its value must be of one of the following types: str, Number, ndarray, number, list, tuple
Definitely closely related to how xarray stores its data.
My main recommendation would be to use zarr instead. It is faster and supports None in attributes.
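For illustration, a minimal sketch of that route. It assumes the zarr package is installed and that InferenceData.to_zarr / az.from_zarr are available in the installed arviz version; the data and filenames are made up:

```python
import arviz as az
import numpy as np

# Hypothetical example: zarr stores attrs as JSON, so a None value becomes
# null on disk instead of failing validation as in the netCDF path above.
idata = az.from_dict(posterior={"mu": np.random.normal(size=(2, 50))})
idata.attrs = {"none_type": None, "int_type": 1}

idata.to_zarr("idata.zarr")            # no attr validation error here
idata_loaded = az.from_zarr("idata.zarr")
print(idata_loaded.attrs)              # None survives if top-level attrs are preserved by this arviz version
```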
As for the feature, if you check what json.dumps is actually doing with the two generated attribute dictionaries, attrs_works and attrs_doesnt_work, they don't round-trip the same way:
In [7]: json.dumps(attrs_works)
Out[7]: '{"none_type": "null", "list_type": ["A", "B", "C"], "int_type": 1, "float_type": 1.0}'
In [8]: json.dumps(attrs_doesnt_work)
Out[8]: '{"none_type": null, "list_type": ["A", "B", "C"], "int_type": 1, "float_type": 1.0}'
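For reference, a plausible reconstruction of the two dictionaries, which are not shown in this thread; the names follow the traceback and the json.dumps outputs above:

```python
import json

# Hypothetical reconstruction: the only difference is how None is stored.
attrs_doesnt_work = {
    "none_type": None,                  # actual None -> netCDF write fails
    "list_type": ["A", "B", "C"],
    "int_type": 1,
    "float_type": 1.0,
}
attrs_works = {**attrs_doesnt_work, "none_type": json.dumps(None)}  # the string "null"

# json.loads(json.dumps(attrs_works))["none_type"] is the string "null",
# not None, so the "working" variant does not round-trip to the original value.
```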
so when you use json.loads on these dumped strings, the output for the "working version" has the string "null" instead of None. If we wanted to support that, we'd have to call json.dumps on idata.attrs directly, replace attrs with a single key holding that string before writing, then call json.loads on that single-key attrs object if present (we could use __to_be_json_loaded__ as the key to indicate that) and replace the attrs again to get the actual dictionary back.
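A rough sketch of that idea, with hypothetical helper names rather than an actual arviz API:

```python
import json
import arviz as az

SENTINEL = "__to_be_json_loaded__"

def to_netcdf_with_json_attrs(idata, filename):
    # Stash the whole attrs dict as a single JSON string under a sentinel
    # key before writing, so netCDF only ever sees one string-valued attr.
    original_attrs = dict(idata.attrs)
    idata.attrs = {SENTINEL: json.dumps(original_attrs)}
    try:
        idata.to_netcdf(filename)
    finally:
        idata.attrs = original_attrs   # restore the in-memory object

def from_netcdf_with_json_attrs(filename):
    # If the sentinel key is present, decode it back into the real dict.
    idata = az.from_netcdf(filename)
    if set(idata.attrs) == {SENTINEL}:
        idata.attrs = json.loads(idata.attrs[SENTINEL])
    return idata
```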
Given this is about InferenceData attributes, this should be implemented here, and I think we'd all be fine with a PR, but that might become more confusing down the line because idata.posterior.attrs or idata.posterior.variable.attrs will behave differently than the top-level ones 🤔.
So again, I think the optimal solution is moving to zarr instead of working on patches on top of netCDF-based stuff. I know netcdf/h5netcdf is a dependency of arviz for now, but the idea is for it to stop being one (and it is already only an optional dependency of arviz-base).
We could have an option to "sanitize" attrs, or users could even create one for their own needs. That said, in the long run it is better to be aware of what goes into attrs and have a "system" of good habits.
I personally keep everything as a JSON string there. That of course limits what I can save in there, but it really is a good way to keep attrs netCDF-compatible.
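For example, a user-side sanitizer could look like the sketch below (the helper name is made up, not an arviz function); the "everything as one JSON string" habit mentioned above is even simpler, e.g. attrs = {"metadata": json.dumps(original_attrs)} with a json.loads after loading:

```python
import json

def sanitize_attrs(attrs):
    """Hypothetical helper: make attrs netCDF-safe by JSON-encoding the
    values (here just None) that xarray's attr validation refuses."""
    return {k: json.dumps(v) if v is None else v for k, v in attrs.items()}

print(sanitize_attrs({"none_type": None, "int_type": 1}))
# {'none_type': 'null', 'int_type': 1}
```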
Tell us about it
I am writing an InferenceData to netCDF where the InferenceData uses attrs in which some of the values are NoneType. This type is JSON-serializable with json.dumps. However, there is a TypeError from an arviz check. A workaround is to call json.dumps on the None values before making them values of attrs. However, that then requires them to be parsed from JSON after az.from_netcdf, which is an additional step. Example script:
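A minimal reproduction consistent with the traceback above (this is a sketch, not the author's original script; the data and filename are placeholders):

```python
import json
import arviz as az
import numpy as np

idata = az.from_dict(posterior={"mu": np.random.normal(size=(2, 50))})

attrs_doesnt_work = {"none_type": None, "int_type": 1}
idata.attrs = attrs_doesnt_work
# idata.to_netcdf("check-serializable.nc")  # raises TypeError on 'none_type'

# Workaround: JSON-encode the None values before writing ...
idata.attrs = {k: json.dumps(v) if v is None else v for k, v in attrs_doesnt_work.items()}
idata.to_netcdf("check-serializable.nc")

# ... and decode them again after loading, which is the extra step.
idata_loaded = az.from_netcdf("check-serializable.nc")
attrs = {k: json.loads(v) if v == "null" else v for k, v in idata_loaded.attrs.items()}
```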
Thoughts on implementation
None type would be supported in the attrs of InferenceData. Note: this is coming from xarray, so it might be a better issue there. However, I get a slightly different issue.