calliope-project / calliope

A multi-scale energy systems modelling framework
https://www.callio.pe
Apache License 2.0
299 stars 94 forks source link

Incorrect net_cdf serialisation for top level attributes #614

Closed irm-codebase closed 4 months ago

irm-codebase commented 4 months ago

What happened?

When saving models into net_cdf files, our serialization algorithm will convert single list elements at the top level into strings.

Replication steps:

import calliope

model = calliope.examples.national_scale()

model_file = "outputs/model.nc"
out_file = "outputs/output.nc"

model._model_data.attrs["applied_math"] = ["a single value"]
model.to_netcdf(model_file)

test = calliope.read_netcdf(model_file)

print(model._model_data.attrs["applied_math"])

Here, "applied_math" will be a string, not a list.

Which operating systems have you used?

Version

v0.7

Relevant log output

No response

irm-codebase commented 4 months ago

Did some extra digging around. The core of the issue is that the netCDF standard does not really consider attributes as anything more than single values per key (https://cfconventions.org/cf-conventions/cf-conventions.html#_attributes).

xarray does not support saving data in the attributes of datasets beyond single values either, which is why we have our own serialization.

I'm playing around with using pre-defined encoders for this, but so far the produced string is too big for xarray... The encoding works perfectly fine, but the saving step does not.

Here's an example:

import codecs
import pickle

import calliope

model = calliope.examples.national_scale()
attrs_orignial = model._model_data.attrs.copy()

pickled_attrs = codecs.encode(pickle.dumps(attrs_orignial), "base64").decode()
model._model_data.attrs.clear()
model._model_data.attrs["encoded"] = pickled_attrs
unpickled_attrs = pickle.loads(codecs.decode(pickled_attrs.encode(), "base64"))

assert attrs_orignial == unpickled_attrs
irm-codebase commented 4 months ago

xarray provides interfaces for other formats, and it seems like zarr fits our use-case the best: it fully separates data and metadata, and is able to represent the second. By comparison, netcdf is less flexible because of the limited typing of attrs (no dictionary or list support).

Saving a full calliope model is as easy as:

import calliope
import xarray

model = calliope.examples.national_scale()
model._model_data.to_netcdf("outputs/model.zarr")
test = xarray.open_zarr("outputs/model.zarr")

In this case, test contains the full model definition, with 0 errors.

It seems like zarr is newer (as in, has been around for almost a decade, against 30+ years for netcdf). It also comes from the geospatial / climate science field:

brynpickering commented 4 months ago

https://help.marine.copernicus.eu/en/articles/8176692-how-to-choose-between-netcdf-and-zarr-format-using-the-toolbox

Summary of summary table: zarr is better than netcdf in every way 😄

netcdf storing single list elements as strings is a bit of a pain. We manage all the other formats that need converting before storing in io.py (dictionaries, sets, etc.). We could easily just add a serialiser for lists so that it catches single list items. We probably should.

Happy for you to add functionality to convert to .zarr and to deprecate saving to .nc. Don't remove it entirely as it should be phased out at a later point - just have a deprecation warning with a recommendation to use .zarr. And then we just add this serialiser for single element lists so that they are returned as single element lists.

irm-codebase commented 4 months ago

I agree on keeping netCDF for a while, its too popular. I suspect our current serialisation will be enough once #619 is fixed, but it's a toss.

I'll come back to this one afterwards.