calliope-project / calliope

A multi-scale energy systems modelling framework
https://www.callio.pe
Apache License 2.0
276 stars 89 forks source link

Ensuring data and model configuration remain in sync at all times (model, backend, backendmath) #619

Open irm-codebase opened 1 week ago

irm-codebase commented 1 week ago

What happened?

What is happening?

Our current approach of handling data and configuration updates is inconsistent across the code:

Similarly, the model._model_data object is handled inconsistently across modules. See this example:

id(model._model_data)
140507415076176
id(model.math_documentation.inputs)
140507415076176
id(model.backend.inputs)  # shallow copy
140507414858960

Lastly, there is another variable with similar behavior, but defined at a different level: model._model_def_dict. This means that we are handling similar parameters at different levels.

Why is this a problem?

As it stands, you cannot be 100% sure that, for a saved file, the solved data and the stored metadata / configuration correspond. This is a severe issue for replicability!. For example: you can never be sure if your solved file is the original "planning" solution, or a test you ran on operate mode.

If you solve and save different models with different configurations, modes and math files, you should be able to reconstruct how that data was created via the configuration.

How to solve this?

I propose to introduce the following design requirements:

  1. model._model_data shall not be duplicated or shallow copied by underlying functionality.
  2. All modules that are not model.py should strive to only realize operations with model._model_data, and not modify it unless absolutely necessary (like saving results!).
  3. Configuration, math, and model definition must be saved and handled at the model level (i.e., no more saving stuff at model._model_data.attrs. This is to encourage better practices and avoid desync issues / losing sight of were things are. When saving files, we can serialize and only then add things to model._model_data.attrs.
  4. Configuration and model definition updates must always be saved to model. Is up to users to decide if they save the results or not, but the calliope model file should always tell them what it will do.
  5. "Warm start" updates to built models must reflect updates to parameters both in the backend and in model._model_data.

This helps us keep the code clean, ensures we better follow netCDF specifications (since it cannot handle lists or dictionaries), and should reduce bugs in the future.

Which operating systems have you used?

Version

v0.7

Relevant log output

No response

brynpickering commented 1 week ago

There's a bit of a compromise solution required to handle when configuration is updated on-the-fly. I agree that we need to store the configuration specific to each unique build/solve run, but I prefer the initialised configuration to remain a known entity that you overwrite for a specific build/solve, but that remains as a static underlying config for each independent build/solve.

So, model.build(mode=operate) and then model.build(force=True) should not build the same model in operate mode both times. The second time, it should revert to the base config (probably mode=plan). Perhaps the best approach would be to store these kwarg config overrides in a separate attribute so that the metadata stored in file reflects the build/solve phases that took place to achieve the results. So if you do lots of re-building/solving of the model, some dictionary like kwarg_config is kept up-to-date with the latest kwargs and then when you save the model to file, it superimposes the kwarg_config onto the main config, cementing that config in history.

brynpickering commented 1 week ago

We had another issue open about results vs inputs (not sure where it is now), in that perhaps we do want results and inputs to be stored in two separate datasets, rather than having a model._model_data that a user only ever gets filtered views onto. This would help create a consistent approach to not modifying model._model_data (or, indeed, not having it at all)

irm-codebase commented 1 week ago
brynpickering commented 1 week ago

I would prefer not having _override used as we already use that term a lot. It will create confusion w.r.t. the overrides key in the YAML.

irm-codebase commented 6 days ago

After some discussion between @brynpickering and I, this is the setup I've come up with. It's meant to enable easier debugging.

The most important feature is the breakdown of .config. When passing stuff to the backend, the priority order should be ._runtime > ._config_kwargs > ._config_overrides > ._config_file.

For now, I've added a _ to most things. In reality, I think we do not need it for math, default, data and timestamps. For config, a @property that combines all should be introduced.

Object Old setup New setup Comment Type Backend access? netCDF metadata?
Model() .math \| ._model_data.attrs["math"] ._math AttrDict yes yes
._model_data ._data xarray yes yes
.config \| ._model_data.attrs["config"] ._config_file Not expected to be modified whatsoever. AttrDict yes
.defaults \| ._model_data.attrs["defaults"] ._defaults AttrDict yes yes
._model_def_path ._def_path Path
._timings \| ._model_data.attrs["timestamp xxxxx"] ._timestamps Both do the same. AttrDict yes
._model_def_dict Eliminated. Should be local variable.
._config_override Saves overrides given during init AttrDict yes
._config_kwargs Saves extra **kwargs given during init, build() and solve() AttrDict yes
._model_data.attrs["calliope_version_initialized"] \| ._model_data.attrs["scenario"] \| ._model_data.attrs["allow_operate_mode"] \| ._model_data.attrs["applied_additional_math"] ._runtime Contains extra flags/variables we set during execution. AttrDict yes