What is happening?

Our current approach of handling data and configuration updates is inconsistent across the code:

Sometimes configuration updates are passed to model._model_data (e.g., during initialization using a preexisting file).
Sometimes the configuration update is lost, but the model data is modified! (e.g., requesting operation mode for a solved model).

Similarly, the model._model_data object is handled inconsistently across modules. See this example:

id(model._model_data)
140507415076176
id(model.math_documentation.inputs)
140507415076176
id(model.backend.inputs)  # shallow copy
140507414858960

Lastly, there is another variable with similar behavior, but defined at a different level: model._model_def_dict. This means that we are handling similar parameters at different levels.

Why is this a problem?

As it stands, you cannot be 100% sure that, for a saved file, the solved data and the stored metadata / configuration correspond. This is a severe issue for replicability!. For example: you can never be sure if your solved file is the original "planning" solution, or a test you ran on operate mode.

If you solve and save different models with different configurations, modes and math files, you should be able to reconstruct how that data was created via the configuration.

How to solve this?

I propose to introduce the following design requirements:

model._model_data shall not be duplicated or shallow copied by underlying functionality.
All modules that are not model.py should strive to only realize operations with model._model_data, and not modify it unless absolutely necessary (like saving results!).
Configuration, math, and model definition must be saved and handled at the model level (i.e., no more saving stuff at model._model_data.attrs. This is to encourage better practices and avoid desync issues / losing sight of were things are. When saving files, we can serialize and only then add things to model._model_data.attrs.
Configuration and model definition updates must always be saved to model. Is up to users to decide if they save the results or not, but the calliope model file should always tell them what it will do.
"Warm start" updates to built models must reflect updates to parameters both in the backend and in model._model_data.

This helps us keep the code clean, ensures we better follow netCDF specifications (since it cannot handle lists or dictionaries), and should reduce bugs in the future.

Which operating systems have you used?

[ ] macOS
[ ] Windows
[ ] Linux

Version

v0.7

Relevant log output

No response

brynpickering commented 1 week ago

There's a bit of a compromise solution required to handle when configuration is updated on-the-fly. I agree that we need to store the configuration specific to each unique build/solve run, but I prefer the initialised configuration to remain a known entity that you overwrite for a specific build/solve, but that remains as a static underlying config for each independent build/solve.

So, model.build(mode=operate) and then model.build(force=True) should not build the same model in operate mode both times. The second time, it should revert to the base config (probably mode=plan). Perhaps the best approach would be to store these kwarg config overrides in a separate attribute so that the metadata stored in file reflects the build/solve phases that took place to achieve the results. So if you do lots of re-building/solving of the model, some dictionary like kwarg_config is kept up-to-date with the latest kwargs and then when you save the model to file, it superimposes the kwarg_config onto the main config, cementing that config in history.

brynpickering commented 1 week ago

We had another issue open about results vs inputs (not sure where it is now), in that perhaps we do want results and inputs to be stored in two separate datasets, rather than having a model._model_data that a user only ever gets filtered views onto. This would help create a consistent approach to not modifying model._model_data (or, indeed, not having it at all)

irm-codebase commented 1 week ago

First comment: I agree. Storing the overrides would be even better for repeatability, and would make debugging easier! I have a design in mind, I will just call theses init_override and build_override.
Second comment: I am... not sure. I am still familiarizing myself with the backend code. I'd say that this is a good solution only as long as it save both data and code to maintain. No preference on my part, but for now I will ignore this one when working on fixes for this. Once the duplicates and additional pointers are removed it will be easier to see how this could be achieved, so at least this issue will help!

brynpickering commented 1 week ago

I would prefer not having _override used as we already use that term a lot. It will create confusion w.r.t. the overrides key in the YAML.

irm-codebase commented 6 days ago

After some discussion between @brynpickering and I, this is the setup I've come up with. It's meant to enable easier debugging.

The most important feature is the breakdown of .config. When passing stuff to the backend, the priority order should be ._runtime > ._config_kwargs > ._config_overrides > ._config_file.

For now, I've added a _ to most things. In reality, I think we do not need it for math, default, data and timestamps. For config, a @property that combines all should be introduced.

Object	Old setup	New setup	Comment	Type	Backend access?	netCDF metadata?
Model()	`.math \\| ._model_data.attrs["math"]`	`._math`		AttrDict	yes	yes
	`._model_data`	`._data`		xarray	yes	yes
	`.config \\| ._model_data.attrs["config"]`	`._config_file`	Not expected to be modified whatsoever.	AttrDict		yes
	`.defaults \\| ._model_data.attrs["defaults"]`	`._defaults`		AttrDict	yes	yes
	`._model_def_path`	`._def_path`		Path
	`._timings \\| ._model_data.attrs["timestamp xxxxx"]`	`._timestamps`	Both do the same.	AttrDict		yes
	`._model_def_dict`		Eliminated. Should be local variable.
		`._config_override`	Saves overrides given during init	AttrDict		yes
		`._config_kwargs`	Saves extra kwargs given during init**, build() and solve()	AttrDict		yes
	`._model_data.attrs["calliope_version_initialized"] \\| ._model_data.attrs["scenario"] \\| ._model_data.attrs["allow_operate_mode"] \\| ._model_data.attrs["applied_additional_math"]`	`._runtime`	Contains extra flags/variables we set during execution.	AttrDict		yes

calliope-project / calliope

Ensuring data and model configuration remain in sync at all times (model, backend, backendmath) #619

What happened?

What is happening?

Why is this a problem?

How to solve this?

Which operating systems have you used?

Version

Relevant log output