Letting users define parameters and additional dimensions in YAML files

irm-codebase commented 1 month ago

What can be improved?

Opening this issue after discussions with @sjpfenninger and @brynpickering

Currently, the math data and schemas are a bit 'mixed'. In particular, model_def_schema.yaml contains parameter defaults, which makes looking for them difficult.

Similarly, the current parsing of yaml files is a bit difficult in where/how new dimensions and parameters are declared.

For parameters:

Declare current parameters in base.yaml instead.
Make users explicitly declare new parameters in yaml files, with at least the default value. We can just add a new params section for it.
Parameters must at least have "default" declared (with some additional logic to convert 0 to null, or warning users of it).

For dimensions:

Make users declare new dimensions in yaml files. Possible important settings are "ordered" / "numeric" for stuff like years and timesteps were the distance between them might matter.
Ditto for groups (see #604 for discussion).

The idea is to make model definition files less ambiguous. We probably should also think on how this affects schema evaluation, and some of the parsing.

Version

v0.7

brynpickering commented 4 weeks ago

I've had a look at this and think it could work pretty well most of the time. However, there is the issue that config for parameters is then only analysed on calling calliope.Model.build. This is problematic for time resampling since one of the options is about how to resample a given parameter (taking the average or the sum).

One thing we could do is resample the data on-the-fly when calling calliope.Model.build (something I considered in the past), so the timeseries isn't resampled at instatiation. This would probably work fine but would come with the disadvantage that either results have to be in their own xarray dataset (with a different timesteps dimension) or the results are returned with a sparse timesteps dimension (e.g. every other timestep has a value with a 2h resolution).

If we continue with resampling at instatiation, there needs to be a way of defining parameter config that is separate from the math - does this get messy?

irm-codebase commented 4 weeks ago

I think it's better to go with the option that is (long term) easier to manage, which would be resampling during build.

Functionality-wise, it makes sense: you are telling the model to build/prepare itself. If need be, we can split backend steps and just wrap around them:

Model init
Model build a. build.math b. build.timeseries c. build.whatever
Model solve...

For now, let us assume that we control the build process (i.e., model.build will run everything). If we want users to be able to build specifics, we'll need an attribute that lets us know the state machine's status.

brynpickering commented 4 weeks ago

I agree that it is easier to manage on our side, but how about the data structures that come out of the other end? If you have hourly input data and then resample to monthly data, you'll get a timeseries of 8760 elements on the inputs and only 12 elements on the outputs... Should resampled inputs be available somewhere? If yes, then we risk bloating the model if we resample to close to the input resolution (e.g. 2hrs). If no, visualising input and output timeseries data will be a pain.

irm-codebase commented 4 weeks ago

That's where the separation of post-processing (https://github.com/calliope-project/calliope/issues/638) and non-conflicting configurations (https://github.com/calliope-project/calliope/issues/626) come in!

If done right, the selected "mode" might activate some post-processing step that makes data cleaner (in the case of re-sampling, you can activate/deactivate a "de-sampler"?). Although to be honest I do not see this as an issue in the case of re-sampling... you request monthly data, you get monthly data. Otherwise we'd bloat the model output unnecessarily...

Also, post-processing stuff should only support "official" modes. If users define their own math, is up to them to post-treat it (like any other piece of software, really).

brynpickering commented 4 weeks ago

I think this is separate to either of those. It's instead about the storage of data indexed over the time dimension. We would either need to split the inputs and results into two separate xarray datasets with different length timesteps or have them in one dataset with lots of empty data. It's perhaps more linked to #634.

If two separate datasets, on saving to file we'd merge those two and still end up with sparse data in the time dimension.

irm-codebase commented 4 weeks ago

@brynpickering just to confirm: sparse data in this case would mean that xarray will not fill in those spaces with nan?

Because otherwise we have an issue in our hands, since nan is not sparse.

brynpickering commented 3 weeks ago

It will fill in those data with NaN. I mean sparse in terms of it being a sparse matrix (more empty than non-empty entries) with NaN being used to describe empty entries.

irm-codebase commented 3 weeks ago

In that case, I would avoid doing this unless we guarantee data sparsity through sparse (or something similar), because each of those nan values will take 64 bits (the same as a float). Given the size of our models, this will result in very large datasets that are mostly empty.

Keeping inputs and results separate is better since it saves us from significant bloat, data wise, I think...

calliope-project / calliope

Letting users define parameters and additional dimensions in YAML files #642

What can be improved?

Version