calliope-project / calliope

A multi-scale energy systems modelling framework
https://www.callio.pe
Apache License 2.0

Allow mappings for tabular data #680

Open irm-codebase opened 1 week ago

irm-codebase commented 1 week ago

What can be improved?

Being able to load tabular data is a wonderful new feature. But it is currently limited because, in certain cases, the data is forced to 'match' Calliope's dimension names. Enforcing strict naming by default makes a lot of sense: you avoid ambiguity and you also avoid the risks of relying on column position.

However, it will often be too inflexible.

Reasoning

Some of our names might lead to files being less human readable, or finicky:

This will lead to a lot of 'boilerplate' code that just shapes the data to fit Calliope's naming. See the following 3 examples for different names used for timeseries in Euro Calliope with v6.10:

[Three images: timeseries files, each using a different name for the time column]

All 3 are equally 'human' readable, but since none of them uses the name timesteps, they won't load into Calliope.

Proposal

An option to use mappings would solve this issue. For example, you could load one of the timeseries above this way:

data_sources:
  demand_elec_timeseries:
    source: timeseries/demand/electricity.csv
    columns: nodes
    rows: {timesteps: time}
    add_dims:
      techs: demand_elec
      parameters: sink_use_equals

This is still strict, but more flexible.

Version

v0.7.0.dev3

sjpfenninger commented 1 week ago

I think this is a good idea. So the proposal is to allow mappings in this way, right?

rows: {column_name_in_data: dimension_name_in_model}

rather than just

rows: dimension_name_in_model

which requires the dimension name to appear exactly as-is in the data, and should fail in all other cases.

irm-codebase commented 1 week ago

Pretty much. If possible, though, I think supporting both is the best case.

Basically, if the type is string, assume it's a match. If it's a dict, assume it's a mapping?
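
A minimal Python sketch of that dispatch, not Calliope's actual loader. `resolve_dim_spec` is a hypothetical helper, and the sketch assumes the dict form reads {dimension_name_in_model: name_in_data}, as in the `rows: {timesteps: time}` example above. The restatement earlier in the thread lists the two names the other way round, so the direction would need to be settled before implementing anything.

```python
from typing import Union


def resolve_dim_spec(spec: Union[str, dict]) -> dict:
    """Return {model_dimension: name_in_data} for a `rows`/`columns` entry.

    Hypothetical helper for illustration only.
    """
    if isinstance(spec, str):
        # Plain string: the data must use the dimension name exactly as-is.
        return {spec: spec}
    if isinstance(spec, dict):
        # Dict: assumed to map model dimension name -> name used in the data.
        for dim, name_in_data in spec.items():
            if not isinstance(dim, str) or not isinstance(name_in_data, str):
                raise TypeError(f"Mapping entries must be strings: {dim!r}: {name_in_data!r}")
        return dict(spec)
    # Anything else is rejected, which keeps the option strict.
    raise TypeError(f"Expected a string or a mapping, got {type(spec).__name__}")


print(resolve_dim_spec("nodes"))
print(resolve_dim_spec({"timesteps": "time"}))
```

Returning the same shape for both input types means the rest of a loader would only ever handle a mapping, even though the config accepts two types.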

brynpickering commented 2 days ago

@irm-codebase I'd argue that those different columns actually introduce some ambiguity. How is one to know that the ones not suffixed with _utc are in the UTC timezone? Generally, it makes a lot of sense to follow a standard index name format (e.g., there's a relatively strict set used by the climate community). It's so much easier to maintain tabular data if you follow a standard format.

Still, I'm willing to introduce this mapping for these edge cases.

Having different data types for a single config entry is always a pain to maintain and to document. I would prefer to have something like a mapping key, e.g.:

data_sources:
  demand_elec_timeseries:
    source: timeseries/demand/electricity.csv
    columns: nodes
    rows: timesteps
    map_dims:
      timesteps: time
    add_dims:
      techs: demand_elec
      parameters: sink_use_equals
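
As a rough illustration of what a map_dims step could do at load time, here is a sketch in Python with pandas. It is not Calliope's implementation: `apply_map_dims` and the inline CSV are made up for illustration, and the sketch assumes each map_dims entry reads model dimension: name in the file, since the files discussed in this thread do not contain a column called timesteps.

```python
import io

import pandas as pd

# Made-up stand-in for timeseries/demand/electricity.csv.
CSV_TEXT = """time,node_a,node_b
2005-01-01 00:00,-1.0,-2.0
2005-01-01 01:00,-1.5,-2.5
"""


def apply_map_dims(df: pd.DataFrame, map_dims: dict) -> pd.DataFrame:
    """Rename index levels from their names in the file to model dimension names."""
    # Assumed direction: {model_dimension: name_in_file}.
    rename = {name_in_file: dim for dim, name_in_file in map_dims.items()}
    missing = set(rename) - set(df.index.names)
    if missing:
        # Stay strict: a mapping that matches nothing in the data is an error.
        raise KeyError(f"map_dims names not found in data: {sorted(missing)}")
    out = df.copy()
    # set_names returns a new index, so the input frame is left untouched.
    out.index = out.index.set_names([rename.get(n, n) for n in out.index.names])
    return out


raw = pd.read_csv(io.StringIO(CSV_TEXT), index_col="time", parse_dates=True)
mapped = apply_map_dims(raw, {"timesteps": "time"})
print(mapped.index.name)
```

With the renaming done in one separate step, `rows` and `columns` keep a single data type, and data that already uses the expected names never goes through the mapping.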

irm-codebase commented 2 days ago

I agree @brynpickering

Our default should be to assume the data is provided in the correct format. But adding a bit of extra flexibility should avoid extra code in certain cases.

I think your mapping approach is also better than my proposal, so no issues there!