IAMconsortium / pyam

Analysis & visualization of energy & climate scenarios
https://pyam-iamc.readthedocs.io/
Apache License 2.0

Integrated handling of categorical vs. datetime time formats #596

Closed danielhuppmann closed 2 years ago

danielhuppmann commented 2 years ago

Background

The current implementation of pyam supports three different cases of temporal resolution:

  1. yearly data (or longer timeframes, e.g., decadal intervals), indicated as df.time_col = "year"
  2. yearly data with categorical subannual resolution, indicated as df.time_col = "year" and an extra column "subannual". Values in that column can be "summer", "winter", or an abbreviated datetime format like mm-dd hh:mmz (see here)
  3. subannual resolution in a datetime format, df.time_col = "time"

It is possible to mix use cases 1 and 2 (using a subannual column with the value "year"), but pyam currently does not support mixing use cases 1/2 and 3 within one IamDataFrame.

This is because the long-format pd.Series _data (which holds the timeseries values) has either an index dimension "year" (as integer) or "time" (as datetime), never both.
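
For illustration, a minimal sketch of the three cases (standard IAMC long format as accepted by the IamDataFrame constructor; model/scenario names and values are made up):

```python
# Minimal sketch of the three temporal-resolution cases in pyam.
import datetime

import pandas as pd
import pyam

IAMC_COLS = ["model", "scenario", "region", "variable", "unit"]

# Use case 1: yearly data -> df.time_col == "year"
yearly = pyam.IamDataFrame(pd.DataFrame(
    [["model_a", "scen_a", "World", "Price|Carbon", "USD/tCO2", 2020, 30.0]],
    columns=IAMC_COLS + ["year", "value"],
))

# Use case 2: yearly data plus a categorical "subannual" column
seasonal = pyam.IamDataFrame(pd.DataFrame(
    [["model_a", "scen_a", "World", "Electricity", "GWh", "summer", 2020, 5.0],
     ["model_a", "scen_a", "World", "Electricity", "GWh", "winter", 2020, 7.0]],
    columns=IAMC_COLS + ["subannual", "year", "value"],
))

# Use case 3: continuous subannual resolution -> df.time_col == "time"
hourly = pyam.IamDataFrame(pd.DataFrame(
    [["model_a", "scen_a", "World", "Electricity", "GWh",
      datetime.datetime(2020, 1, 1, 0), 1.2]],
    columns=IAMC_COLS + ["time", "value"],
))

print(yearly.time_col, seasonal.time_col, hourly.time_col)  # year year time
# Combining use cases 1/2 with 3 in a single IamDataFrame is exactly what is
# not supported today, e.g. pyam.concat([yearly, hourly]) fails.
```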

Way forward

Applications and use cases are emerging where an IamDataFrame should support all of the above at once, for example hourly data (power generation) together with yearly values (carbon prices, installed capacity).

  1. ~Split out the _data into two (or three?) distinct objects, one per use case (and refactor all methods that use it)~
  2. ~Adapt the _data object by having two index dimensions ("year" and "time"), where for each row, only exactly one of year or time is not null~
  3. Adapt the _data object so that the type of each entry in the "time" column specifies the kind of value (if integer, the value is a year; if datetime, it is subannual continuous resolution)
  4. Like 3, but with a separate index dimension (a column in the dataframe sense) that specifies the type, which might be more reliable or perform better than type-checking by row (see the sketch after this list)
  5. Force all entries in the time column to follow a datetime format. Caveats:
    • Would probably not support data in a subannual non-continuous-time-format (e.g., "winter-night")
    • Difficult to know if a value for "2020-01-01 00:00" is intended for the first hour of a year or the full year
  6. Have a wrapper class to hold two IamDataFrames, one for each type (added per suggestion by @znicholls below)
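
To make options 3 and 4 more concrete, here is a purely hypothetical sketch of what the _data index could look like - none of this is current pyam API, and the "time_domain" level is an invented name:

```python
# Hypothetical sketch (not pyam API): the "time" index level holds either an
# int (year) or a datetime; option 4 adds an explicit "time_domain" level.
import datetime

import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("model_a", "scen_a", "World", "Price|Carbon", "USD/tCO2",
         2020, "year"),
        ("model_a", "scen_a", "World", "Electricity", "GWh",
         datetime.datetime(2020, 1, 1, 0), "datetime"),
    ],
    names=["model", "scenario", "region", "variable", "unit",
           "time", "time_domain"],
)
_data = pd.Series([30.0, 1.2], index=index, name="value")

# Option 3: infer the kind of value per row (type-checking-by-row)
is_yearly = [isinstance(t, int) for t in _data.index.get_level_values("time")]

# Option 4: use the explicit "time_domain" level instead
yearly_values = _data.xs("year", level="time_domain")
```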

Related questions

This issue is (for the time being) only intended for the internal handling; related (subsequent) discussions will have to cover the i/o aspects (reading from & writing to csv/xlsx, how to display the output of the timeseries() method)...

danielhuppmann commented 2 years ago

@gidden @znicholls @Rlamboll, any thoughts? FYI @phackstock @meksor

Rlamboll commented 2 years ago

I don't honestly see a good reason to ever allow text-denoted seasons/times of day in use case 2; if you want to do seasonal things, use a seasonal datetime. That way all plots will look nice and we won't have to use two different columns for sorting things (sorting by a combination of int-year and text would be horrible!). I can see that it may be useful to offer something that helps convert season info or a two-cell date/time specification to an internal datetime, though.
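
For example, such a conversion helper could be as simple as the following sketch (the anchor date per season is an arbitrary assumption, not anything pyam provides):

```python
# Map categorical season labels onto representative datetimes so that the
# values sort and plot like proper times.
import pandas as pd

SEASON_TO_MONTH_DAY = {"winter": (1, 15), "summer": (7, 15)}  # assumed anchors

def season_to_datetime(year: int, season: str) -> pd.Timestamp:
    """Return a representative timestamp for a (year, season) pair."""
    month, day = SEASON_TO_MONTH_DAY[season]
    return pd.Timestamp(year=year, month=month, day=day)

print(season_to_datetime(2020, "summer"))  # 2020-07-15 00:00:00
```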

Looking at your suggestions, I strongly dislike 1 and 2, as they will essentially break everything and be very painful to code around. Having inconsistent types in a column is also generally awkward and presumably hurts if you want to do data compression. I'd therefore tend towards 5 for cases where any of the input time data is not an integer. I'd have a preference for still allowing ints where only years are used, which is still the overwhelming use case I see.

Resolving your remaining problem with this ("Difficult to know if a value for "2020-01-01 00:00" is intended for the first hour of a year or the full year"), maybe we can put something in the metadata to define a variable as either constant, continuously interpolated or yearly interpolated. Let's say my df has yearly carbon price info and hourly power info, and I put the carbon price at the datetime at the start of the year. In the majority of cases (e.g. when plotting data series separately), I won't actually need to compare them, so I don't want to waste time specifying that the value applies to the whole year. But if I want to convert it all to a timeseries, I can specify that a variable in the dataframe should be understood as using the most recent value before the required time, or as interpolating between the neighbouring values. This would have to be consistent between models/scenarios. It would be a generally useful feature, and I don't see that any of your other suggestions have a way to solve this timeseries interpolation problem, other than assuming we always want one or the other interpolation style or throwing away missing timepoints.
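
A pandas-only sketch of that idea; the metadata dict, variable names and the align() helper are purely illustrative, not existing pyam functionality:

```python
# Per-variable metadata decides how a coarse series is aligned to a finer
# time axis: "constant" = most recent value, "linear" = interpolate.
import pandas as pd

INTERPOLATION_STYLE = {"Price|Carbon": "constant", "Capacity|Wind": "linear"}

yearly = pd.Series(
    [30.0, 40.0],
    index=pd.to_datetime(["2020-01-01", "2021-01-01"]),
)
hourly_index = pd.date_range("2020-01-01", "2021-01-01", freq="h")

def align(series: pd.Series, target: pd.DatetimeIndex, style: str) -> pd.Series:
    """Expand a coarse series onto a finer datetime index."""
    union = series.reindex(series.index.union(target))
    if style == "constant":
        # "most recent value before the required time"
        return union.ffill().reindex(target)
    # "interpolate between the neighbouring values"
    return union.interpolate(method="time").reindex(target)

carbon_price_hourly = align(yearly, hourly_index, INTERPOLATION_STYLE["Price|Carbon"])
```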

znicholls commented 2 years ago

A super nasty issue. @lewisjared and I have just been having a related conversation in scmdata (https://github.com/openscm/scmdata/pull/183).

In my experience, supporting all use cases is really, really difficult. If you can assume everything is integer (i.e. years), then many things are simpler. Similarly, if you know you're only dealing with datetimes (like scmdata), then you can make a certain set of assumptions. If you're having to support both at once, the complexity really increases (probably more than doubles).

IamDataFrame should support all use cases

I would push back on this. At the moment the number of users who need hourly and yearly data in the same IamDataFrame is probably small (especially as a percentage of the user base). Catering to them rather than the majority of users seems to be prioritising the wrong group.

My suggestion would be to create a subclass of IamDataFrame where you could explore supporting this use case without risking destroying the performance of everyone else/making all other additions suddenly have to wrap their head around multiple time options.

An alternative option would be to create a wrapper class, which does something like your 1) but by holding two IamDataFrames rather than refactoring the internals of IamDataFrame completely. That wrapper class could then apply filters in the right places, join timeseries, make plots etc. without forcing a complete rethink of our internals.
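
A very rough, hypothetical sketch of such a wrapper (none of these names exist in pyam or scmdata):

```python
# Hypothetical wrapper holding one yearly and one datetime-based frame and
# dispatching calls to both.
import pyam

class MixedTimeFrame:
    """Holds one IamDataFrame with yearly data and one with datetime data."""

    def __init__(self, df_year: pyam.IamDataFrame, df_time: pyam.IamDataFrame):
        self.df_year = df_year
        self.df_time = df_time

    def filter(self, **kwargs) -> "MixedTimeFrame":
        # Apply filters that make sense for both frames; time-specific
        # arguments would need dedicated handling.
        return MixedTimeFrame(
            self.df_year.filter(**kwargs), self.df_time.filter(**kwargs)
        )
```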

My thoughts on the options laid out:

Like @Rlamboll, 1) and 2) seem super painful to me and may also come with massive performance penalties. I think 3) and 4) will also have big performance issues (although pyam already has well-known performance limits, so maybe we don't care). 5) is the route we went with scmdata for our own sanity (it also makes interoperability with xarray super easy). The issue you raise is a tricky one and we haven't come up with a great solution. We thought about carrying around an extra column but haven't made that leap yet (the distinction we were thinking of making was between piecewise-linear and piecewise-constant data, although you can quickly disappear down a rabbit hole of thinking up cases where you need piecewise quadratic etc.). I agree with @Rlamboll, though, that this issue (if I just have a year, do I mean start of year, end of year, middle of year, average over the entire year?) applies to the current implementation and all the other proposals too, so I don't think it's much of a drawback.

danielhuppmann commented 2 years ago

Thanks @Rlamboll & @znicholls for your detailed thoughts. I added @znicholls' idea of a wrapper class to the list in the issue description. Point well taken that suggestions 1 & 2 are off the table for compatibility with current work.

Re @Rlamboll's specific points:

I don't honestly see a good reason to ever allow text-denoted seasons/times of day in use-case 2

Having "representative timeslices" like ['summer-day', 'winter-day', 'summer-night', 'winter-night'] is a very common use case in energy systems models. And these slices often have different lengths. AFAIK, these cannot be represented by seasonal datetime.

... offer something that helps convert season info or two-cell date/time specification to an internal datetime though.

We already have methods swap_time_for_year() and swap_year_for_time() to easily convert between use cases 2 and 3.
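
Minimal usage sketch, reusing the hourly frame from the example in the issue description (exact keyword arguments may differ between pyam versions):

```python
# hourly has time_col == "time"; move the sub-year information into a
# "subannual" column (use case 3 -> 2), then convert back (2 -> 3).
with_subannual = hourly.swap_time_for_year(subannual=True)
back_to_time = with_subannual.swap_year_for_time()
```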

... put something in the metadata to define a variable as either constant, continuously interpolated or yearly interpolated

I'm not sure how this would work in terms of an intuitive user experience and data i/o; I will think about it.

Re @znicholls' specific points:

In my experience, supporting all use cases is really, really difficult.

True, but we see that as an emerging use case in several projects - and the entire Scenario Explorer infrastructure (in particular scenario submission and validation) currently builds on the assumption that uploads are pyam-parseable files. So either we adapt pyam or the database infrastructure...

znicholls commented 2 years ago

True, but we see that as an emerging use case in several projects - and the entire Scenario Explorer infrastructure (in particular scenario submission and validation) currently builds on the assumption that uploads are pyam-parseable files. So either we adapt pyam or the database infrastructure

If you have to do it, you have to do it :)

Rlamboll commented 2 years ago

Having "representative timeslices" like ['summer-day', 'winter-day', 'summer-night', 'winter-night'] is a very common use case in energy systems models. And these slices often have different lengths. AFAIK, these cannot be represented by seasonal datetime.

OK, I see what you mean by this, but literally none of these suggestions resolve this use-case, because the day/night division does not behave like a time: the slices are not orderable, and you could never plot them on the same line of a graph. I think the correct way to resolve this is to leave the day/night division either as a qualitative column or as part of the variable name, since the day and night values are qualitatively different. The seasonal info can be forced into the datetime as above, so you can plot the two sets of values and expect the correct ordering of the points.

danielhuppmann commented 2 years ago

Re @Rlamboll:

literally none of these suggestions resolve this use-case

The current implementation with an extra column "subannual" holding names of categorical timeslices works well enough for most use cases: data i/o, filtering, algebraic operations, aggregation, plotting (for a line plot, you'll get one line per timeslice, with years on the x-axis). See this tutorial.
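
As a rough sketch of that workflow (reusing the seasonal frame from the issue description; the plotting keyword follows the pyam convention of coloring by a data column, but may differ between versions):

```python
# Filter on the categorical timeslice column and draw one line per timeslice
# (years on the x-axis).
summer_only = seasonal.filter(subannual="summer")
seasonal.plot(color="subannual")
```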

Rlamboll commented 2 years ago

The current implementation with an extra column "subannual" holding names of categorical timeslices works well enough for most use cases: data i/o, filtering, algebraic operations, aggregation, plotting (for a line plot, you'll get one line per timeslice, with years on the x-axis). See this tutorial.

The current setup indeed works well enough if you don't need to treat categorical timeslices as times-proper, but "do nothing" wasn't an option in your list :)