Thanks for raising this issue @Rlamboll, but I'm not sure which part here you see as a bug and which as a bad (or unexpected) design choice - or if this is related to some not-yet-supported aspect of continuous-time data.
My understanding:

- `interpolate()` returns `nan` if the requested time is outside the range of a specific timeseries (doing something else would be not inter- but extrapolation)
- rows with `nan` are dropped when initializing a new `IamDataFrame` because pandas behaves strangely with `nan` in columns (examples provided by @znicholls, see the docs)
- but calling [timeseries()](https://pyam-iamc.readthedocs.io/en/latest/api/iamdataframe.html#pyam.IamDataFrame.timeseries) fills `nan` in any time-column where row data is missing (i.e. if at least one data point exists for that time index)

Can you specify or provide a short test case that illustrates what you would expect to see?
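A minimal sketch of the behaviour described above (an illustration added for clarity, not taken from the issue; it assumes the wide IAMC input format with year columns used later in this thread):

```python
import numpy as np
import pandas as pd
import pyam

# One timeseries is missing its 2010 value; the other is complete.
wide = pd.DataFrame(
    [
        ["model_a", "scen_a", "World", "Primary Energy", "EJ/yr", 1.0, np.nan],
        ["model_a", "scen_a", "World", "Emissions|CO2", "Mt CO2/yr", 10.0, 12.0],
    ],
    columns=pyam.IAMC_IDX + [2005, 2010],
)
df = pyam.IamDataFrame(wide)

# The long-format data drops the nan cell at initialization ...
print(df.data)

# ... while timeseries() re-inserts nan in the 2010 column for the incomplete row.
print(df.timeseries())
```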
Perhaps I wasn't clear: all these nans occur in additional columns. Interpolate returns a value correctly in the value column, but not in other columns (not a bug). However, I'd expect that initialising a dataframe only drops cases where important columns are nan, not where any column is nan. If it is going to do that, can we infill those columns with something other than nan? Here is an example:
```python
import pandas as pd
import pyam

tdb = pd.DataFrame(
    [
        ["_ma", "_sa", "World", "CO2", "Gt CO2", 0, 2, 3],
        ["_ma", "_sa", "World", "CH4", "Gt CH4", 0, 200, 300],
    ],
    columns=["model", "scenario", "region", "variable", "unit", "sub_annual"]
    + [2050, 2070],
)
df = pyam.IamDataFrame(tdb)
df.interpolate(2060)
df.data  # contains 2060 data; nan in the additional column looks innocuous
df.timeseries()  # no 2060 appears here
pyam.IamDataFrame(df.data).data  # the interpolated data has gone
```
Thanks for the clarification, indeed a bug because the `extra_cols` feature is not yet tested thoroughly. PR with a bugfix coming soon.
In terms of how this should work, though: should we expect that `extra_cols` is left out of `timeseries`? This means that df -> timeseries -> df loses info. Alternatively, all rows of a model/scenario/variable/region grouping have to have the same additional-column info (and will need to be initialised in this way by interpolate), otherwise they will result in separate rows of the timeseries, with nans shared between rows. Effectively we then have an unspecified (and infinitely extensible) primary key in timeseries format.
Good question, let me bounce it back to you: if there is a dataframe like
```python
TEST_DF = pd.DataFrame(
    [
        ['model_a', 'scen_a', 'World', 'Primary Energy', 'EJ/yr', 'foo', 2005, 1],
        ['model_a', 'scen_a', 'World', 'Primary Energy', 'EJ/yr', 'bar', 2010, 3],
    ],
    columns=pyam.IAMC_IDX + ['extra_col', 'year', 'value'],
)
```
i.e., all index values are the same except for `extra_col` - should there be any interpolation or not?
The bugfix in #352 assumes no (there will not be any interpolation, because the first and second lines have a different "index" and so are different timeseries).
If you think the answer is yes, how should `extra_col` in the new row be filled?
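For reference, a minimal sketch (illustrative only, not a test from #352) of what the assumed behaviour means for `TEST_DF` as defined above:

```python
df = pyam.IamDataFrame(TEST_DF)
df.interpolate(2007)  # called in-place, following the usage earlier in this thread

# The two rows differ in `extra_col`, so each is a one-point timeseries and there
# is nothing to interpolate between - no 2007 row is expected to appear.
assert 2007 not in df.data.year.values
```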
Yes, I would like to see interpolation, i.e. extra columns are ignored for this index. E.g. if I have a meta column with 'infilled' at one timepoint, I'd like to infill around it. I'm mostly treating these cols as extra info, not core function. OTOH, I have no strong preference for what to infill. The annoying answer would be to check if the two points around it have identical entries, copy them if so, and use a non-breaking empty value (0 or "" etc.) if not. But I also want this function to work quickly and currently don't have a strong use-case, so it depends on how much effort it takes to do that. Always using a non-breaking empty value is fine too.
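A hypothetical sketch (not pyam API, just the fill rule described in the previous comment) of how the extra-column entry for an interpolated point could be chosen:

```python
def fill_extra_col(before: str, after: str) -> str:
    """Return the extra-column entry for a newly interpolated point.

    Copy the neighbours' value if they agree, otherwise fall back to a
    non-breaking empty value (here an empty string).
    """
    return before if before == after else ""

# identical neighbours are copied, differing ones fall back to ""
assert fill_extra_col("infilled", "infilled") == "infilled"
assert fill_extra_col("foo", "bar") == ""
```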
Ok, I think we agree. Please review #352, where I just added another test based on the example above to ensure that timeseries with non-matching extra-column values are not interpolated.
OK, will there be an update later to make it fail that test?
No, that test shouldn't fail - that's the expected behaviour.
This confuses me because I just said that I did want interpolation in that case.
Ok, then we were talking past each other. Can you please specify which output you expect from `interpolate(2007)` on the following timeseries data:
| model | scenario | region | variable | unit | extra_col | 2005 | 2010 |
|---|---|---|---|---|---|---|---|
| model_a | scen_a | World | Primary Energy | EJ/yr | foo | 1 | np.nan |
| model_a | scen_a | World | Primary Energy | EJ/yr | bar | np.nan | 3 |
| model_a | scen_a | World | Primary Energy | EJ/yr | dummy | np.nan | 5 |
In my understanding, because there are no data points with the same index (including `extra_col`) before and after the time to be interpolated, there should not be any data added to this dataframe.
I didn't realise we were planning to allow that sort of data. I thought that (model x scenario x region x variable x time) was a unique key, hence this is an invalid df. Similar to the multi-unit question in #338, it's massively annoying to work out whether a model-scenario is "complete" if I need to ask whether an unknown number of extra columns are also populated in every combination. Can we combine data from extra_col = foo with data from extra_col = bar? Without knowing what those columns are, it seems impossible to know. If the extra column is just a label to say 'infilled', clearly it's silly to predicate anything on the basis of that, and I assumed we'd want to combine infilled and non-infilled data into the same row of a timeseries. If it somehow changes the meaning (like it's a unit working under a different conversion metric), we need to consider that. I believe we're trying to set up a zoom chat this week on nomenclature for additional columns.
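To make the unique-key concern concrete, a plain-pandas sketch (an illustration added for clarity, not a pyam call) of how rows that share the IAMC key but differ only in an extra column show up:

```python
import pandas as pd

# The three timeseries from the table above, in long format.
data = pd.DataFrame(
    [
        ["model_a", "scen_a", "World", "Primary Energy", "EJ/yr", "foo", 2005, 1.0],
        ["model_a", "scen_a", "World", "Primary Energy", "EJ/yr", "bar", 2010, 3.0],
        ["model_a", "scen_a", "World", "Primary Energy", "EJ/yr", "dummy", 2010, 5.0],
    ],
    columns=["model", "scenario", "region", "variable", "unit", "extra_col", "year", "value"],
)

# If (model, scenario, region, variable, year) were a unique key, this selection
# would be empty; here the 2010 rows collide and differ only in `extra_col`.
key = ["model", "scenario", "region", "variable", "year"]
print(data[data.duplicated(subset=key, keep=False)])
```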
At least we got to the bottom of the misunderstanding...
Yes, this use case was very much the driver for the `extra_cols` feature - think of having a variable `Temperature|Global Mean` and then having an additional column `climate_model` to specify which climate model generated that timeseries, rather than what we did previously, which was having something like `Temperature|Global Mean|<Climate Model>`.
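A small sketch (hypothetical data, added as an illustration) of that convention with an extra `climate_model` column:

```python
import pandas as pd
import pyam

ts = pd.DataFrame(
    [["model_a", "scen_a", "World", "Temperature|Global Mean", "K", "FaIR", 2005, 1.1]],
    columns=pyam.IAMC_IDX + ["climate_model", "year", "value"],
)
df = pyam.IamDataFrame(ts)

# The climate model travels as its own column in the long-format data instead of
# being encoded in the variable string as "Temperature|Global Mean|FaIR".
print(df.data)
```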
What you seem to have in mind is more akin to #287 - tracking which operations modified specific subsections of the data. This should, in my opinion, not be part of the `data` dataframe.
I suggest the following way forward:

- `extra_cols`
The merge fixes the most acute of the problems, so go ahead with that. This issue doesn't currently exist in emissions data since these columns are unused. I think the analogy with how things work there would be `Temperature (FaIR model)|Global Mean`, since different simulations are effectively different measurement metrics.
closed via #352
When a dataframe has additional data created, e.g. by interpolate, additional columns are often given np.nan as the value, rather than, say, an empty string or a 0.0. This is presumably a choice rather than a bug. However, when there are nans in these columns, they will not appear in timeseries, and will also disappear if the dataframe is recalculated by `pyam.IamDataFrame(database_with_nans.data)`. This seems like a bug.