IAMconsortium / pyam

Analysis & visualization of energy & climate scenarios
https://pyam-iamc.readthedocs.io/
Apache License 2.0

Nans in additional columns #351

Closed Rlamboll closed 4 years ago

Rlamboll commented 4 years ago

When additional data is created for a dataframe, e.g. by interpolate, the additional columns are often given np.nan as the value rather than, say, an empty string or 0.0. This is presumably a choice rather than a bug. However, when there are nans in these columns, the affected rows will not appear in timeseries, and they will also disappear if the dataframe is rebuilt via pyam.IamDataFrame(database_with_nans.data). This seems like a bug.
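
For context, here is a minimal pandas-only sketch of why such rows can vanish, assuming (this is an assumption, not a statement about pyam internals) that the long-format data is processed with pandas operations such as groupby/pivot, which drop NaN index keys by default:

import numpy as np
import pandas as pd

data = pd.DataFrame(
    {
        "variable": ["Primary Energy", "Primary Energy"],
        "extra_col": ["foo", np.nan],  # nan as introduced e.g. by interpolate
        "year": [2005, 2010],
        "value": [1.0, 3.0],
    }
)

# With pandas' default dropna=True, the row whose key contains NaN silently disappears
print(data.groupby(["variable", "extra_col", "year"]).sum())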

danielhuppmann commented 4 years ago

Thanks for raising this issue @Rlamboll, but I'm not sure which part here you see as a bug and which as a bad (or unexpected) design choice - or whether this is related to some not-yet-supported aspect of continuous-time data.

My understanding:

Can you specify or provide a short test case that illustrates what you would expect to see?

Rlamboll commented 4 years ago

Perhaps I wasn't clear: all these nans occur in additional columns. Interpolate correctly returns a value in the value column, but not in other columns (not a bug). However, I'd expect that initialising a dataframe only drops cases where important columns are nan, not where any column is nan. If it is going to do that, can we infill those columns with something other than nan? Here is an example:

import pandas as pd
import pyam

tdb = pd.DataFrame(
    [
        ["_ma", "_sa", "World", "CO2", "Gt CO2", 0, 2, 3],
        ["_ma", "_sa", "World", "CH4", "Gt CH4", 0, 200, 300],
    ],
    columns=["model", "scenario", "region", "variable", "unit", "sub_annual"] + [2050, 2070],
)
df = pyam.IamDataFrame(tdb)
df.interpolate(2060)
df.data                          # contains 2060 data, nan in additional column looks innocuous
df.timeseries()                  # no 2060 appears here
pyam.IamDataFrame(df.data).data  # interpolated data has gone

danielhuppmann commented 4 years ago

Thanks for the clarification - indeed a bug, because the extra_cols feature is not yet thoroughly tested. PR with a bugfix coming soon.

Rlamboll commented 4 years ago

In terms of how this should work, though: should we expect extra_cols to be left out of timeseries? That means df -> timeseries -> df loses information. Alternatively, all rows of a model/scenario/variable/region grouping have to have the same additional-column info (and will need to be initialised in this way by interpolate); otherwise they will result in separate rows in the timeseries, with nans shared between rows. Effectively we then have an unspecified (and infinitely extensible) primary key in timeseries format.
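
For illustration, a plain-pandas sketch of the "separate rows with nans shared between rows" outcome described above (the column names follow the IAMC convention; the values are made up):

import pandas as pd

long_data = pd.DataFrame(
    [
        ["model_a", "scen_a", "World", "Primary Energy", "EJ/yr", "foo", 2005, 1.0],
        ["model_a", "scen_a", "World", "Primary Energy", "EJ/yr", "bar", 2010, 3.0],
    ],
    columns=["model", "scenario", "region", "variable", "unit", "extra_col", "year", "value"],
)

# Pivoting to wide format with extra_col as part of the index yields two rows,
# each with NaN in the year where only the other row has data.
wide = long_data.pivot_table(
    index=["model", "scenario", "region", "variable", "unit", "extra_col"],
    columns="year",
    values="value",
)
print(wide)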

danielhuppmann commented 4 years ago

good question, let me bounce it back to you:

if there is a dataframe like

import pandas as pd
import pyam

TEST_DF = pd.DataFrame(
    [
        ['model_a', 'scen_a', 'World', 'Primary Energy', 'EJ/yr', 'foo', 2005, 1],
        ['model_a', 'scen_a', 'World', 'Primary Energy', 'EJ/yr', 'bar', 2010, 3],
    ],
    columns=pyam.IAMC_IDX + ['extra_col', 'year', 'value'],
)

i.e., all index values are the same except for extra_col - should there be any interpolation or not?

The bugfix in #352 assumes no (there will not be any interpolation, because the first and second lines have a different "index" and are therefore different timeseries).

If you think the answer should be yes, how should extra_col in the new row be filled?

Rlamboll commented 4 years ago

Yes, I would like to see interpolation, i.e. extra columns are ignored for this index. E.g. if I have a meta column with 'infilled' at one timepoint, I'd like to infill around it. I'm mostly treating these cols as extra info, not core function. OTOH, I have no strong preference for what to infill. The annoying answer would be to check whether the two points around it have identical entries, copy them if so, and use a non-breaking empty value (0 or "" etc.) if not. But I also want this function to work quickly and currently don't have a strong use-case, so it depends on how much effort that takes. Always using a non-breaking empty value is fine too.
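
A purely hypothetical sketch of the rule suggested above (fill_extra_col is not a pyam function, just an illustration of "copy the neighbours' value if they agree, otherwise use a non-breaking empty value"):

def fill_extra_col(before, after, empty=""):
    """Hypothetical extra-column value for an interpolated point."""
    # Copy the neighbouring value only if both neighbours agree.
    return before if before == after else empty

fill_extra_col("infilled", "infilled")  # -> 'infilled'
fill_extra_col("infilled", "original")  # -> ''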

danielhuppmann commented 4 years ago

Ok, I think we agree. Please review #352, where I just added another test based on the example above to ensure that timeseries with non-matching extra-column values are not interpolated.

Rlamboll commented 4 years ago

OK, will there be an update later to make it fail that test?

danielhuppmann commented 4 years ago

> OK, will there be an update later to make it fail that test?

No, that test shouldn't fail - that's the expected behaviour.

Rlamboll commented 4 years ago

This confuses me because I just said that I did want interpolation in that case.

danielhuppmann commented 4 years ago

Ok, then we were talking past each other. Can you please specify which output you expect from interpolate(2007) on the following timeseries data:

model    scenario  region  variable        unit   extra_col  2005    2010
model_a  scen_a    World   Primary Energy  EJ/yr  foo        1       np.nan
model_a  scen_a    World   Primary Energy  EJ/yr  bar        np.nan  3
model_a  scen_a    World   Primary Energy  EJ/yr  dummy      np.nan  5

In my understanding, because there are no data points with the same index (including extra_col) before and after the time to be interpolated, there should not be any data added to this dataframe.
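
A sketch of that expectation, reusing the table above (this assumes the wide-format constructor accepts and drops the nan entries, and that interpolate modifies the object in place, as in the earlier example):

import numpy as np
import pandas as pd
import pyam

data = pd.DataFrame(
    [
        ["model_a", "scen_a", "World", "Primary Energy", "EJ/yr", "foo", 1, np.nan],
        ["model_a", "scen_a", "World", "Primary Energy", "EJ/yr", "bar", np.nan, 3],
        ["model_a", "scen_a", "World", "Primary Energy", "EJ/yr", "dummy", np.nan, 5],
    ],
    columns=pyam.IAMC_IDX + ["extra_col", 2005, 2010],
)
df = pyam.IamDataFrame(data)
df.interpolate(2007)  # no timeseries has values both before and after 2007
df.timeseries()       # so no 2007 column should appear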

Rlamboll commented 4 years ago

I didn't realise we were planning to allow that sort of data. I thought that (model x scenario x region x variable x time) was a unique key, hence this is an invalid df. Similar to the multi-unit question in #338, it's massively annoying to work out whether a model-scenario is "complete" if I need to ask whether an unknown number of extra columns are also populated in every combination. Can we combine data from extra_col = foo with data from extra_col = bar? Without knowing what those columns are, it seems impossible to know. If the extra column is just a label to say 'infilled', it's clearly silly to predicate anything on that, and I assumed we'd want to combine infilled and non-infilled data into the same row of a timeseries. If it somehow changes the meaning (like it's a unit working under a different conversion metric), we need to consider that. I believe we're trying to set up a zoom chat this week on nomenclature for additional columns.

danielhuppmann commented 4 years ago

At least we got to the bottom of the misunderstanding...

Yes, this use case was very much the driver for the extra_cols feature - think of having a variable Temperature|Global Mean and an additional column climate_model to specify which climate model generated that timeseries, rather than what we did previously, which was something like Temperature|Global Mean|<Climate Model>.
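
A hedged sketch of that intended use (the climate-model names and numbers are made up for illustration):

import pandas as pd
import pyam

data = pd.DataFrame(
    [
        ["model_a", "scen_a", "World", "Temperature|Global Mean", "K", "MAGICC7", 1.2, 1.5],
        ["model_a", "scen_a", "World", "Temperature|Global Mean", "K", "FaIR", 1.1, 1.4],
    ],
    columns=pyam.IAMC_IDX + ["climate_model", 2030, 2050],
)
df = pyam.IamDataFrame(data)
df.timeseries()  # 'climate_model' becomes an additional index level of the timeseries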

What you seem to have in mind is more akin to #287 - tracking which operations modified specific subsections of the data. This should, in my opinion, not be part of the data dataframe.

I suggest the following way forward:

Rlamboll commented 4 years ago

The merge fixes the most acute of the problems, so go ahead with that. This issue doesn't currently exist in emissions data since these columns are unused. I think the analogy with how things work there would be Temperature (FaIR model)|Global Mean, since different simulations are effectively different measurement metrics.

danielhuppmann commented 4 years ago

closed via #352