decide if/how to fill in missing columns in constructor

gidden commented 5 years ago

During PR #199, we had a use case that became unsupported in the final implementation: filling in "missing" values in expected columns.

For example, a dataframe looking like

  scenario  year    Population           GDP  Urbanization
0     SSP1  2010  6.868687e+09  7.641454e+13      0.516281
1     SSP1  2015  7.210848e+09  9.249094e+13      0.546193
2     SSP1  2020  7.517782e+09  1.144206e+14      0.584815
3     SSP1  2025  7.782887e+09  1.409554e+14      0.621583
4     SSP1  2030  7.999304e+09  1.725584e+14      0.656344

at the moment raises an error when passed to the constructor:

y = pyam.IamDataFrame(df, value=['Population', 'GDP', 'Urbanization'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-d2c631ea80be> in <module>()
----> 1 y = pyam.IamDataFrame(df, value=['Population', 'GDP', 'Urbanization'])

~/.local/lib/python3.5/site-packages/pyam_iamc-0.1.2+44.g381c4f6-py3.5.egg/pyam/core.py in __init__(self, data, **kwargs)
     69         # import data from pd.DataFrame or read from source
     70         if isinstance(data, pd.DataFrame) or isinstance(data, pd.Series):
---> 71             _data = format_data(data.copy(), **kwargs)
     72         elif has_ix and isinstance(data, ixmp.TimeSeries):
     73             _data = read_ix(data, **kwargs)

~/.local/lib/python3.5/site-packages/pyam_iamc-0.1.2+44.g381c4f6-py3.5.egg/pyam/utils.py in format_data(df, **kwargs)
    188     if not set(IAMC_IDX).issubset(set(df.columns)):
    189         missing = list(set(IAMC_IDX) - set(df.columns))
--> 190         raise ValueError("missing required columns `{}`!".format(missing))
    191 
    192     # check whether data in wide format (IAMC) or long format (`value` column)

ValueError: missing required columns `['model', 'unit', 'region']`!

At one point in the PR, default values were filled in for these three columns (using just their column names) for ease of use. In many cases, I find that I don't actually care what these values are; I just want the mountain of other nice pyam utilities to work with my data.

So the question is: should we force users to fill in these, e.g.,

y = pyam.IamDataFrame(df, value=['Population', 'GDP', 'Urbanization'], model='foo', region='bar', unit='baz')

or should we do that for them with column names or some other value?
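To make the two options concrete, here is a minimal pandas-only sketch. It is not pyam's implementation: `fill_missing_columns`, `REQUIRED` and the `use_column_names` flag are hypothetical names made up for illustration. Explicit keyword arguments win, the column name is used as a fallback only if requested, and otherwise the error from the traceback above is raised.

```python
import pandas as pd

# Assumption: the three columns named in the error message above.
REQUIRED = ["model", "region", "unit"]


def fill_missing_columns(df, use_column_names=False, **defaults):
    """Return a copy of `df` with every column in REQUIRED present.

    Hypothetical helper for illustration only, not part of pyam.
    """
    df = df.copy()
    for col in REQUIRED:
        if col in df.columns:
            continue  # the user already provided this column
        if col in defaults:
            df[col] = defaults[col]  # option 1: user passes the value
        elif use_column_names:
            df[col] = col  # option 2: fall back to the column name
        else:
            raise ValueError("missing required columns `{}`!".format([col]))
    return df


df = pd.DataFrame({
    "scenario": ["SSP1", "SSP1"],
    "year": [2010, 2015],
    "Population": [6.868687e+09, 7.210848e+09],
})

explicit = fill_missing_columns(df, model="foo", region="bar", unit="baz")
implicit = fill_missing_columns(df, use_column_names=True)
```

With the fallback, every row ends up with `model == "model"`, which is easy to spot later but could be mistaken for real metadata; the explicit variant avoids that at the cost of more typing.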

gidden commented 5 years ago

cc @danielhuppmann @znicholls

znicholls commented 5 years ago

Tricky one, I'm not sure. I've tried auto-filling with None in OpenSCM and it hasn't been happy, so that solution, whilst ideal, might be a bit hairy to make behave (pandas can be temperamental with None and nan values). Plan 'b' of filling with the column name seems like an OK fallback, with plan 'c' just being to force users to fill these in.

danielhuppmann commented 5 years ago

I agree that all required columns other than variable can default to None (not sure how I feel about variable=None).

Need to check whether the "check for duplicates" part at the end of format_data() continues to work as expected.

danielhuppmann commented 5 years ago

Update following comment by @znicholls:

If pandas behaves weirdly with None in columns, forcing users to provide names might be preferable.

danielhuppmann commented 5 years ago

One more thought about None in columns: how do we expect things to behave if we append an IamDataFrame with model=None to a “regular” frame? df.filter(model=None) will not work (I think) and will also conflict with the changes suggested in #207.
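A rough illustration of the concern, using plain pandas rather than pyam (so it says nothing about how `IamDataFrame.append` or `filter` actually behave): once a frame with `model=None` is combined with a regular one, pandas treats the None entries as missing values, and a default `groupby` silently drops those rows.

```python
import pandas as pd

regular = pd.DataFrame(
    {"model": ["foo"], "scenario": ["SSP1"], "year": [2010], "value": [1.0]}
)
unnamed = pd.DataFrame(
    {"model": [None], "scenario": ["SSP1"], "year": [2010], "value": [2.0]}
)

# Plain-pandas stand-in for appending one frame to the other.
combined = pd.concat([regular, unnamed], ignore_index=True)

# None is treated as a missing value, so it has to be selected via isna().
missing = combined["model"].isna()

# With default settings, rows whose group key is missing are left out.
by_model = combined.groupby("model")["value"].sum()
```

Here `by_model` only contains the `foo` group, so the data that came in with `model=None` disappears from any aggregation unless it is handled explicitly.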

znicholls commented 5 years ago

hmmm ok so maybe None is a bad idea. nan could work but it also creates plenty of havoc with pandas (and wouldn't work with the current drop_duplicate call in format_data).

danielhuppmann commented 4 years ago

This issue has been resolved in the sense that the constructor now takes keyword arguments with a default value for columns that are not in the input dataframe as suggested above:

y = pyam.IamDataFrame(df, value=['Population', 'GDP', 'Urbanization'], model='foo', region='bar', unit='baz')

IAMconsortium / pyam #208