IAMconsortium / pyam

Analysis & visualization of energy & climate scenarios
https://pyam-iamc.readthedocs.io/
Apache License 2.0
234 stars 120 forks source link

Check output consistency of `.validate` and similar #428

Closed byersiiasa closed 4 years ago

byersiiasa commented 4 years ago

Came across an issue with the newest version where an output of .validate(), which I think was previously a pd.DataFrame is now a pd.Series. Not a problem in itself - but scripts that then take this output and perform operations that only work with DataFrame will throw an error. - So backward compatibility is reduced.

In our case it was .drop(..., axis=1) (i.e. drop columns), which is not possible on pd.Series

Error given is: ValueError: No axis named 1 for object type <class 'pandas.core.series.Series'>

danielhuppmann commented 4 years ago

Thanks @byersiiasa for identifying this issue!

Let me flip this back to you (also pinging @kvanderwijst @gidden @jkikstra): what would be the "natural" return type of this function?

For the aggregation features, we recently switched all returned objects to an IamDataFrame - would that make sense here as well? Also considering that there is now a info() function (see #427) which is shown by default in Jupyter notebooks...

gidden commented 4 years ago

If the return type is always guaranteed to be 1-dimensional, then I would say pd.Series in order to stay in the pandas-verse. Assumptions here based on the docstring:

Returns all scenarios which do not match the criteria and prints a log message, or returns None if all scenarios match the criteria.

danielhuppmann commented 4 years ago

If staying in the pandas-verse is preferred, I'd rather revert the behavior to a DataFrame rather than a Series. From my own experience, I usually use this in a notebook context where the DataFrame is quicker to parse (visually) than a Series with a complex MultiIndex...

byersiiasa commented 4 years ago

I would probably favour either type DF over Series - Series almost always give me headaches

jkikstra commented 4 years ago

I also prefer pandas.dataframe. I find myself using .reset_index() in my workflows basically everytime I "happen to find" a MultiIndex coming from a pyam function.

danielhuppmann commented 4 years ago

Thanks @jkikstra - good that I implemented it that way yesterday evening... 😜 Added you as a reviewer.

Please create a new issue for any other functions or methods where a non-intuitive or impractical type is returned.