IAMconsortium / pyam

Analysis & visualization of energy & climate scenarios
https://pyam-iamc.readthedocs.io/
Apache License 2.0

Helpful accessors confounded by pandas regression #762

Closed znicholls closed 1 year ago

znicholls commented 1 year ago

The test below fails and I can't see why. This is the underlying cause of https://github.com/iiasa/climate-assessment/pull/36 I think.

In short: if you subset an IamDataFrame and then create new instances within some loop (probably a bad pattern, but let's ignore that for now), the metadata is still based on the original data rather than the subset. In the example below, this means that it looks like a scenario provides a variable when it actually doesn't.

@phackstock cc @danielhuppmann in case it helps with your search for a cause and making sure the bug doesn't come back.

import pandas as pd
from pyam import IamDataFrame


def test_climate_assessment_bug():
    test_df = pd.DataFrame(
        [
            ["model_a", "scen_a", "World", "Primary Energy", "EJ/yr", 1, 6.0],
            ["model_a", "scen_a", "World", "Primary Energy|Coal", "EJ/yr", 0.5, 3],
            ["model_a", "scen_b", "World", "Primary Energy", "EJ/yr", 2, 7],
        ],
        columns=["model", "scenario", "region", "variable", "unit", 2020, 2030],
    )
    test_df = IamDataFrame(test_df)
    for (model, scen), df_scen in test_df.timeseries().groupby(["model", "scenario"]):
        if not (model == "model_a" and scen == "scen_b"):
            continue

        # Primary Energy|Coal isn't part of the dataset from which `df_scen_pyam` is initialised
        assert "Primary Energy|Coal" not in df_scen.index.get_level_values("variable").unique()

        df_scen_pyam = IamDataFrame(df_scen)
        # Yet Primary Energy|Coal appears in the model_a, scen_b subset's `.variable` attribute
        assert "Primary Energy|Coal" not in df_scen_pyam.variable

        # I think this is a Pandas bug

danielhuppmann commented 1 year ago

Oh my god, this is such a great find!

So... the issue is that when you do a pandas groupby, pandas does not actually remove unused levels from the MultiIndex of each group, here df_scen.

If you continue with the example above:

df_scen.index.levels[3]
> Index(['Primary Energy', 'Primary Energy|Coal'], dtype='object', name='variable')

which is exactly what the pyam-accessors use.

It's not really a bug in pandas, I guess, more a performance-optimizing feature - only drop unused levels if necessary.
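A minimal sketch of this behavior in plain pandas, without pyam. It assumes a frame laid out like the output of `timeseries()` and reuses the data from the test above; the names `data`, `subset` and `cleaned` are made up for illustration. `MultiIndex.remove_unused_levels()` is the pandas method that drops the stale entries explicitly.

```python
import pandas as pd

names = ["model", "scenario", "region", "variable", "unit"]
index = pd.MultiIndex.from_tuples(
    [
        ("model_a", "scen_a", "World", "Primary Energy", "EJ/yr"),
        ("model_a", "scen_a", "World", "Primary Energy|Coal", "EJ/yr"),
        ("model_a", "scen_b", "World", "Primary Energy", "EJ/yr"),
    ],
    names=names,
)
data = pd.DataFrame({2020: [1.0, 0.5, 2.0], 2030: [6.0, 3.0, 7.0]}, index=index)

# Iterate over the groups as in the test and pick the (model_a, scen_b) subset
groups = {key: group for key, group in data.groupby(["model", "scenario"])}
subset = groups[("model_a", "scen_b")]

variable_pos = names.index("variable")

# The level still lists every variable of the original frame ...
stale_levels = list(subset.index.levels[variable_pos])
# ... while the values actually present in the subset are fewer
present = list(subset.index.get_level_values("variable").unique())

# Dropping unused levels explicitly brings the two back in sync
cleaned = subset.copy()
cleaned.index = subset.index.remove_unused_levels()
clean_levels = list(cleaned.index.levels[variable_pos])

print(stale_levels)
print(present)
print(clean_levels)
```

Anything that reads `index.levels` directly sees the stale list, whereas `get_level_values(...)` only reflects the rows that are really there.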

danielhuppmann commented 1 year ago

Also, this explains why #731 introduced this behavior: it removed the performance drag of resetting the index twice, and that index reset had probably been removing unused levels as an inadvertent side effect...
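To illustrate why an index reset would have this side effect, here is a rough sketch in plain pandas. It is not pyam's actual code; the frame and the names `data`, `subset` and `rebuilt` are assumptions for illustration only. Rebuilding the MultiIndex from columns only sees the values present in the rows, so the stale entries disappear.

```python
import pandas as pd

names = ["model", "scenario", "region", "variable", "unit"]
index = pd.MultiIndex.from_tuples(
    [
        ("model_a", "scen_a", "World", "Primary Energy", "EJ/yr"),
        ("model_a", "scen_a", "World", "Primary Energy|Coal", "EJ/yr"),
        ("model_a", "scen_b", "World", "Primary Energy", "EJ/yr"),
    ],
    names=names,
)
data = pd.DataFrame({2020: [1.0, 0.5, 2.0], 2030: [6.0, 3.0, 7.0]}, index=index)

# Selecting rows keeps the full set of levels on the index
subset = data[data.index.get_level_values("scenario") == "scen_b"]
variable_pos = names.index("variable")
before = list(subset.index.levels[variable_pos])

# Round trip: move the index into columns, then build a fresh MultiIndex
rebuilt = subset.reset_index().set_index(names)
after = list(rebuilt.index.levels[variable_pos])

print(before)
print(after)
```

The round trip costs a full rebuild of the index, which fits the performance motivation mentioned above.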

znicholls commented 1 year ago

Very nice explanation, will review #763 now