Confirm expected behaviour of aggregate with nested hierarchy

willu47 commented 1 year ago

Hi, following a question I asked in the openmod session today, could you please confirm the expected behaviour of the .aggregate function when levels are missing from the data hierarchy? For example, the following test fails because the two coal sub-categories, Primary Energy|Fossil|Coal|Lignite and Primary Energy|Fossil|Coal|Brown, are ignored.

Am I missing something?

import pandas as pd
from pyam import IamDataFrame, IAMC_IDX

LONG_IDX = IAMC_IDX + ["year"]

PRICE_NESTED_DF = pd.DataFrame(
    [
        ["model_a", "scen_a", "World", "Primary Energy|Fossil|Coal|Lignite", "EJ/yr", 2010, 10.0],
        ["model_a", "scen_a", "World", "Primary Energy|Fossil|Coal|Brown", "EJ/yr", 2010, 30.0],
        ["model_a", "scen_a", "World", "Primary Energy|Fossil|Gas", "EJ/yr", 2010, 45.0],
    ],
    columns=LONG_IDX + ["value"],
)

def test_nested_aggregate():
    actual = IamDataFrame(PRICE_NESTED_DF).aggregate(variable='Primary Energy|Fossil').data
    data = [
        ["model_a", "scen_a", "World", "Primary Energy|Fossil", "EJ/yr", 2010, 85.0]
    ]
    expected = pd.DataFrame(data, columns=LONG_IDX + ["value"])
    print(actual)
    print(expected)
    # assert_frame_equal raises on mismatch and returns None, so it must not be wrapped in `assert`
    pd.testing.assert_frame_equal(actual, expected)

willu47 commented 1 year ago

If I add the recursive=True argument to the aggregate() call, I get this:

     model scenario region                    variable   unit  year  value
0  model_a   scen_a  World       Primary Energy|Fossil  EJ/yr  2010   85.0
1  model_a   scen_a  World  Primary Energy|Fossil|Coal  EJ/yr  2010   40.0

willu47 commented 1 year ago

(
    IamDataFrame(PRICE_NESTED_DF)
    .aggregate(variable='Primary Energy|Fossil', recursive=True)
    .aggregate(variable='Primary Energy|Fossil')
)

returns

     model scenario region               variable   unit  year  value
0  model_a   scen_a  World  Primary Energy|Fossil  EJ/yr  2010   40.0

danielhuppmann commented 1 year ago

Thanks @willu47 - indeed, I'd say that this is behaving as expected.

If aggregate() is called without further arguments, it will use all variables that are directly below variable (equivalent to filter(variable=f"{variable}|*", level=0), see this utility method)
If components is given explicitly, these will be used instead
If recursive=True, pyam will work its way up the variable tree to variable and return only the computed data
If append=True, it will append the computed data to the object and return None. Maybe this is what you're looking for?
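
To make the rules above concrete, here is a minimal plain-Python sketch of the selection logic applied to the data from the original post. It is not pyam's implementation: the helper names (direct_children, aggregate, aggregate_recursive) are hypothetical, the data is reduced to a dict of variable to value for a single model, scenario, region and year, and the sketch assumes that already-reported intermediate values are kept rather than validated.

```python
# Illustrative sketch only; helper names are hypothetical and not part of pyam.
data = {
    "Primary Energy|Fossil|Coal|Lignite": 10.0,
    "Primary Energy|Fossil|Coal|Brown": 30.0,
    "Primary Energy|Fossil|Gas": 45.0,
}


def direct_children(values, variable):
    """Return the entries exactly one level below `variable`."""
    prefix = variable + "|"
    depth = variable.count("|") + 1
    return {
        name: value
        for name, value in values.items()
        if name.startswith(prefix) and name.count("|") == depth
    }


def aggregate(values, variable):
    """Sum only the direct children, as in the default case described above."""
    children = direct_children(values, variable)
    return sum(children.values()) if children else None


def aggregate_recursive(values, variable):
    """Work bottom-up to `variable` and return only the computed entries."""
    prefix = variable + "|"
    work = dict(values)
    computed = {}
    top = variable.count("|")
    deepest = max(
        (name.count("|") for name in work if name.startswith(prefix)),
        default=top,
    )
    for depth in range(deepest, top, -1):
        # Sum the variables at this depth by their parent variable.
        sums = {}
        for name, value in work.items():
            if name.startswith(prefix) and name.count("|") == depth:
                parent = name.rsplit("|", 1)[0]
                sums[parent] = sums.get(parent, 0.0) + value
        for parent, total in sums.items():
            if parent not in work:  # assumption: keep values that already exist
                work[parent] = total
                computed[parent] = total
    return computed


default_total = aggregate(data, "Primary Energy|Fossil")
computed = aggregate_recursive(data, "Primary Energy|Fossil")
chained_total = aggregate(computed, "Primary Energy|Fossil")
appended_total = aggregate({**data, **computed}, "Primary Energy|Fossil")

print(default_total)   # 45.0: only Gas sits directly below Fossil
print(computed)        # Coal = 40.0 and Fossil = 85.0
print(chained_total)   # 40.0: Gas is not part of the computed-only result
print(appended_total)  # 85.0: Coal (computed) + Gas (original)
```

In this sketch, the chained call from the earlier comment yields 40.0 because the recursive step returns only the computed rows, so Gas is no longer among the direct children; merging the computed rows back into the original data (the role append=True plays) gives 85.0.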

Question back to you: which other behavior would you find more intuitive? Or how could we improve the docs?

Sidenote:
df.aggregate("<variable>", append=True)
has the same behavior as
df.append(df.aggregate("<variable>"))
but the first option has better performance.

danielhuppmann commented 1 year ago

And FYI: pyam has a testing module with a function pyam.testing.assert_iamframe_equal, see the docs - this is maybe more appropriate for your use case because you don't have to worry about the order of the columns and rows (and it operates on an indexed pd.Series, so it's faster).

IAMconsortium / pyam

Confirm expected behaviour of aggregate with nested hierarchy #737