How to aggregate by level?

stefaneidelloth commented 2 years ago

Let's assume I have some data with four variable levels:

import pandas as pd
import pyam

df = pyam.IamDataFrame(pd.DataFrame([
    ['IMG', 'a_scen', 'World', 'Emissions|CO2|Energy|Oil', 'Mt CO2/yr', 2, 3.2, 2.0, 1.8],
    ['IMG', 'a_scen', 'World', 'Emissions|CO2|Energy|Gas', 'Mt CO2/yr', 1.3, 1.6, 1.0, 0.7],
    ['IMG', 'a_scen', 'World', 'Emissions|CO2|Energy|BECCS', 'Mt CO2/yr', 0.0, 0.4, -0.4, 0.3],
    ['IMG', 'a_scen', 'World', 'Emissions|CO2|Foo|Oil', 'Mt CO2/yr', 2, 3.2, 2.0, 1.8],
    ['IMG', 'a_scen', 'World', 'Emissions|CO2|Foo|Gas', 'Mt CO2/yr', 1.3, 1.6, 1.0, 0.7],
    ['IMG', 'a_scen', 'World', 'Emissions|CO2|Foo|BECCS', 'Mt CO2/yr', 0.0, 0.4, -0.4, 0.3],
    ['IMG', 'a_scen', 'World', 'Emissions|CO2|Cars', 'Mt CO2/yr', 1.6, 3.8, 3.0, 2.5],
    ['IMG', 'a_scen', 'World', 'Emissions|CO2|Tar', 'Mt CO2/yr', 0.3, 0.35, 0.35, 0.33],
    ['IMG', 'a_scen', 'World', 'Emissions|CO2|Agg', 'Mt CO2/yr', 0.5, -0.1, -0.5, -0.7],
    ['IMG', 'a_scen', 'World', 'Emissions|CO2|LUC', 'Mt CO2/yr', -0.3, -0.6, -1.2, -1.0]
    ],
    columns=['model', 'scenario', 'region', 'variable', 'unit', 2005, 2010, 2015, 2020],
))
df.timeseries()

How can I aggregate the last level (level=3)?

The result should then contain the following variables:

'Emissions|CO2|Energy|'
'Emissions|CO2|Foo|'
'Emissions|CO2|Cars'
'Emissions|CO2|Tar'
'Emissions|CO2|Agg'
'Emissions|CO2|LUC'

And if I aggregated to level 1 only,

'Emissions|CO2'

would remain.
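The naming part of this request can be sketched without pyam: truncate each pipe-separated variable name at the requested depth and keep the distinct results. This is only an illustration in plain Python; `truncate` and `target_names` are hypothetical helpers, not pyam functions, and the names are written without a trailing `|`.

```python
# Variable names from the example above (values omitted).
variables = [
    "Emissions|CO2|Energy|Oil",
    "Emissions|CO2|Energy|Gas",
    "Emissions|CO2|Energy|BECCS",
    "Emissions|CO2|Foo|Oil",
    "Emissions|CO2|Foo|Gas",
    "Emissions|CO2|Foo|BECCS",
    "Emissions|CO2|Cars",
    "Emissions|CO2|Tar",
    "Emissions|CO2|Agg",
    "Emissions|CO2|LUC",
]


def truncate(variable: str, level: int) -> str:
    """Keep the first `level + 1` components of a pipe-separated name."""
    return "|".join(variable.split("|")[: level + 1])


def target_names(names, level: int):
    """Return the distinct truncated names, preserving first-seen order."""
    return list(dict.fromkeys(truncate(v, level) for v in names))


level_2 = target_names(variables, 2)  # six names, from ...|Energy to ...|LUC
level_1 = target_names(variables, 1)  # only 'Emissions|CO2'
```

Names that are already at or above the requested level pass through unchanged, which is why 'Emissions|CO2|Cars' stays in the level-2 list.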

danielhuppmann commented 2 years ago

The answer to the first part of the question is

df.aggregate("Emissions|CO2|Energy")

There is also a tutorial on this, see https://pyam-iamc.readthedocs.io/en/stable/tutorials/aggregating_downscaling_consistency.html

You can also specify the components explicitly, and you can use append=True to append the aggregated data directly to df. There is also a recursive argument (though this has limitations and only works with summation, see the docs).

stefaneidelloth commented 2 years ago

Thank you for the quick reply. My example data was not good enough, so I adapted it to include more variables. I do not want to explicitly define the variables that should be aggregated; instead, they should be determined by their level, for example:

df.aggregate(level=2)

a) If there is no direct way, a possible workaround might be to first determine the list of variables at a given level and then pass it as a list:

level_1_variables = determine_variables_for_level(1)
df.aggregate(level_1_variables)

b) Specifying components seems useful only when I want to aggregate an explicit list of variables and give the result a new name.

c) Another strategy might be to first aggregate/group the pandas dataframe before converting it to pyam.

=> Is there already a method to aggregate by level or would I have to implement it on my own?

danielhuppmann commented 2 years ago

Got it, maybe something like the following:

var_list = df.filter(level=x).variable
df.aggregate(var_list)

Or if performance is critical (or you are working with a large dataset where you don't want to create a large copy)...

var_list = [v for v in df.variable if pyam.find_depth(v, level=x)]
df.aggregate(var_list)

See the docs of the variable-string utils here.
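To make the notion of "level" concrete: the depth of a variable is the number of `|` separators in its name, and a level filter can be exact (`2`), a lower bound (`"2+"`), or an upper bound (`"2-"`). The sketch below mimics that idea in plain Python. `depth` and `matches_level` are hypothetical stand-ins written for illustration, not pyam's implementation, and the return types of the pyam utilities may differ.

```python
def depth(variable: str) -> int:
    """Depth of a variable name, counted as the number of '|' separators."""
    return variable.count("|")


def matches_level(variable: str, level) -> bool:
    """Check a name against a level spec such as 2, '2+' or '2-'."""
    spec = str(level)
    if spec.endswith("+"):  # at this depth or deeper
        return depth(variable) >= int(spec[:-1])
    if spec.endswith("-"):  # at this depth or shallower
        return depth(variable) <= int(spec[:-1])
    return depth(variable) == int(spec)


names = ["Emissions|CO2", "Emissions|CO2|Cars", "Emissions|CO2|Energy|Oil"]
exactly_two = [v for v in names if matches_level(v, 2)]
two_or_deeper = [v for v in names if matches_level(v, "2+")]
```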

stefaneidelloth commented 2 years ago

The above code would require that the variables are already explicitly mentioned on that level. I don't have an entry 'Emissions|CO2|Energy' in my original example data.

However, it includes some other entries that are already aggregated, e.g. 'Emissions|CO2|Cars'.

The following script seems to work:

level = 2
full_var_list = list({pyam.reduce_hierarchy(v, level) for v in df.variable if pyam.find_depth(v, level=str(level) + '+')})
already_aggregated_var_list = [v for v in df.variable if pyam.find_depth(v, level=level)]
var_list_to_aggregate = list(set(full_var_list) - set(already_aggregated_var_list))

already_aggregated_df = df.filter(variable=already_aggregated_var_list)

df.filter(variable=already_aggregated_var_list, keep=False)\
    .aggregate(variable=var_list_to_aggregate)\
    .append(already_aggregated_df)\
    .timeseries()

If there is a more elegant way to do this, please let me know.
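For comparison, the intended result can be sketched in plain Python by summing the timeseries of all variables that share the same name truncated at the requested level. This is a simplified illustration, not pyam code: `aggregate_to_level` is a hypothetical helper, the data is the example from the first comment, and the sketch assumes that no parent variable is reported alongside its own children (otherwise values would be counted twice).

```python
# Example data from the first comment: values for 2005, 2010, 2015, 2020.
data = {
    "Emissions|CO2|Energy|Oil": [2, 3.2, 2.0, 1.8],
    "Emissions|CO2|Energy|Gas": [1.3, 1.6, 1.0, 0.7],
    "Emissions|CO2|Energy|BECCS": [0.0, 0.4, -0.4, 0.3],
    "Emissions|CO2|Foo|Oil": [2, 3.2, 2.0, 1.8],
    "Emissions|CO2|Foo|Gas": [1.3, 1.6, 1.0, 0.7],
    "Emissions|CO2|Foo|BECCS": [0.0, 0.4, -0.4, 0.3],
    "Emissions|CO2|Cars": [1.6, 3.8, 3.0, 2.5],
    "Emissions|CO2|Tar": [0.3, 0.35, 0.35, 0.33],
    "Emissions|CO2|Agg": [0.5, -0.1, -0.5, -0.7],
    "Emissions|CO2|LUC": [-0.3, -0.6, -1.2, -1.0],
}


def aggregate_to_level(timeseries, level):
    """Sum all timeseries whose names agree up to `level` pipes."""
    totals = {}
    for variable, values in timeseries.items():
        parent = "|".join(variable.split("|")[: level + 1])
        previous = totals.get(parent, [0.0] * len(values))
        totals[parent] = [p + v for p, v in zip(previous, values)]
    return totals


level_2 = aggregate_to_level(data, 2)
```

Unlike the script above, this sketch also carries over variables that are shallower than the requested level, so it is not a drop-in equivalent.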

danielhuppmann commented 2 years ago

Looks correct, though you may want to look at the recursive aggregation option again; it will be more performant on large datasets because it operates directly on the internal pandas.Series _data object.

In the example above, using the following would work:

df.aggregate("Emissions|CO2", recursive=True, append=True)
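The idea behind a recursive aggregation can be sketched as a bottom-up pass: start at the deepest level, sum the children of each parent, add any parent that is missing, and repeat one level up until the requested top-level variable is reached. The code below is a plain-Python illustration under that assumption, using only the 2005 values of the example; `aggregate_recursive` is a hypothetical helper, and pyam's actual implementation (including any validation of existing intermediate variables) may differ.

```python
# 2005 values of the example data, one number per variable.
values_2005 = {
    "Emissions|CO2|Energy|Oil": 2.0,
    "Emissions|CO2|Energy|Gas": 1.3,
    "Emissions|CO2|Energy|BECCS": 0.0,
    "Emissions|CO2|Foo|Oil": 2.0,
    "Emissions|CO2|Foo|Gas": 1.3,
    "Emissions|CO2|Foo|BECCS": 0.0,
    "Emissions|CO2|Cars": 1.6,
    "Emissions|CO2|Tar": 0.3,
    "Emissions|CO2|Agg": 0.5,
    "Emissions|CO2|LUC": -0.3,
}


def aggregate_recursive(values, top):
    """Add missing parents below `top` (and `top` itself) by bottom-up summation."""
    result = dict(values)
    prefix = top + "|"
    top_depth = top.count("|")
    max_depth = max(
        (v.count("|") for v in result if v.startswith(prefix)),
        default=top_depth,
    )
    for d in range(max_depth, top_depth, -1):
        sums = {}
        for variable, value in result.items():
            if variable.startswith(prefix) and variable.count("|") == d:
                parent = variable.rsplit("|", 1)[0]
                sums[parent] = sums.get(parent, 0.0) + value
        for parent, total in sums.items():
            # Keep a reported parent as-is; only add parents that are missing.
            result.setdefault(parent, total)
    return result


with_parents = aggregate_recursive(values_2005, "Emissions|CO2")
```

After the call, `with_parents` holds the ten original entries plus 'Emissions|CO2|Energy', 'Emissions|CO2|Foo' and 'Emissions|CO2'.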

In general, I would caution against too much automation of your workflow. There may be variables in your dataset where simple summation is not appropriate, e.g. efficiency rates or prices. Even if you don't report these data now, you may add them later and forget that they need to be treated differently.

It may be easier and safer in the long run to determine the top-level variables via inspection "by hand". Our brand-new nomenclature-iamc package is intended to manage lists of variables, see https://nomenclature-iamc.readthedocs.io/.

stefaneidelloth commented 2 years ago

Thank you for your advice. The design of the variable structure is still in progress, and I hope that our colleagues will make it strictly hierarchical and design it in a way that allows easy aggregation and validation. However, maybe that is indeed unrealistic, and a custom column "aggregation_mode" or some external kind of variable manager / variable classification would be helpful.
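One way such an "aggregation_mode" classification could look is a plain lookup table that is consulted before anything is summed. The sketch below is purely illustrative: the mode names "sum" and "skip", the helper `summable`, and the 'Price|Carbon' entry are assumptions made up for this example and are not part of pyam.

```python
# Hypothetical classification: which variables may be aggregated by summation.
AGGREGATION_MODE = {
    "Emissions|CO2|Energy|Oil": "sum",
    "Emissions|CO2|Energy|Gas": "sum",
    "Emissions|CO2|Cars": "sum",
    "Price|Carbon": "skip",  # hypothetical variable; prices must not be summed
}


def summable(variables, modes, default="skip"):
    """Return only the variables explicitly classified as summable."""
    return [v for v in variables if modes.get(v, default) == "sum"]


reported = [
    "Emissions|CO2|Energy|Oil",
    "Emissions|CO2|Cars",
    "Price|Carbon",
    "Efficiency|Boiler",  # hypothetical variable without a classification
]
to_sum = summable(reported, AGGREGATION_MODE)
```

Defaulting to "skip" means a newly added, unclassified variable is left out of the summation until someone decides how it should be treated.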

danielhuppmann commented 2 years ago

In the nomenclature package, we defined a syntax for a "variable manager" as a list of nested dictionaries: the key of the outer dictionary is the variable name, and its value is again a dictionary of attributes. There, the attribute skip-region-aggregation: true indicates that region aggregation should be skipped for that variable as part of the processing.
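Expressed as Python objects, the structure described here might look like the sketch below. Only the overall shape (a list of single-key dictionaries) and the attribute name skip-region-aggregation come from the description above; the concrete entries, the "unit" attribute, and the helper `region_aggregation_variables` are assumptions for illustration and do not show how the package itself reads or processes such definitions.

```python
# Illustrative variable definitions: variable name -> dictionary of attributes.
variable_definitions = [
    {"Emissions|CO2": {"unit": "Mt CO2/yr"}},
    {"Price|Carbon": {"unit": "USD/t CO2", "skip-region-aggregation": True}},
]


def region_aggregation_variables(definitions):
    """Return the names of variables not marked to skip region aggregation."""
    names = []
    for entry in definitions:
        for name, attributes in entry.items():
            if not attributes.get("skip-region-aggregation", False):
                names.append(name)
    return names


to_aggregate = region_aggregation_variables(variable_definitions)
```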

See this unit-test-data here for an example.

Other attributes of the dictionary can be passed to pyam's aggregate_region method, see this unit-test-data here.

[We are still working on full-fledged documentation...]

IAMconsortium / pyam

How to aggregate by level? #594