IAMconsortium / pyam

Analysis & visualization of energy & climate scenarios
https://pyam-iamc.readthedocs.io/
Apache License 2.0

Bulk aggregate #355

Open znicholls opened 4 years ago

znicholls commented 4 years ago

In https://github.com/znicholls/silicone/pull/72, @Rlamboll has added a feature to do bulk aggregation within an IamDataFrame. It is a convenience function, but it means you can quickly specify a set of aggregate variables to calculate (some of which are more complex than a pure sum) without writing a custom wrapper every time (current implementation here, work in progress though).

@danielhuppmann Does this feature already exist? If not, is it something you'd be interested in bringing across?

danielhuppmann commented 4 years ago

Thanks @znicholls for the cross-reference. This is similar to the recent improvements of aggregate() (#305 & #312), which now supports a method arg (min, max, weighted sum) as well as "bulk" aggregation by passing a list of variables to be aggregated. The same applies to aggregate_region(), the related check_*() functions and the downscale_region() function.
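For illustration, a minimal sketch of what the current interface supports (the variable and component names here are just placeholders):

# each variable in the list is aggregated over its "obvious" sub-categories
df.aggregate(["Emissions|CO2", "Emissions|CH4"], append=True)

# a single variable from an explicit component list, with a non-default method
df.aggregate(
    "Price|Carbon",
    components=["Price|Carbon|Demand", "Price|Carbon|Supply"],
    method="max",
    append=True,
)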

The current implementation in pyam, though, only works with a list of variables when you do the "obvious" aggregation, i.e., all subcategories of each variable - whereas the implementation in silicone takes a dict(variable=components). This would be a useful addition to pyam, I think.

As for the second feature, I have some concerns that this would be overloading the function. If the multiplication relates to conversion to CO2-equivalents, I'd rather use separate functions for conversion and aggregation, for example:

_df = df.convert_unit('Mt CH4', 'Mt CO2e', context='gwp_AR5GWP100')
_df.aggregate('Kyoto GHG', components=['Emissions|CO2', 'Emissions|CH4'])

Or the first step could use the dataframe-operations feature (work in progress by @gidden, see #333)...

Rlamboll commented 4 years ago

The main use of the factors is to do subtraction with -1, e.g.

aggregate = "Emissions|CO2|Other"
other_CO2 = mi.infill_composite_values(
    sr15_data,
    {
        aggregate: {
            "Emissions|CO2": 1,
            "Emissions|CO2|Energy and Industrial Processes": -1,
            "Emissions|CO2|AFOLU": -1,
        }
    },
)

I just allow any multiple in case people want to do other things like weightings. It could be restricted to a sign if you want.

znicholls commented 4 years ago

Yep, I completely agree with all of that. I wasn't actually thinking of altering aggregate directly, but rather of creating a new method or utility function that would wrap the operations done by @gidden and/or aggregate. Does that seem like a sensible way forward to you?

danielhuppmann commented 4 years ago

I'm not quite sure what you have in mind and whether it's worth the additional maintenance overhead if it's just a wrapper for two or three existing functions.

Can you specify the suggested function name and the API (kwargs and returned object)?

znicholls commented 4 years ago

whether it's worth the additional maintenance overhead if it's just a wrapper for two or three existing functions

I'm also not sure.

My current thoughts (@Rlamboll may have others)

def bulk_aggregate(iamdf, aggregates):
    """
    Aggregate variables within a :obj:`pyam.IamDataFrame`

    This convenience function allows a number of aggregate variables to be
    calculated from the data within a :obj:`pyam.IamDataFrame`. The
    aggregation is flexible, allowing users to write potentially complex
    algorithms.

    Parameters
    ----------
    iamdf : :obj:`pyam.IamDataFrame`
        :obj:`pyam.IamDataFrame` containing the data from which the aggregates can be calculated

    aggregates : dict{str: dict{str: float}}
        Dictionary specifying how to calculate the aggregates. Each key is the
        name of an aggregate variable to be calculated. Each value is itself a
        dictionary: its keys are variables which already exist in ``iamdf``
        and its values are constants by which that variable's data is
        multiplied before being included in the aggregate (i.e. the sum).

    Returns
    -------
    :obj:`pyam.IamDataFrame`
        :obj:`pyam.IamDataFrame` containing the aggregate data (can be
        appended to the source :obj:`pyam.IamDataFrame` by the user if
        desired).

    Examples
    --------
    # simply take the aggregate of multiple variables
    bulk_aggregate(
        iamdf=input_df,
        aggregates={
            "Emissions|CO2": {
                "Emissions|CO2|Industrial": 1, 
                "Emissions|CO2|AFOLU": 1
            },
            "Emissions|CH4": {
                "Emissions|CH4|Industrial": 1, 
                "Emissions|CH4|AFOLU": 1
            },
        },
    )

    # one variable is the difference between two others; another is the
    # sum of one variable plus two times another
    bulk_aggregate(
        iamdf=input_df,
        aggregates={
            "Emissions|CO2|AFOLU": {
                "Emissions|CO2": 1, 
                "Emissions|CO2|Industrial": -1
            },
            "Emissions|SOx (RF weighted)": {
                "Emissions|SOx|Industrial": 2, 
                "Emissions|SOx|AFOLU": 1,
            },
        },
    )
    """
Rlamboll commented 4 years ago

Yeah, I don't know that there's a pressing need for it other than to infill values defined by consistency conditions, which is very much something you'd download silicone to do. I don't mind it going into pyam wholesale, but duplicating it in both pyam and silicone with slightly different input options feels time-consuming.

znicholls commented 4 years ago

Just for completeness, a big benefit of having it in pyam is that you have a bigger team of maintainers and users. Happy with whatever though, just wanted to ask the question.

danielhuppmann commented 4 years ago

Yes, pushing features that would be useful to many users "upstream" is definitely welcome in principle - this one is maybe just too specific to the infilling use-case with factors.

But let me reiterate that an extension of aggregate() and similar functions to take a mapping dictionary would be welcome, i.e., if you have a mapping = {variable: [<list of components>]}, currently one needs to do:

for v, lst in mapping.items():
    df.aggregate(v, lst)

This could be streamlined to allow:

df.aggregate(mapping)
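
For example (variable names purely illustrative):

mapping = {
    "Emissions|CO2": ["Emissions|CO2|Energy", "Emissions|CO2|AFOLU"],
    "Primary Energy": ["Primary Energy|Coal", "Primary Energy|Gas", "Primary Energy|Wind"],
}
df.aggregate(mapping, append=True)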