Performance improvement for region-aggregation processing

IAMconsortium / nomenclature

A package to work with IAMC-style variable templates

https://nomenclature-iamc.readthedocs.io/

Apache License 2.0


Performance improvement for region-aggregation processing #48

Closed danielhuppmann closed 2 years ago

danielhuppmann commented 2 years ago

The current implementation iterates over all common regions, creates the variable-kwargs dictionary inside that loop, and then iterates over each variable.

Two ways to significantly improve performance:

1. Create the variable-kwargs dictionary once, before iterating over the common regions.
2. The pyam aggregate_region() method can take a list of variables if there are no additional arguments (weight, method, ...). So the variable-kwargs dictionary could be split into a "summed variables" list plus an "other-method variables" dictionary.
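A minimal sketch of the split proposed in point 2, assuming the variable-kwargs dictionary maps each variable name to a (possibly empty) dictionary of extra aggregation arguments. The function name `split_variable_kwargs` and the example variables are hypothetical and not taken from the nomenclature code base.

```python
from typing import Any, Dict, List, Tuple

VarKwargs = Dict[str, Dict[str, Any]]


def split_variable_kwargs(var_kwargs: VarKwargs) -> Tuple[List[str], VarKwargs]:
    """Split variables into a plain-sum list and an other-method dictionary.

    Hypothetical helper: a variable with no extra arguments (no weight,
    method, ...) goes into the list; every other variable keeps its kwargs.
    """
    summed = [var for var, kwargs in var_kwargs.items() if not kwargs]
    other = {var: kwargs for var, kwargs in var_kwargs.items() if kwargs}
    return summed, other


# Illustrative input, built once before looping over the common regions
example = {
    "Primary Energy": {},
    "Final Energy": {},
    "Price|Carbon": {"weight": "Emissions|CO2"},
}
summed, other = split_variable_kwargs(example)

# Assumed usage inside the loop over common regions (not executed here):
#   df.aggregate_region(summed, region=..., subregions=...)
#   for var, kwargs in other.items():
#       df.aggregate_region(var, region=..., subregions=..., **kwargs)
```

With this layout, only the variables that need a weight or a non-default method are still handled one by one.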

phackstock commented 2 years ago

Ad 1. Ah yes, that's a blunder on my side, sorry for that. Just issued #50 that addresses this.

Ad 2. That was my initial design, but I found it clearer to read if I just have a single variable dictionary that I iterate over and then pass the kwargs to pyam.IamDataFrame.aggregate_region() instead of creating two. Would it actually bring a performance boost if we gave it a list of variables, or would it just delegate the looping from nomenclature to pyam?

danielhuppmann commented 2 years ago

Would it actually bring a performance boost if we gave it a list of variables or just delegate the looping from nomenclature to pyam?

Yes, because pyam doesn’t iterate over the variables - it calls the pandas groupby() only once for the whole list of variables.
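To illustrate the difference, here is a small pandas-only sketch. It is not pyam's actual implementation; the column names and values are made up for the example. It compares one groupby per variable in a Python loop against a single groupby over the whole list of variables.

```python
import pandas as pd

# Made-up long-format data, loosely resembling timeseries data by region
data = pd.DataFrame(
    {
        "variable": ["Primary Energy", "Primary Energy", "Final Energy", "Final Energy"],
        "region": ["region_a", "region_b", "region_a", "region_b"],
        "year": [2020, 2020, 2020, 2020],
        "value": [10.0, 5.0, 7.0, 3.0],
    }
)
variables = ["Primary Energy", "Final Energy"]

# Option A: Python-level loop, one filter and one groupby per variable
per_variable = [
    data[data["variable"] == var].groupby(["variable", "year"])["value"].sum()
    for var in variables
]
looped = pd.concat(per_variable).sort_index()

# Option B: one filter and a single groupby for the whole list of variables
batched = (
    data[data["variable"].isin(variables)]
    .groupby(["variable", "year"])["value"]
    .sum()
    .sort_index()
)

# Both options sum over the regions and give the same totals
assert looped.equals(batched)
```

Both options produce the same result; the second one runs the filtering and grouping machinery once instead of once per variable, which is where the saving would come from for long variable lists.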

phackstock commented 2 years ago

Then the question would be how pandas solves the groupby, because at some point there has to be a loop. It might not be implemented in Python, though, and might therefore be faster than a native Python loop.