IAMconsortium / pyam

Analysis & visualization of energy & climate scenarios
https://pyam-iamc.readthedocs.io/
Apache License 2.0

Refactor the `rename()` method for performance improvement #550

Closed by danielhuppmann 3 years ago

danielhuppmann commented 3 years ago


Description of PR

@byersiiasa asked me about the rename() method, which was causing some headaches in the AR6 processing workflow. I had a look and realized that it was operating on a full DataFrame rather than the indexed Series (a leftover from before the refactoring last summer).

So I refactored it to use the functions from the `pyam.index` module.
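To illustrate the idea (this is a hedged sketch with invented data, not pyam's actual internals): renaming labels in one level of a MultiIndex-ed Series touches only that index level, instead of materializing a full DataFrame and replacing values throughout.

```python
import pandas as pd

# Hypothetical example data in a pyam-like (model, region, variable) index;
# the values and names below are invented for illustration only.
idx = pd.MultiIndex.from_tuples(
    [("model_a", "World", "Primary Energy"),
     ("model_a", "World", "Emissions|CO2")],
    names=["model", "region", "variable"],
)
series = pd.Series([600.0, 40.0], index=idx)

# Series.rename with level= rewrites only the targeted index level,
# leaving the (potentially large) value array untouched
renamed = series.rename(
    index={"Primary Energy": "Primary Energy|Total"}, level="variable"
)
```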

Performance improvement

Using a 140 MB file from the ongoing ENGAGE work, a multi-dimensional rename (touching some regions and some variables) went from 14 seconds to <4 seconds (and I used pyam.compare() to assert that the results are identical).

API change

While working on this, I noticed that the current implementation can be used as an equivalent to the aggregate() method by mapping several items to the same new name, e.g., df.rename(variable={"foo": "new", "bar": "new"}). Currently, this will implicitly apply a groupby-sum, and only raise an error if a variable "new" already exists.

(This error can be silenced with check_duplicates=False, in which case groupby-sum will be applied to have a unique timeseries data index).

The new implementation shows a deprecation warning when renaming to a non-unique index, and it will be switched to an error with release 1.0.

(The override check_duplicates=False will continue to work.)
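The behaviour described above can be sketched in plain pandas (a hedged illustration with invented data, not pyam's actual code): mapping two variables to the same new name produces duplicate index entries, which a groupby-sum then aggregates into a single timeseries.

```python
import pandas as pd

# Invented example data: two variables in a (region, variable) index
idx = pd.MultiIndex.from_tuples(
    [("World", "foo"), ("World", "bar")], names=["region", "variable"]
)
series = pd.Series([1.0, 2.0], index=idx)

# Mapping both items to "new" yields a non-unique index
renamed = series.rename(index={"foo": "new", "bar": "new"}, level="variable")
assert renamed.index.duplicated().any()

# The implicit aggregation that restores a unique timeseries index
# (what passing check_duplicates=False continues to allow)
aggregated = renamed.groupby(level=["region", "variable"]).sum()
```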

codecov[bot] commented 3 years ago

Codecov Report

Merging #550 (6318a9f) into main (979dfe2) will decrease coverage by 0.0%. The diff coverage is 100.0%.


@@           Coverage Diff           @@
##            main    #550     +/-   ##
=======================================
- Coverage   93.5%   93.5%   -0.1%     
=======================================
  Files         48      48             
  Lines       5310    5321     +11     
=======================================
+ Hits        4970    4979      +9     
- Misses       340     342      +2     
Impacted Files       Coverage   Δ
pyam/logging.py      59.3% <ø>      (ø)
pyam/core.py         92.5% <100.0%> (-0.3%) ↓
pyam/index.py        97.9% <100.0%> (+0.4%) ↑
tests/test_index.py  100.0% <100.0%> (ø)

Continue to review full report at Codecov.


byersiiasa commented 3 years ago

I have done a bit of testing

smaller files (35 MB xlsx) - 11 regions rename

Large file ~30 million rows (in long format - 650 MB csv) - 335 regions rename

which is a bit strange -...

What it appears is that the time actually scales with the number of regions being renamed. Large file (old vs new):

Anyways - I think it's a good improvement - thanks @danielhuppmann !

danielhuppmann commented 3 years ago

Thanks @byersiiasa for doing more elaborate tests! A bit surprised that you don't see more of a performance improvement...

But it does make sense that doing a list comprehension with a large mapping inside is not efficient. So I added two things:

In my example, this gives another 10-20% performance improvement (now testing with a one-item dictionary, a several-item one, and an all-items one).
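One plausible form of this optimization (a sketch of the idea only; the actual change is in the PR diff): apply the rename mapping to the unique values of a MultiIndex level (`index.levels`) rather than to the full-length label array (`get_level_values`), so the dict lookup runs once per unique value instead of once per row.

```python
import pandas as pd

# Invented example index: 1000 years x 3 variables = 3000 rows, 3 unique variables
idx = pd.MultiIndex.from_product(
    [range(1000), ["foo", "bar", "baz"]], names=["year", "variable"]
)
mapping = {"foo": "new"}

# Slow path: one dict lookup per row (3000 lookups here)
slow = [mapping.get(v, v) for v in idx.get_level_values("variable")]

# Fast path: one dict lookup per unique level value (3 lookups here);
# the integer codes of the MultiIndex are reused unchanged
fast_levels = [mapping.get(v, v) for v in idx.levels[1]]
fast = idx.set_levels(fast_levels, level="variable")
```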