IAMconsortium / pyam

Analysis & visualization of energy & climate scenarios
https://pyam-iamc.readthedocs.io/
Apache License 2.0

Refactor the `rename()` method for performance improvement #550

Closed by danielhuppmann 3 years ago

danielhuppmann commented 3 years ago


Description of PR

@byersiiasa asked me about the rename() method, which was causing some headaches in the AR6 processing workflow. I had a look and realized that it was operating on a full DataFrame rather than the indexed Series (a leftover from before the refactoring last summer).

So I refactored it to use the functions from the `pyam.index` module.
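To illustrate the idea (this is a hedged sketch with invented data, not pyam's actual internals): renaming labels in one level of a MultiIndex-ed Series touches only that index level, instead of materializing a full DataFrame and replacing values throughout.

```python
import pandas as pd

# Hypothetical example data in a pyam-like (model, region, variable) index;
# the values and names below are invented for illustration only.
idx = pd.MultiIndex.from_tuples(
    [("model_a", "World", "Primary Energy"),
     ("model_a", "World", "Emissions|CO2")],
    names=["model", "region", "variable"],
)
series = pd.Series([600.0, 40.0], index=idx)

# Series.rename with level= rewrites only the targeted index level,
# leaving the (potentially large) value array untouched
renamed = series.rename(
    index={"Primary Energy": "Primary Energy|Total"}, level="variable"
)
```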

Performance improvement

Using a 140 MB file from the ongoing ENGAGE work, a multi-dimensional rename (touching some regions and some variables) went from 14 seconds to <4 seconds (and I used pyam.compare() to assert that the results are identical).

API change

While working on this, I noticed that the current implementation can be used as an equivalent to the aggregate() method by mapping several items to the same new name, e.g., df.rename(variable={"foo": "new", "bar": "new"}). Currently, this will implicitly apply a groupby-sum, and only raise an error if a variable "new" already exists.

(This error can be silenced with check_duplicates=False, in which case groupby-sum will be applied to have a unique timeseries data index).

The new implementation shows a deprecation warning when renaming to a non-unique index, and it will be switched to an error with release 1.0.

(The override check_duplicates=False will continue to work.)
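The behaviour described above can be sketched in plain pandas (a hedged illustration with invented data, not pyam's actual code): mapping two variables to the same new name produces duplicate index entries, which a groupby-sum then aggregates into a single timeseries.

```python
import pandas as pd

# Invented example data: two variables in a (region, variable) index
idx = pd.MultiIndex.from_tuples(
    [("World", "foo"), ("World", "bar")], names=["region", "variable"]
)
series = pd.Series([1.0, 2.0], index=idx)

# Mapping both items to "new" yields a non-unique index
renamed = series.rename(index={"foo": "new", "bar": "new"}, level="variable")
assert renamed.index.duplicated().any()

# The implicit aggregation that restores a unique timeseries index
# (what passing check_duplicates=False continues to allow)
aggregated = renamed.groupby(level=["region", "variable"]).sum()
```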

codecov[bot] commented 3 years ago

Codecov Report

Merging #550 (6318a9f) into main (979dfe2) will decrease coverage by 0.0%. The diff coverage is 100.0%.


@@           Coverage Diff           @@
##            main    #550     +/-   ##
=======================================
- Coverage   93.5%   93.5%   -0.1%     
=======================================
  Files         48      48             
  Lines       5310    5321     +11     
=======================================
+ Hits        4970    4979      +9     
- Misses       340     342      +2     
Impacted Files       Coverage   Δ
pyam/logging.py      59.3% <ø>      (ø)
pyam/core.py         92.5% <100.0%> (-0.3%) ↓
pyam/index.py        97.9% <100.0%> (+0.4%) ↑
tests/test_index.py  100.0% <100.0%> (ø)

Continue to review full report at Codecov.


byersiiasa commented 3 years ago

I have done a bit of testing

smaller files (35 MB xlsx) - 11 regions rename

Large file ~30 million rows (in long format - 650 MB csv) - 335 regions rename

which is a bit strange -...

What it appears is that the time actually scales with the number of regions being renamed. Large file (old vs new):

Anyways - I think it's a good improvement - thanks @danielhuppmann !

danielhuppmann commented 3 years ago

Thanks @byersiiasa for doing more elaborate tests! A bit surprised that you don't see more of a performance improvement...

But it does make sense that doing a list comprehension with a large mapping inside is not efficient. So I added two things:

In my example, this gives another 10-20% performance improvement (now testing with a one-item dictionary, a several-item one, and an all-items one).
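One plausible form of this optimization (a sketch of the idea only; the actual change is in the PR diff): apply the rename mapping to the unique values of a MultiIndex level (`index.levels`) rather than to the full-length label array (`get_level_values`), so the dict lookup runs once per unique value instead of once per row.

```python
import pandas as pd

# Invented example index: 1000 years x 3 variables = 3000 rows, 3 unique variables
idx = pd.MultiIndex.from_product(
    [range(1000), ["foo", "bar", "baz"]], names=["year", "variable"]
)
mapping = {"foo": "new"}

# Slow path: one dict lookup per row (3000 lookups here)
slow = [mapping.get(v, v) for v in idx.get_level_values("variable")]

# Fast path: one dict lookup per unique level value (3 lookups here);
# the integer codes of the MultiIndex are reused unchanged
fast_levels = [mapping.get(v, v) for v in idx.levels[1]]
fast = idx.set_levels(fast_levels, level="variable")
```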