Closed · danielhuppmann closed this 3 years ago
Merging #550 (6318a9f) into main (979dfe2) will decrease coverage by 0.0%. The diff coverage is 100.0%.
```diff
@@           Coverage Diff           @@
##            main    #550    +/-   ##
=======================================
- Coverage   93.5%   93.5%   -0.1%
=======================================
  Files         48      48
  Lines       5310    5321     +11
=======================================
+ Hits        4970    4979      +9
- Misses       340     342      +2
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| pyam/logging.py | 59.3% <ø> (ø) | |
| pyam/core.py | 92.5% <100.0%> (-0.3%) | :arrow_down: |
| pyam/index.py | 97.9% <100.0%> (+0.4%) | :arrow_up: |
| tests/test_index.py | 100.0% <100.0%> (ø) | |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 979dfe2...6318a9f. Read the comment docs.
I have done a bit of testing:
- smaller file (35 MB xlsx) - 11 regions renamed
- large file, ~30 million rows (in long format - 650 MB csv) - 335 regions renamed

which is a bit strange -...
It appears that the time actually scales with the number of regions being renamed. Large file (old vs new):
Anyway - I think it's a good improvement - thanks @danielhuppmann!
Thanks @byersiiasa for doing more elaborate tests! Bit surprised that you don't see more of a performance improvement...
But it does make sense that doing a list comprehension with a large mapping inside is not efficient. So I added two things, one being to call the `replace_index_values()` function in a nested manner. In my example, this gives another 10-20% performance improvement (now testing a one-item dictionary, a several-item dictionary, and an all-items dictionary).
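To make the efficiency argument concrete, here is a minimal sketch of replacing values in one level of a pandas MultiIndex. This is an illustrative stand-in, not the actual pyam `replace_index_values()` implementation: the point is that the mapping is applied only to the *unique* level values and then broadcast via the level codes, instead of running a list comprehension over every row.

```python
import pandas as pd


def replace_index_values(index, level, mapping):
    """Replace values in one level of a MultiIndex (illustrative sketch).

    The mapping is applied to the unique level values only (cheap even
    for millions of rows) and broadcast back via the integer codes.
    """
    n = index.names.index(level)
    # rename only the unique values of this level
    new_level = pd.Index([mapping.get(v, v) for v in index.levels[n]])
    arrays = [index.get_level_values(name) for name in index.names]
    # broadcast the renamed unique values back to row length via the codes
    arrays[n] = new_level.take(index.codes[n])
    return pd.MultiIndex.from_arrays(arrays, names=index.names)
```

Note that a many-to-one mapping (several old names to one new name) simply produces a non-unique index here, which the calling code then has to handle.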
Please confirm that this PR has done the following:
Description of PR
@byersiiasa asked me about the `rename()` method, which was causing some headaches in the AR6 processing workflow. I had a look and realized that it was operating on a full DataFrame rather than the indexed Series (still from before the refactoring last summer). So I refactored it to use the pyam.index-module functions.
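A minimal pandas sketch of the idea (the data and names here are invented, and the actual pyam code uses its own index-module helpers): renaming on the index of a long-format Series only touches the index structure, not a full DataFrame of values.

```python
import pandas as pd

# Hypothetical long-format timeseries held as a MultiIndex-ed Series
s = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.MultiIndex.from_tuples(
        [("m", "r1", 2020), ("m", "r2", 2020), ("m", "r1", 2030)],
        names=["model", "region", "year"],
    ),
)

# Rename labels in one index level; the data values are never copied around
renamed = s.rename(index={"r1": "region_1"}, level="region")
```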
Performance improvement
Using a 140 MB file from the ongoing ENGAGE work, a multi-dimensional rename (touching some regions and some variables) went from 14 seconds to <4 seconds (and I used `pyam.compare()` to assert that the results are identical).

API change
While working on this, I noticed that the current implementation can be used as an equivalent to the `aggregate()` method by mapping several items to the same new name, e.g., `df.rename(variable={"foo": "new", "bar": "new"})`. Currently, this will implicitly apply a groupby-sum and only raise an error if a variable "new" already exists. (This error can be silenced with `check_duplicates=False`, in which case groupby-sum is applied to obtain a unique timeseries data index.) The new implementation shows a deprecation warning when renaming to a non-unique index, and it will switch to an error with release 1.0. (The override `check_duplicates=False` will continue to work.)
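To illustrate the implicit aggregation with plain pandas (a toy sketch, not the pyam internals): mapping two old names to the same new name yields a non-unique index, which a groupby-sum then collapses back to a unique one.

```python
import pandas as pd

# Toy data: two variables that will be renamed to the same new name
s = pd.Series([1.0, 2.0], index=pd.Index(["foo", "bar"], name="variable"))

# The rename produces a non-unique index ...
renamed = s.rename(index={"foo": "new", "bar": "new"})

# ... which an implicit groupby-sum collapses to a unique index again
aggregated = renamed.groupby(level="variable").sum()
```

This is exactly the behaviour that the deprecation warning now flags: the aggregation happens silently unless "new" already exists in the data.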