gidden closed this 1 year ago
Hmm, some odd plotting test failures:
Riffing off this comment by @coroa
A Series is pyam's backend format. I'd really like to see Series supported by the fast case.
The expensive operations in the i/o chain are converting between long and wide format. I think that the biggest performance gain would be using a file format that supports long format...
Also, format_data()
always turns a Series into a DataFrame, see
The most effective performance boost may be to simply add a direct processing route for a Series.
Thanks @coroa and @danielhuppmann - I've updated the PR as you suggested and added a set of requirements; fast_format_data()
now assumes either a Series or a multi-index wide-format DataFrame.
I haven't updated the profile for this, nor have I checked if there are other tests where this can be added as a parameter. I suspect we should probably do more of that, so please let me know if you already know of cases where we could add this.
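For concreteness, the two input shapes mentioned above could look roughly like this. This is a hedged sketch: the level names, ordering, and example values are made up for illustration and may not match pyam's actual internals.

```python
import pandas as pd

# Illustrative metadata index (names and values are hypothetical)
index = pd.MultiIndex.from_tuples(
    [
        ("model_a", "scen_1", "World", "Primary Energy", "EJ/yr"),
        ("model_a", "scen_1", "World", "Emissions|CO2", "Mt CO2/yr"),
    ],
    names=["model", "scenario", "region", "variable", "unit"],
)

# Shape 1: wide-format DataFrame -- metadata in a MultiIndex, years as columns
wide = pd.DataFrame(
    [[500.0, 520.0], [35000.0, 36000.0]],
    index=index,
    columns=pd.Index([2020, 2030], name="year"),
)

# Shape 2: long-format Series -- one value per metadata/year combination.
# The two shapes are interconvertible; stacking the year columns of the
# wide frame yields the long Series.
series = wide.stack()
print(series)
```

The point of accepting exactly these two shapes is that both already carry the metadata in an index, so no column reordering or type checks are needed.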
I'm wondering a bit about the use case and where the performance boost could come from...
I assume that the main use case is reading (large) files created by pyam, i.e. wide IAMC format, where we know that we have the "correct" ordering of columns.
So the expensive operations are (based on my limited understanding):
DataFrame.is_monotonic_increasing is fast, so I doubt that ordering would be a performance drag if the input data is already sorted. So the main issue is melt, where @coroa suggested to use stack instead. But... why not simply refactor format_data() to use stack instead of implementing a parallel method?
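The melt-to-stack refactor being discussed can be sketched like this. The frame below is a toy example with made-up names, not pyam's actual format_data() internals; it just shows why the two routes differ in cost.

```python
import pandas as pd

# A toy wide IAMC-style frame (names are hypothetical for this sketch)
idx = pd.MultiIndex.from_tuples(
    [("model_a", "scen_1", "World", "Primary Energy", "EJ/yr")],
    names=["model", "scenario", "region", "variable", "unit"],
)
wide = pd.DataFrame(
    [[500.0, 520.0]],
    index=idx,
    columns=pd.Index([2020, 2030], name="year"),
)

# The melt route: reset the index into columns, melt, rebuild the index.
# This creates several intermediate copies of the data.
via_melt = (
    wide.reset_index()
    .melt(id_vars=list(idx.names), var_name="year", value_name="value")
    .set_index([*idx.names, "year"])["value"]
    .sort_index()
)

# The stack route: pivot the year columns straight into the index,
# with no reset/melt round-trip.
via_stack = wide.stack().rename("value").sort_index()

# Both routes produce the same long-format values
assert via_melt.tolist() == via_stack.tolist() == [500.0, 520.0]
```

Because stack() keeps the existing row index in place and only moves the year columns into it, it avoids the reset_index/melt/set_index round-trip, which is a plausible source of the speedup reported below.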
Refactored and compared
Thanks @gidden - sweet that stack()
really gives such a performance boost!
In light of that, I think it would be prudent to not have an extra "fast" option - it's only a small performance boost compared to a lot of extra overhead to carry around...
I'm indifferent here - @coroa any opinions?
closing in favor of #729 and #727
Description of PR
This PR attempts to make initialization of dataframes faster if possible. The goal of this PR is to start a conversation about how to support faster initialization and provide tools to that effect.
It adds a method to initialize a dataframe with minimal checking, called fast_format_data(), and a fast kwarg to the __init__() method. I see a ~30% speed-up on a real dataset (AR6), and show the profiling of increasing data sizes (though I randomly generate the data, so we won't see effects of faster sorting with common model/scenario/variable names etc.). In the graph, N denotes the number of rows in the original wide-format dataframe. The image is generated by profile_init.py, and the largest datapoint comes from reading in the AR6 database. Note that the last point is reading in AR6 data, where I am guessing the speed-ups come from less heterogeneous data.
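The benchmarking setup described above can be approximated as follows. This is not the actual profile_init.py; the data generation and the two timed operations are a hedged reconstruction from the description (random values, N rows of wide-format data, melt vs. stack).

```python
import time

import numpy as np
import pandas as pd


def make_wide(n_rows: int, n_years: int = 10) -> pd.DataFrame:
    """Randomly generated wide-format data, roughly as described above.

    The level names and cardinalities are illustrative; the actual
    profile_init.py may generate data differently.
    """
    rng = np.random.default_rng(0)
    idx = pd.MultiIndex.from_arrays(
        [
            [f"model_{i % 10}" for i in range(n_rows)],
            [f"scenario_{i}" for i in range(n_rows)],
            ["World"] * n_rows,
            [f"variable_{i % 100}" for i in range(n_rows)],
            ["EJ/yr"] * n_rows,
        ],
        names=["model", "scenario", "region", "variable", "unit"],
    )
    years = pd.Index(range(2020, 2020 + n_years), name="year")
    return pd.DataFrame(rng.random((n_rows, n_years)), index=idx, columns=years)


def timed(func):
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


wide = make_wide(10_000)
t_melt = timed(lambda: wide.reset_index().melt(id_vars=list(wide.index.names)))
t_stack = timed(lambda: wide.stack())
print(f"melt:  {t_melt:.4f}s")
print(f"stack: {t_stack:.4f}s")
```

A single run like this is noisy; repeating each measurement and taking the minimum (as timeit does) would give more stable numbers.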
New version showing the refactor to stack (which is now called slow in the figure), which goes back to a 30% speedup for 10**7 rows, or a 35% speedup for reading in AR6, between the new implementations. Both are significantly faster than melt (e.g., fast is 70% faster than old for AR6).