IAMconsortium / pyam

Analysis & visualization of energy & climate scenarios
https://pyam-iamc.readthedocs.io/
Apache License 2.0

initial attempt at a fast init #726

Closed · gidden closed this 1 year ago

gidden commented 1 year ago


Description of PR

This PR attempts to make initialization of dataframes faster where possible. The goal is to start a conversation about how to support faster initialization and to provide tools to that effect.

It adds a method called fast_format_data() that initializes a dataframe with minimal checking, and a fast kwarg to the __init__() method. I see a ~30% speed-up on a real dataset (AR6), and show profiling results for increasing data sizes (though the data is randomly generated, so we won't see the effect of faster sorting with common model/scenario/variable names, etc.). In the graph, N denotes the number of rows in the original wide-format dataframe. The image is generated by profile_init.py, and the largest datapoint comes from reading in the AR6 database.

[Figure: profile_init, initialization time vs. number of rows N]

Note that the last point is reading in AR6 data, where I am guessing the speed-up comes from less heterogeneous data.
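For context, a minimal sketch of how the proposed fast path might be used; the exact signature of the fast keyword and the inputs it accepts may differ from what ends up in the PR:

```python
import pandas as pd
from pyam import IamDataFrame

# a single timeseries in wide IAMC format: one row, one column per year
data = pd.DataFrame(
    [["model_a", "scen_a", "World", "Primary Energy", "EJ/yr", 510.0, 720.0]],
    columns=["model", "scenario", "region", "variable", "unit", 2020, 2050],
)

df_default = IamDataFrame(data)          # full validation (status quo)
df_fast = IamDataFrame(data, fast=True)  # minimal checking via the proposed fast path
```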

New version showing the refactor to stack (now labelled slow in the figure): between the two new implementations, the speed-up goes back to 30% for 10**7 rows, or 35% for reading in AR6. Both are significantly faster than the melt-based version (e.g., fast is 70% faster than old for AR6).

[Figure: profile_init, updated comparison of the fast, slow (stack-based), and old (melt-based) implementations]
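The measurement itself is a simple timing loop; a rough sketch of the kind of benchmark profile_init.py runs (the data generation here is illustrative, not the actual script):

```python
import time

import numpy as np
import pandas as pd
from pyam import IamDataFrame

def random_wide_data(n_rows, years=tuple(range(2010, 2101, 10))):
    """Build a random wide-format IAMC dataframe with n_rows timeseries."""
    meta = pd.DataFrame(
        {
            "model": [f"model_{i}" for i in np.random.randint(0, 10, n_rows)],
            "scenario": [f"scen_{i}" for i in range(n_rows)],  # unique per row
            "region": "World",
            "variable": [f"var_{i}" for i in np.random.randint(0, 100, n_rows)],
            "unit": "EJ/yr",
        }
    )
    values = pd.DataFrame(np.random.rand(n_rows, len(years)), columns=list(years))
    return pd.concat([meta, values], axis=1)

for n in (10**4, 10**5, 10**6):
    data = random_wide_data(n)
    start = time.perf_counter()
    IamDataFrame(data)
    print(f"N={n:>8}: {time.perf_counter() - start:.2f}s")
```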

gidden commented 1 year ago

Hmm, some odd plotting test failures:

danielhuppmann commented 1 year ago

Riffing off this comment by @coroa:

A Series is pyam's backend format. I'd really like to see Series supported by the fast case.
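For reference, that internal format is (roughly) a long-format pd.Series with the IAMC dimensions as index levels; a hand-built example of what such a Series looks like:

```python
import pandas as pd

# long-format timeseries data: one value per
# (model, scenario, region, variable, unit, year) combination
index = pd.MultiIndex.from_tuples(
    [
        ("model_a", "scen_a", "World", "Primary Energy", "EJ/yr", 2020),
        ("model_a", "scen_a", "World", "Primary Energy", "EJ/yr", 2050),
    ],
    names=["model", "scenario", "region", "variable", "unit", "year"],
)
series = pd.Series([510.0, 720.0], index=index, name="value")
```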

The expensive operations in the I/O chain are converting between long and wide format. I think that the biggest performance gain would come from using a file format that supports long format...

Also, format_data() always turns a Series into a DataFrame, see

https://github.com/IAMconsortium/pyam/blob/a9bb3c3b996576e2f081dd7365b288d1a0a9f6cf/pyam/utils.py#L186

The most effective performance boost may simply be to add a direct processing route for a Series.
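A minimal sketch of what such a direct route could look like, assuming the incoming Series already carries exactly the IAMC index levels (the names and checks here are illustrative, not the actual format_data() code):

```python
import pandas as pd

IAMC_LEVELS = ["model", "scenario", "region", "variable", "unit", "year"]

def format_series(data: pd.Series) -> pd.Series:
    """Process a long-format Series directly, skipping the DataFrame round-trip."""
    missing = set(IAMC_LEVELS) - set(data.index.names)
    if missing:
        raise ValueError(f"missing index levels: {missing}")
    # align level order, drop empty values, and sort the index for fast lookups
    return data.reorder_levels(IAMC_LEVELS).dropna().sort_index().rename("value")
```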

gidden commented 1 year ago

Thanks @coroa and @danielhuppmann - I've updated the PR as you suggested and added a set of requirements; fast_format_data() now assumes either a Series or a multi-indexed wide-format dataframe.
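For illustration, the second accepted input, a wide-format dataframe with the non-time dimensions as a MultiIndex and one column per year, looks roughly like this (the precise requirements are whatever the PR ends up enforcing):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("model_a", "scen_a", "World", "Primary Energy", "EJ/yr")],
    names=["model", "scenario", "region", "variable", "unit"],
)
wide = pd.DataFrame([[510.0, 720.0]], index=index, columns=[2020, 2050])

# either form should be accepted (function name as proposed in this PR):
# fast_format_data(wide) or fast_format_data(wide.rename_axis(columns="year").stack())
```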

I haven't updated the profile for this, nor have I checked whether there are other tests where this can be added as a parameter. I suspect we should do more of that, so please let me know if you already know of cases where we could add this.

danielhuppmann commented 1 year ago

I'm wondering a bit about the use case and where the performance boost could come from...

I assume that the main use case is reading (large) files created by pyam, i.e. in the wide IAMC format, where we know that we have the "correct" ordering of columns.

So the expensive operations are (based on my limited understanding):

So the main issue is melt, where @coroa suggested using stack instead. But... why not simply refactor format_data() to use stack instead of implementing a parallel method?
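To make the difference concrete, a small sketch of the two ways of converting wide IAMC data to the long format (the actual format_data() does more bookkeeping, but both routes yield the same Series):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("model_a", "scen_a", "World", "Primary Energy", "EJ/yr")],
    names=["model", "scenario", "region", "variable", "unit"],
)
wide = pd.DataFrame([[510.0, 720.0]], index=index, columns=[2020, 2050])

# melt-based conversion: reset the index, melt to long format, rebuild the index
long_melt = (
    wide.reset_index()
    .melt(id_vars=list(index.names), var_name="year", value_name="value")
    .set_index(list(index.names) + ["year"])["value"]
)

# stack-based conversion: move the year columns directly into the index
long_stack = wide.rename_axis(columns="year").stack()

assert long_stack.equals(long_melt)  # same long-format result
```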

gidden commented 1 year ago

Refactored and compared

danielhuppmann commented 1 year ago

Thanks @gidden - sweet that stack() really gives such a performance boost!

In light of that, I think it would be prudent not to have an extra "fast" option - it's only a small performance boost compared to a lot of extra overhead to carry around...

gidden commented 1 year ago

I'm indifferent here - @coroa any opinions?

gidden commented 1 year ago

Closing in favor of #729 and #727.