Closed: computablee closed this pull request 10 months ago
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (e140dc3) 99.12% compared to head (ad2c447) 99.21%. Report is 7 commits behind head on main.
Further performance testing of the collapsed-loop updates suggests that performance may actually be worse for average use cases. I am holding off on merging this until more data can be collected.
Division-by-multiplication was removed from collapsed loops in favor of manual iteration. This avoids expensive operations such as division, modulo, and multiplication on every iteration. The performance improvements from this are dramatic: well over 2x across the board for different approaches to loops.
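To illustrate the difference described above, here is a minimal Python sketch of the two ways to recover `(i, j)` inside one chunk of a two-level collapsed loop. It is not the library's code: the function names are hypothetical, and it assumes both loops start at 0 and that the inner loop has `inner_len` iterations.

```python
def collapsed_divmod(start, end, inner_len):
    """Recover (i, j) from every flat index with a division and a modulo."""
    for flat in range(start, end):
        yield flat // inner_len, flat % inner_len


def collapsed_manual(start, end, inner_len):
    """Recover (i, j) once per chunk, then advance the indices by hand."""
    i, j = divmod(start, inner_len)  # the only division in the whole chunk
    for _ in range(end - start):
        yield i, j
        j += 1
        if j == inner_len:  # inner index wrapped: advance the outer index
            j = 0
            i += 1
```

Both generators visit the same index pairs in the same order; the second one pays for the division only once per chunk instead of once per iteration, which is the saving the comment above is referring to.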
`Collapse(3)` was also optimized, although it remains untested. I would be very shocked if performance gains were anything less than 3x. `Collapse(4)` and `Collapse(n)` remain unoptimized due to code complexity. There should be a writeup discussing the "yes"s and "no"s of the library as far as performance goes. `Collapse(4)` or higher is definitely a "no" for lightweight loops due to the extreme overhead of calculating indices.
Optimizing high-dimension collapsed loops shouldn't be too difficult if I get requests for it. It would certainly be a far easier approach than prior iterations of the collapsed chunk executor. I am opening an issue and will do this later.
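One way the manual-iteration idea might generalize to an arbitrary number of collapsed loops is an odometer-style carry. The following Python sketch is only an assumption about how that could look, not the planned implementation; `collapsed_nd` and `dims` are hypothetical names, and every loop is assumed to start at 0.

```python
def collapsed_nd(start, end, dims):
    """Yield index tuples for flat indices start..end-1 over loops of sizes dims."""
    # Decompose the starting flat index once, innermost dimension first.
    idx = [0] * len(dims)
    rem = start
    for k in range(len(dims) - 1, -1, -1):
        rem, idx[k] = divmod(rem, dims[k])

    for _ in range(end - start):
        yield tuple(idx)
        # Increment the innermost index and carry outward when it wraps.
        k = len(dims) - 1
        while k >= 0:
            idx[k] += 1
            if idx[k] < dims[k]:
                break
            idx[k] = 0
            k -= 1
```

The per-chunk setup still costs one division per dimension, but each iteration afterwards usually touches only the innermost index, so the cost no longer grows with the number of collapsed loops on the common path.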
Which issue are you addressing?
Significant performance improvements, new work-stealing scheduler.
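As background on the term, the general work-stealing idea can be sketched in a few lines of Python. This is a generic illustration under simplifying assumptions (one lock per queue, chunks dealt round-robin up front, no new work created while running); it does not reflect how this PR's scheduler is implemented, and `run_work_stealing` and its parameters are hypothetical names.

```python
import threading
from collections import deque


def run_work_stealing(num_iters, num_workers, chunk_size, body):
    """Run body(i) for i in range(num_iters) on num_workers threads."""
    # Split the iteration space into chunks and deal them out round-robin.
    queues = [deque() for _ in range(num_workers)]
    locks = [threading.Lock() for _ in range(num_workers)]
    for n, lo in enumerate(range(0, num_iters, chunk_size)):
        queues[n % num_workers].append((lo, min(lo + chunk_size, num_iters)))

    def take(worker_id):
        # Prefer the newest chunk from the worker's own queue...
        with locks[worker_id]:
            if queues[worker_id]:
                return queues[worker_id].pop()
        # ...otherwise steal the oldest chunk from another worker.
        for step in range(1, num_workers):
            victim = (worker_id + step) % num_workers
            with locks[victim]:
                if queues[victim]:
                    return queues[victim].popleft()
        return None  # every queue is empty, so there is nothing left to hand out

    def worker(worker_id):
        while (chunk := take(worker_id)) is not None:
            for i in range(chunk[0], chunk[1]):
                body(i)

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A call such as `run_work_stealing(100, 4, 7, print)` runs every iteration exactly once, in an order that depends on thread timing. Workers that finish their own chunks early pick up chunks from slower workers, which is what makes this style of scheduling attractive for loops with uneven iteration cost.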
How have you addressed the issue?
This PR implements the `WorkStealingScheduler` class for parallel for loops which use a work-stealing scheduler. Much of the scheduling code has undergone serious optimization, including a 40% improvement in a particular benchmark for `static` scheduling with `chunk_size=1`. Improvements were made to collapsed loops as well, incorporating division-by-multiplication. More testing is required here.
How have you tested your patch?
Unit tests have been written where necessary, and all unit tests pass.