Closed dpvanbalen closed 3 years ago
Some quick benchmarks:
All versions were tested with arrays of 2^24, 2^24+1, 2^30, and 2^30+1 Int32s. The two tests were a fold and a scan, both using const as a minimal combination function. Time is measured as the sum of the kernel execution times (P1+P2+P3 for scan, M1+M2 for fold) as reported by nvprof. Measurements were done by hand (and thus only once), but the differences are big enough to confidently say they are not within the variation. This is reflected by the execution times of the "+1" versions very closely matching those of the round sample sizes (with one exception, where a shuffle-based fold sped up by adding this element).
fold
Speed differences of 23.6%, 23.9%, 16.5% and 21.6% respectively.
scan
Speed differences of 5.3%, 4.8%, 3.5% and 3.3% respectively.
Overall, folds seem to profit more from this change than scans do.
Sorry this took so long for me to get to! >.<
Thanks for the thorough review, this looks very clean!
Description
Motivation and context
Shuffle instructions are by now available on most GPUs and should be significantly faster than the equivalent shared-memory solutions.
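For context, this is a hedged sketch of the general technique (hand-written CUDA, not the code this PR generates): a warp-level fold that exchanges values through registers via `__shfl_down_sync` instead of staging them in shared memory.

```cuda
// Sketch only: illustrates shuffle-based warp reduction, not this PR's
// generated code. Each warp folds its 32 values without shared memory.
__device__ int warpFold(int x)
{
    // Tree reduction: log2(32) = 5 shuffle steps. Each step, every lane
    // reads the value held by the lane `offset` positions above it.
    for (int offset = 16; offset > 0; offset >>= 1)
        x += __shfl_down_sync(0xffffffff, x, offset);
    return x;  // lane 0 now holds the warp's combined result
}
```

The shared-memory version needs a staging buffer plus synchronization between steps, which is where the measured speedup comes from.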
How has this been tested?
Nofib passes with both the shfl_sync and the smem versions. Testing the shfl (without _sync) version is more difficult, as it requires an old GPU (I think; maybe just an old version of CUDA).
Types of changes
Checklist:
TODO:
Check whether the width computations are needed, or whether some magic constant suffices for all warp sizes. Note that any CUDA implementation, at least when compiled via clang, also does these computations on every shuffle invocation.
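To make the TODO concrete, this is my understanding of the intrinsic involved (a sketch of the CUDA API, not code from this patch): the shuffle intrinsics take an optional `width` argument, and supporting sub-warp widths is what those extra computations pay for.

```cuda
// Sketch: the width parameter of the CUDA shuffle intrinsics.
// width == warpSize (the default) shuffles across the whole warp;
// a smaller power of two splits the warp into independent segments.
__device__ int segmentedShflDown(int x, int offset, int width)
{
    // When width is the compile-time constant 32, the segment
    // bookkeeping folds into a single magic constant; for dynamic
    // widths it must be recomputed on every shuffle, as noted above.
    return __shfl_down_sync(0xffffffff, x, offset, width);
}
```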