AccelerateHS / accelerate-llvm

LLVM backend for Accelerate
http://www.acceleratehs.org
Other
155 stars 50 forks source link

Implement shuffle-based scan and fold #64

Closed dpvanbalen closed 3 years ago

dpvanbalen commented 3 years ago

Description

Motivation and context

Shuffles are, by now, available on most GPU's and should be significantly faster than shared-memory based equivalent solutions.

How has this been tested?

Nofib passes, using both the shfl_sync and the smem versions. Testing the shfl (without _sync) version is more difficult, as it requires an old GPU (I think... Maybe just an old version of CUDA).

Types of changes

Checklist:

TODO:

dpvanbalen commented 3 years ago

Some quick benchmarks:

All versions were tested with an array of 2^24, 2^24+1, 2^30, or 2^30+1 Int32s. The two tests were a fold and a scan, both using const as a minimal combination function. Time is measured as the sum of the kernel execution times (P1+P2+P3 for scan, M1+M2 for fold) as reported by nvprof. Measurements done by hand (and thus only once), but the differences are big enough to confidently say it's not within the variation. This is reflected by the execution time of the "+1" versions very closely matching the round sample sizes (with one exception, where a shuffle-based fold sped up by adding this element).

fold Speed differences of 23.6%, 23.9%, 16.5% and 21.6% respectively.

scan Speed differences of 5.3%, 4.8%, 3.5% and 3.3% respectively.

Overall, folds seem to profit more from this change than scans did.

tmcdonell commented 3 years ago

Sorry this took so long for me to get to! >.<

dpvanbalen commented 3 years ago

Thanks for the thorough review, this looks very clean!