Closed dpvanbalen closed 3 years ago
Some quick benchmarks:
All versions were tested with arrays of 2^24, 2^24+1, 2^30, and 2^30+1 Int32s. The two tests were a fold and a scan, both using const as a minimal combination function. Time is measured as the sum of the kernel execution times (P1+P2+P3 for scan, M1+M2 for fold) as reported by nvprof. Measurements were done by hand (and thus only once), but the differences are big enough to confidently say they are not within the variation. This is reflected by the execution times of the "+1" versions very closely matching those of the round sample sizes (with one exception, where a shuffle-based fold sped up by adding this element).
fold
Speed differences of 23.6%, 23.9%, 16.5% and 21.6% respectively.
scan
Speed differences of 5.3%, 4.8%, 3.5% and 3.3% respectively.
Overall, folds seem to profit more from this change than scans do.
Sorry this took so long for me to get to! >.<
Thanks for the thorough review, this looks very clean!
Description
Motivation and context
Shuffle instructions are by now available on most GPUs and should be significantly faster than the equivalent shared-memory solutions.
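For context, this is a hedged sketch of the general technique (hand-written CUDA, not the code this PR generates): a warp-level fold that exchanges values through registers via `__shfl_down_sync` instead of staging them in shared memory.

```cuda
// Sketch only: illustrates shuffle-based warp reduction, not this PR's
// generated code. Each warp folds its 32 values without shared memory.
__device__ int warpFold(int x)
{
    // Tree reduction: log2(32) = 5 shuffle steps. Each step, every lane
    // reads the value held by the lane `offset` positions above it.
    for (int offset = 16; offset > 0; offset >>= 1)
        x += __shfl_down_sync(0xffffffff, x, offset);
    return x;  // lane 0 now holds the warp's combined result
}
```

The shared-memory version needs a staging buffer plus synchronization between steps, which is where the measured speedup comes from.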
How has this been tested?
Nofib passes with both the shfl_sync and the smem versions. Testing the shfl (without _sync) version is more difficult, as it requires an old GPU (I think; maybe just an old version of CUDA).
Types of changes
Checklist:
TODO:
Check whether the width computations are needed, or whether some magic constant suffices for all warp sizes. Note that any CUDA implementation, at least when compiled via clang, also does these computations on every shuffle invocation.
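To make the TODO concrete, this is my understanding of the intrinsic involved (a sketch of the CUDA API, not code from this patch): the shuffle intrinsics take an optional `width` argument, and supporting sub-warp widths is what those extra computations pay for.

```cuda
// Sketch: the width parameter of the CUDA shuffle intrinsics.
// width == warpSize (the default) shuffles across the whole warp;
// a smaller power of two splits the warp into independent segments.
__device__ int segmentedShflDown(int x, int offset, int width)
{
    // When width is the compile-time constant 32, the segment
    // bookkeeping folds into a single magic constant; for dynamic
    // widths it must be recomputed on every shuffle, as noted above.
    return __shfl_down_sync(0xffffffff, x, offset, width);
}
```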