Open maleadt opened 4 years ago
There's also https://github.com/JuliaGPU/CUDA.jl/blob/master/examples/peakflops.jl, which I tried a couple of years ago. I deemed it a good first issue because it's not terribly complicated from the CUDA.jl/Julia side, but it does need some figuring out to come up with a good implementation, I guess.
I'm currently attempting something like this in a package (we could add this feature to CUDA.jl afterwards). However, I wonder how you would implement this to reliably achieve peak performance? (In particular, given that you marked this as "good first issue".)
Currently I'm taking a very straightforward approach:
But this "only" gives
on a A100 SXM4 40GB, which (I think) has to be compared to 156 TFLOPS that NVIDIA seems to report as peak performance. (Note that this is with
CUDA.math_mode!(CUDA.FAST_MATH)
.)