diku-dk / futhark

:boom::computer::boom: A data-parallel functional programming language
http://futhark-lang.org
ISC License

Should we always round correctly on OpenCL? #2155

Closed: athas closed this issue 3 months ago

athas commented 5 months ago

The OpenCL spec allows implementations to round single-precision floating-point division and square root incorrectly. You have to pass -cl-fp32-correctly-rounded-divide-sqrt to the kernel compiler to make these operations correctly rounded.

Should we pass this option by default? From what I can determine, CUDA does the equivalent of passing this option by default, so it would harmonise behaviour between backends.
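For concreteness, this is an ordinary build-option string handed to the OpenCL kernel compiler. A minimal host-side sketch in plain C (not the code the Futhark compiler actually generates) might look like this:

```c
/* Sketch only: pass -cl-fp32-correctly-rounded-divide-sqrt when building
   an OpenCL program, but only if the device actually supports it. */
#include <CL/cl.h>
#include <stdio.h>

cl_int build_correctly_rounded(cl_program program, cl_device_id device) {
  cl_device_fp_config cfg = 0;
  clGetDeviceInfo(device, CL_DEVICE_SINGLE_FP_CONFIG, sizeof cfg, &cfg, NULL);

  /* The option is only legal on devices that report this capability. */
  const char *opts = (cfg & CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT)
                       ? "-cl-fp32-correctly-rounded-divide-sqrt"
                       : "";

  cl_int err = clBuildProgram(program, 1, &device, opts, NULL, NULL);
  if (err != CL_SUCCESS)
    fprintf(stderr, "clBuildProgram failed: %d\n", err);
  return err;
}
```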

laurentpayot commented 5 months ago

I’m surprised to learn you can get different results with different platforms. That’s an issue for me as I use a couple of different machines. So yes, personally, I’m for passing it by default, with an option to opt out.

FluxusMagna commented 5 months ago

Consistency is nice of course, but with parallel reductions and scans you can't ensure it anyway. What is the cost of this option in terms of performance? I think I'd prefer as much consistency as possible anyway, but if it significantly lowers performance it needs to be well documented so that users are aware.
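(The reason you can't ensure it: floating-point addition is not associative, so the grouping a parallel reduction happens to use changes the result. A tiny illustration in C, not Futhark code:)

```c
/* Float addition is not associative, so differently-grouped reductions
   over the same values can give different sums. */
#include <stdio.h>

int main(void) {
  float a = 1e8f, b = -1e8f, c = 1.0f;
  printf("(a + b) + c = %f\n", (a + b) + c); /* 1.000000: exact          */
  printf("a + (b + c) = %f\n", a + (b + c)); /* 0.000000: c gets absorbed */
  return 0;
}
```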

athas commented 5 months ago

The performance difference can be quite dramatic in some cases on some GPUs, if you have workloads that are dominated by square root calculations (e.g. for Mandelbrot fractals). But it's equivalent to just passing -ffast-math-like options, which I am a bit wary of.

laurentpayot commented 5 months ago

> workloads that are dominated by square root calculations (e.g. for Mandelbrot fractals)

Really? When I was in high school (last century) I was playing with the Mandelbrot set on my 8-bit computer. I remember being super happy to realize I could get rid of the costly square root operation simply by using a hard-coded squared value (the "bailout") on the other side of the condition. Correct me if I’m wrong...
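In other words, since |z| > 2 exactly when x² + y² > 4, the escape test never needs a square root. A minimal C sketch (hypothetical function names, not the code from the Futhark benchmark):

```c
#include <math.h>

/* Escape test for the Mandelbrot iteration z <- z*z + c. */
int escapes_naive(float x, float y) {
  return sqrtf(x * x + y * y) > 2.0f;   /* |z| > 2 */
}

/* Same test against the squared bailout: no sqrt at all. */
int escapes_squared(float x, float y) {
  return x * x + y * y > 4.0f;          /* |z|^2 > 4 */
}
```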

athas commented 5 months ago

Sure, I'm not saying that it's the best way to implement Mandelbrot fractals; I'm just using it as an example of a specific program in our benchmark suite that turned out to benefit from OpenCL's default of inaccurate-but-fast square roots on my MI100 GPU (but not on the A100). I can imagine other programs that are similarly square-root-heavy (nbody), but they also do plenty of other things, so the effect is not as pronounced.

If anyone is curious, I'm actually working on a systematic investigation of CUDA/OpenCL and HIP/OpenCL performance differences. I have these two graphs, where a higher number means that OpenCL is faster than CUDA or HIP, respectively:

[image: two benchmark graphs of relative runtime, OpenCL vs. CUDA and OpenCL vs. HIP]

There are many root causes of the differences, some of which were surprising. The behaviour discussed in this issue is one of the surprising ones, but it mostly affects benchmarks that I would say are not "realistic".

FluxusMagna commented 4 months ago

Those graphs are really interesting! I had no idea the differences would be so large. Most of the compiler optimisations should be the same, right? Aside from the fast scan, which IIRC was hard/impossible to implement in OpenCL, the backends should essentially be using 'equivalent' code for almost everything at the last human-readable step (if CUDA/HIP and OpenCL can be considered as such)? The differences also don't seem to correlate that well between the different platforms. I guess this type of data is very useful for finding potential missing optimisations.

athas commented 4 months ago

The generated code is essentially the same. The main difference is the choice of scan implementation, as well as a few minor details for certain histogram operators. I am certainly going to put in some work to try to minimise the differences, and one relatively easy case is covered by exactly this issue. (Unfortunately, here the solution is to make OpenCL slower by default, but such is life.)