eyalroz / cuda-kat

CUDA kernel author's tools
BSD 3-Clause "New" or "Revised" License

Rearrange math builtins #32

Open eyalroz opened 4 years ago

eyalroz commented 4 years ago

"Built-ins" in cuda-kat means those functions which translate into single PTX instructions (not necessarily single SASS instructions though!)

We have on_device/builtins.cuh, and also on_device/non-builtins.cuh. The latter contains functions which are builtin-like, or which one might expect to be built-in but aren't; each of them has a pretty tight implementation - one or two lines - which calls a builtin.

There's a problem, though: certain functions only translate into a single PTX instruction when --use_fast_math is specified as a compiler switch. Example: cosf(), from the CUDA math function header. With --use_fast_math, it yields something like:

        cos.approx.ftz.f32      %f2, %f1;

but if you remove the switch, you get a sequence of over 150 (!!) PTX instructions, including several loops, for computing the cosine. See this on GodBolt.
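To make the difference easy to reproduce locally, here is a minimal sketch (the kernel names and the file name are made up for illustration, they are not part of cuda-kat). It puts cosf() next to the __cosf() intrinsic, which requests the approximate instruction explicitly and so should not depend on the switch:

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// cosf(): expected to become a single cos.approx instruction only
// when compiling with --use_fast_math.
__global__ void cos_standard(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { out[i] = cosf(in[i]); }
}

// __cosf(): the fast intrinsic; asks for the approximate instruction
// regardless of --use_fast_math.
__global__ void cos_intrinsic(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { out[i] = __cosf(in[i]); }
}

int main()
{
    const int n = 256;
    float *in = nullptr, *std_out = nullptr, *fast_out = nullptr;

    // Managed memory keeps the host code short; it assumes a device
    // with unified memory support.
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&std_out, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&fast_out, n * sizeof(float)) != cudaSuccess) {
        std::printf("allocation failed\n");
        return 1;
    }

    for (int i = 0; i < n; ++i) { in[i] = 0.01f * i; }

    cos_standard<<<1, n>>>(in, std_out, n);
    cos_intrinsic<<<1, n>>>(in, fast_out, n);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        std::printf("kernel execution failed\n");
        return 1;
    }

    // Report how far the two variants drift apart on this input range.
    float max_diff = 0.0f;
    for (int i = 0; i < n; ++i) {
        max_diff = std::fmax(max_diff, std::fabs(std_out[i] - fast_out[i]));
    }
    std::printf("max |cosf - __cosf| = %g\n", max_diff);

    cudaFree(in);
    cudaFree(std_out);
    cudaFree(fast_out);
    return 0;
}
```

Assuming this is saved as cos_example.cu, comparing the output of `nvcc -ptx cos_example.cu` with that of `nvcc -ptx --use_fast_math cos_example.cu` should show cos_intrinsic staying a single cos.approx instruction in both, while cos_standard only collapses to one with the switch.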

So, cosf() and other such functions are sometimes builtins and sometimes not. Where should we put them, then?

Idea:

PS:

--use_fast_math is not even the only such switch: there are also --prec-div (for precise division), --prec-sqrt (for precise square root) and --fmad (for enabling floating-point fused multiply-add instructions). I should also note that not all functions which depend on these four switches are actually covered by cuda-kat right now, although that's a different issue.
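For reference, CUDA also offers intrinsics with explicit rounding or approximation semantics, which are meant to behave the same way whatever these switches say. The following is only a sketch of how switch-independent wrappers might look; the namespace and function names are hypothetical and not part of cuda-kat, and the instructions named in the comments are what I would expect in the PTX, not something verified here (an .ftz modifier may still appear depending on --ftz):

```cuda
#include <cuda_runtime.h>

namespace sketch { // hypothetical namespace, for illustration only

// Approximate division, independent of --prec-div (expected: div.approx.f32).
__device__ inline float divide_approx(float x, float y) { return __fdividef(x, y); }

// Round-to-nearest division, independent of --prec-div (expected: div.rn.f32).
__device__ inline float divide_rn(float x, float y) { return __fdiv_rn(x, y); }

// Round-to-nearest square root, independent of --prec-sqrt (expected: sqrt.rn.f32).
__device__ inline float sqrt_rn(float x) { return __fsqrt_rn(x); }

// Explicit fused multiply-add, independent of --fmad (expected: fma.rn.f32).
__device__ inline float fma_rn(float x, float y, float z) { return __fmaf_rn(x, y, z); }

// Multiplication which the compiler does not contract into an FMA.
__device__ inline float mul_rn(float x, float y) { return __fmul_rn(x, y); }

} // namespace sketch

// A small kernel that uses every wrapper, so that they all show up in the PTX.
__global__ void demo(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) { return; }
    float q_fast = sketch::divide_approx(a[i], b[i]);
    float q_rn   = sketch::divide_rn(a[i], b[i]);
    float root   = sketch::sqrt_rn(a[i]);
    float prod   = sketch::mul_rn(q_fast, q_rn);
    out[i] = sketch::fma_rn(prod, root, b[i]);
}
```

This file has no main(); it is intended to be compiled with `nvcc -ptx`, with and without the switches above, to check whether the generated instructions stay the same.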