acts-project / traccc

Demonstrator tracking chain on accelerators
Mozilla Public License 2.0

Use of `std::sin` and `std::cos` in device code generates unwanted FP64 instructions #337

Open stephenswat opened 1 year ago

stephenswat commented 1 year ago

@krasznaa has recently been on a crusade to make traccc work with his non-FP64-compatible GPU (see e.g. #333 and #335). Instead of hunting these errors down manually, we can do this automatically (see #336). However, the way we have decided to program traccc and its dependencies (in particular detray) will make it difficult to completely eliminate the slow 64-bit instructions. Consider the following source code that is generated in fitting_algorithm.ptx:

.func  (.param .b32 func_retval0) cosf(
    .param .b32 cosf_param_0
)
{
    .local .align 4 .b8     __local_depot1092[28];
    .reg .b64   %SP;
    .reg .b64   %SPL;
    .reg .pred  %p<22>;
    .reg .f32   %f<44>;
    .reg .b32   %r<69>;
    .reg .f64   %fd<3>;
    .reg .b64   %rd<28>;

    mov.u64     %SPL, __local_depot1092;
    cvta.local.u64  %SP, %SPL;
    ld.param.f32    %f16, [cosf_param_0];
    bra.uni     $L__BB1092_1;

   ...

$L__BB1092_13:
    mov.u32     %r28, %r66;
    mov.u32     %r27, %r65;
    mov.u32     %r26, %r64;
    cvt.u64.u32     %rd15, %r27;
    shl.b64     %rd16, %rd15, 32;
    cvt.u64.u32     %rd17, %r28;
    or.b64      %rd18, %rd16, %rd17;
    cvt.rn.f64.s64  %fd1, %rd18;
    mul.f64     %fd2, %fd1, 0d3BF921FB54442D19;
    cvt.rn.f32.f64  %f3, %fd2;
    setp.ne.s32     %p13, %r26, 0;
    not.pred    %p14, %p13;
    mov.f32     %f38, %f3;
    @%p14 bra   $L__BB1092_15;
    bra.uni     $L__BB1092_14;

It is not hard to identify that the 64-bit floating point instructions are being generated as a result of the use of std::sin. There is a similar case with the use of std::cos. The canonical way of implementing this in CUDA, if single-precision does indeed provide sufficient precision, is to use the __sinf compiler intrinsic. Currently, we don't really have a way of controlling the implementation that is used, as this is abstracted away behind detray and algebra-plugins.
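One possible shape for such control (a hypothetical sketch, not the actual detray/algebra-plugins API): route trigonometry through a wrapper that picks the fast intrinsic only in the device compilation pass, so host builds and precision-sensitive targets keep the standard function. The shims at the top let the same file compile as plain host C++:

```cpp
#include <cmath>

// Portability shims so this sketch also compiles as plain host C++;
// nvcc defines __CUDACC__ and supplies the real attributes.
#if !defined(__CUDACC__)
#define __host__
#define __device__
#endif

namespace fastmath {

// Hypothetical wrapper: __CUDA_ARCH__ is defined only during the device
// compilation pass, so host code falls through to the std::sin float
// overload and never touches FP64.
__host__ __device__ inline float sin(float x) {
#if defined(__CUDA_ARCH__)
    return __sinf(x);    // approximate intrinsic, max abs error 2^-20.9
#else
    return std::sin(x);  // float overload on the host
#endif
}

}  // namespace fastmath
```

With something like this in place, the choice of implementation becomes a single auditable point instead of being scattered across the call sites.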

krasznaa commented 1 year ago

Bull#$^... :confused: Double- and triple-check that we are not mistakenly providing double inputs to those trigonometric functions. I very much suspect that we are.

If anything, we may want to switch to using std::sinf and friends.

But in the end we shouldn't be using any of those. We'll need to make all of them use the trigonometric functions from:

https://github.com/acts-project/algebra-plugins/blob/main/math/common/include/algebra/math/common.hpp

sin/cos are not there yet, but there are, for instance, still a number of places in our code where std::sqrt is used instead of algebra::math::sqrt.
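The sqrt case can be sketched like this (a hypothetical illustration of the algebra::math idea, not the actual contents of common.hpp): a single type-generic entry point that forwards to the precision-matching overload, so a float argument is never silently promoted through the double version.

```cpp
#include <cmath>

namespace algebra { namespace math {

// Hypothetical sketch: forward to std::sqrt, whose float overload is
// selected for float arguments, so no double-precision path is taken
// and the backend can later swap in a different implementation.
template <typename scalar_t>
inline scalar_t sqrt(scalar_t x) {
    return std::sqrt(x);
}

}}  // namespace algebra::math
```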

stephenswat commented 1 year ago

I invite you to compile the following extremely trivial CUDA code and inspect the PTX:

#include <cmath>

__global__ void sins(float * f) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    f[tid] = std::sin(f[tid]);
}

Run nvcc -c --keep test.cu && cat test.ptx and you should find those 64-bit floating point instructions.

Then please consider the following lines:

https://github.com/acts-project/detray/blob/main/core/include/detray/tracks/detail/track_helper.hpp#L89 https://github.com/acts-project/detray/blob/main/core/include/detray/definitions/math.hpp#L24

That should clear up what's happening.

Looks like it's detray and not algebra plugins, but potato/potato.

stephenswat commented 1 year ago

Pinging @niermann999 @beomki-yeo.

stephenswat commented 1 year ago

Something else to mention: this effect goes away when using --use_fast_math. We'll need to check whether pure 32-bit trigonometry provides the precision we need. If it does, we should consider whether we want to switch to the intrinsic trig functions or whether we want to enable non-compliant math.

stephenswat commented 1 year ago

Okay, so after looking into this a bit more: the use of double precision in the single-precision trigonometry functions comes from a relatively uncommon branch that covers subnormal floating point numbers. The difference in performance between std::sin and __sinf is small but certainly present; taking the sine of one billion floating point numbers on an A6000 is approximately 3% faster using the intrinsic. However, this is in a very simple kernel. Interestingly, the standard library implementation may have significantly higher register pressure. I cannot currently say anything about the liveness of those registers, but this effect could potentially impact the performance of trigonometry functions in code that is already bound by register pressure. According to the PTX standard, the approximate sine and cosine have a maximum absolute error of 2^-20.9, i.e. about 5.1 × 10^-7.

The way in which we proceed here should depend on our attitude towards double-precision floating point numbers and our willingness to sacrifice performance for convenience. If we wish to eliminate double-precision instructions completely, we should use the approximation intrinsics exclusively. That would also be preferable for performance, but it would involve some work to incorporate into detray and algebra-plugins. The alternative would be to enable fast math, but that may have other unintended side effects.