stephenswat opened this issue 1 year ago
Bull#$^... :confused: Double- and triple-check that we are not mistakenly providing double
inputs to those trigonometric functions. I very much suspect that we are.
If anything, we may want to switch to using `std::sinf` and friends.
But in the end we shouldn't be using any of those. We'll need to make all of them use the trigonometric functions from algebra-plugins. sin/cos is not there yet, but there are, for instance, a number of places in our code where `std::sqrt` is used instead of `algebra::math::sqrt`.
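To illustrate the intended pattern, here is a minimal sketch; the wrapper shown is a hypothetical stand-in, only the `algebra::math::sqrt` name is taken from the comment above:

```cuda
#include <cmath>

namespace algebra::math {
// Hypothetical stand-in for the real algebra-plugins wrapper: on host it
// forwards to std::, but a device backend could forward to an intrinsic.
template <typename scalar_t>
inline scalar_t sqrt(scalar_t x) {
    return std::sqrt(x);
}
}  // namespace algebra::math

template <typename scalar_t>
scalar_t perp(scalar_t x, scalar_t y) {
    // Route through the wrapper instead of calling std::sqrt directly, so
    // the implementation stays under the control of the math layer.
    return algebra::math::sqrt(x * x + y * y);
}
```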
I invite you to compile the following extremely trivial CUDA code and inspect the PTX:
```cuda
#include <cmath>

__global__ void sins(float * f) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // std::sin on a float resolves to the CUDA math library's sinf.
    f[tid] = std::sin(f[tid]);
}
```
```sh
nvcc -c --keep test.cu && cat test.ptx
```

You should find 64-bit (`.f64`) floating point instructions in the output.
Then please consider the following lines:
https://github.com/acts-project/detray/blob/main/core/include/detray/tracks/detail/track_helper.hpp#L89
https://github.com/acts-project/detray/blob/main/core/include/detray/definitions/math.hpp#L24
That should clear up what's happening.
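For context, the pattern at the second link is (paraphrased from memory; an assumption about the source, not a verbatim quote) essentially a namespace alias that forwards everything to the standard library, which is why the track helpers end up calling `std::sin`/`std::cos` in device code:

```cuda
// Paraphrased sketch of detray/definitions/math.hpp (assumption, not the
// exact source): the math namespace simply aliases std, so device code
// picks up the std:: overloads and their FP64 slow paths.
namespace detray {
namespace math_ns = std;
}  // namespace detray
```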
Looks like it's detray and not algebra plugins, but potato/potato.
Pinging @niermann999 @beomki-yeo.
Something else to mention: this effect goes away when using `--use_fast_math`. We'll need to check whether pure 32-bit trigonometry provides us with the precision we need. If it does, we should consider whether we want to switch to using the intrinsic trig functions explicitly, or whether we want to enable non-compliant math globally.
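A quick way to check, using the same toy example as above (the grep is only a heuristic for FP64 usage):

```sh
nvcc -c --keep --use_fast_math test.cu && grep -c '\.f64' test.ptx
```

With fast math enabled the count should drop to zero, since `std::sin` is then lowered to the `sin.approx.f32` fast path.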
Okay, so after looking into this a bit more, the use of double precision in the single-precision trigonometry functions is a relatively uncommon branch to cover subnormal floating point numbers. The difference in performance between `std::sin` and `__sinf` is small but certainly present; taking the sine of one billion floating point numbers on an A6000 is approximately 3% faster using the intrinsic. However, this is in a very simple kernel. Interestingly, the implementation using the standard library may have significantly higher register pressure. I cannot currently say anything about the liveness of those registers, but this effect could potentially impact the performance of trigonometry functions in code that is already bound by register pressure. According to the PTX standard, the sine and cosine approximation instructions have a maximum absolute error of 2^−20.9, i.e. roughly 5.1 × 10^−7.
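For reference, a micro-benchmark along these lines would look roughly like the sketch below (not the exact code used for the A6000 numbers above; the array size is reduced so it fits comfortably in memory):

```cuda
#include <cmath>
#include <cstdio>

// One kernel per implementation, so their PTX/SASS can also be compared.
__global__ void sin_std(float* f, unsigned int n) {
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) f[tid] = std::sin(f[tid]);  // library sinf (has FP64 branch)
}

__global__ void sin_intrinsic(float* f, unsigned int n) {
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) f[tid] = __sinf(f[tid]);    // sin.approx.f32, FP32 only
}

int main() {
    const unsigned int n = 1u << 28;  // ~268M floats (~1 GiB)
    float* f = nullptr;
    cudaMalloc(&f, n * sizeof(float));
    cudaMemset(f, 0, n * sizeof(float));  // deterministic (all-zero) input

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    const unsigned int threads = 256;
    const unsigned int blocks = (n + threads - 1) / threads;

    // Time the standard library version.
    cudaEventRecord(start);
    sin_std<<<blocks, threads>>>(f, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms_std = 0.f;
    cudaEventElapsedTime(&ms_std, start, stop);

    // Time the intrinsic version.
    cudaEventRecord(start);
    sin_intrinsic<<<blocks, threads>>>(f, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms_intr = 0.f;
    cudaEventElapsedTime(&ms_intr, start, stop);

    std::printf("std::sin: %.2f ms, __sinf: %.2f ms\n", ms_std, ms_intr);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(f);
    return 0;
}
```

The register pressure of the two kernels can be compared by compiling with `nvcc --ptxas-options=-v`.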
The way in which we proceed here should depend on our attitude towards the use of double-precision floating point numbers and our willingness to sacrifice performance for convenience. If we wish to completely eliminate double-precision floating point numbers, we should exclusively use the approximation intrinsics. This would also be preferable for performance, but it would involve some work to incorporate into detray and algebra-plugins. The alternative would be to enable fast math, but that may have other unintended side-effects.
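One possible shape for that work is sketched below; the names and the compile-time switch are hypothetical, not an existing detray or algebra-plugins API:

```cuda
#include <cmath>

namespace math {

// Hypothetical wrapper: device code can opt into the approximation
// intrinsic, while host code (and builds that need IEEE accuracy) keep
// the standard library implementation.
__host__ __device__ inline float sin(float x) {
#if defined(__CUDA_ARCH__) && defined(TRACCC_ALLOW_APPROX_TRIG)
    return __sinf(x);    // lowers to sin.approx.f32, no FP64 instructions
#else
    return std::sin(x);  // accurate library implementation
#endif
}

}  // namespace math
```

Call sites would then use `math::sin` instead of `std::sin`, and the precision/performance trade-off becomes a single compile-time switch.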
@krasznaa has recently been on a crusade to make traccc work with his non-FP64-compatible GPU (see e.g. #333 and #335). Instead of hunting these errors down manually, we can do this automatically (see #336).

However, the way we have decided to program traccc and its dependencies (in particular detray) will make it difficult to completely eliminate the slow 64-bit instructions. Consider the PTX generated in `fitting_algorithm.ptx`: it is not hard to identify that the 64-bit floating point instructions are being generated as a result of the use of `std::sin`. There is a similar case with the use of `std::cos`. The canonical way of implementing this in CUDA, if single precision does indeed provide sufficient precision, is to use the `__sinf` compiler intrinsic. Currently, we don't really have a way of controlling the implementation that is used, as this is abstracted away behind detray and algebra-plugins.
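In the meantime, the offending instructions can at least be located mechanically; a rough heuristic (the actual check proposed in #336 may work differently):

```sh
# List every line of the generated PTX that touches a 64-bit FP type.
grep -n '\.f64' fitting_algorithm.ptx
```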