Use -arch=all-major or, for CUDA versions < 11.5 that lack this option, the equivalent nvcc options (but limit f16 code to architectures that support it). The added complexity of limiting fp16 code to newer architectures gave an insignificant build-time advantage, so it was removed.
This seems to fix the reported CUDA NaN issues.
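
As a rough illustration of what "the equivalent nvcc options" could look like on toolkits without -arch=all-major, the sketch below builds one -gencode flag per architecture (SASS) plus a PTX entry for the newest one. The helper name `gencode_flags` and the architecture list are hypothetical and not taken from this change; the list has to match what the installed nvcc actually supports.

```python
def gencode_flags(sm_archs):
    """Build nvcc -gencode flags: SASS for each arch, plus PTX for the newest."""
    archs = sorted(set(sm_archs))
    if not archs:
        raise ValueError("at least one architecture is required")
    # One real-architecture (SASS) target per listed compute capability.
    flags = [f"-gencode=arch=compute_{a},code=sm_{a}" for a in archs]
    # PTX for the newest architecture so newer GPUs can JIT-compile it.
    newest = archs[-1]
    flags.append(f"-gencode=arch=compute_{newest},code=compute_{newest}")
    return flags


# Assumed example list only; adjust to the architectures your toolkit supports.
EXAMPLE_ARCHS = [50, 60, 70, 80]
print(" ".join(gencode_flags(EXAMPLE_ARCHS)))
```

The resulting string could be passed to nvcc or appended to the build system's CUDA flags when the detected CUDA version is below 11.5.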