NVIDIA-developer-blog / code-samples

Source code examples from the Parallel Forall Blog
BSD 3-Clause "New" or "Revised" License

simpleTensorCoreGEMM has errors in output when compiled with CUDA10 for Turing GPUs #18

Open bhargavajs07 opened 5 years ago

bhargavajs07 commented 5 years ago

simpleTensorCoreGEMM produces output errors (beyond the additive tolerance of 1e-5 and the multiplicative tolerance of 1.01) when compiled with CUDA 10 for a Turing GPU (arch=sm_70, RTX 2080 Ti).

I did not modify any data types for the run, and both the WMMA-based explicit GEMM implementation and the cublasGemmEx call use the Tensor Cores.

I am wondering what might be causing the errors beyond the specified tolerance limits?
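For readers following along, the kind of tolerance check described above can be sketched in host-side C++ as below. This is only an illustration, not the sample's actual code: the helper names `mismatch` and `countMismatches` are hypothetical, the thresholds are the ones quoted in this report, and treating an element as an error when it fails either test is an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not taken from the sample): reports whether two values
// disagree. Assumption: an element counts as an error when it fails EITHER the
// multiplicative test (magnitudes differ by more than a factor of relTol) OR
// the additive test (absolute difference greater than absTol).
bool mismatch(float v1, float v2, double absTol = 1e-5, double relTol = 1.01) {
    const double a = std::fabs(static_cast<double>(v1));
    const double b = std::fabs(static_cast<double>(v2));
    // Written with multiplication instead of v1 / v2 to avoid dividing by zero.
    const bool ratioExceeded = (a > relTol * b) || (b > relTol * a);
    const bool diffExceeded =
        std::fabs(static_cast<double>(v1) - static_cast<double>(v2)) > absTol;
    return ratioExceeded || diffExceeded;
}

// Counts mismatching elements between two result buffers, for example the
// WMMA output and the cuBLAS output after both are copied back to the host.
std::size_t countMismatches(const std::vector<float>& x,
                            const std::vector<float>& y) {
    const std::size_t n = std::min(x.size(), y.size());
    std::size_t errors = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (mismatch(x[i], y[i])) {
            ++errors;
        }
    }
    return errors;
}
```

Note that with an additive threshold as small as 1e-5, large-magnitude results can be flagged even when their relative difference is tiny, so the count depends heavily on how the two tests are combined.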

hungweitseng commented 5 years ago

I also hit a similar issue. When I compile the code using nvcc from cuda-10, the resulting program generates outputs with errors larger than 1e-5. When I switched back to nvcc from cuda-9, the results seem to be correct, but the execution reports a #13 error for the cublasGemmEx function.

agschrei commented 4 years ago

While I can't speak to what's going on under the hood (I certainly didn't look at the generated PTX), I proposed a fix in #23. Turing is actually sm_75, and targeting that during compilation resolves the issue.
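As a small illustration of the sm_70 versus sm_75 point, the sketch below builds the architecture name from a compute capability. The helper names `smArchName` and `nvccArchFlag` are hypothetical. In a CUDA program the two numbers would typically come from the `major` and `minor` fields of `cudaDeviceProp`, as filled in by `cudaGetDeviceProperties`; here they are passed in directly so the snippet stays plain C++.

```cpp
#include <string>

// Hypothetical helper: formats a compute capability as an nvcc architecture
// name, e.g. (7, 5) -> "sm_75" for Turing and (7, 0) -> "sm_70" for Volta.
std::string smArchName(int major, int minor) {
    return "sm_" + std::to_string(major) + std::to_string(minor);
}

// Hypothetical helper: the matching -arch option to pass to nvcc.
std::string nvccArchFlag(int major, int minor) {
    return "-arch=" + smArchName(major, minor);
}
```

For the RTX 2080 Ti discussed in this thread, compute capability 7.5 gives `-arch=sm_75`, which is the target the fix above refers to.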