Open bhargavajs07 opened 5 years ago
I also hit a similar issue. When I compile the code with nvcc from cuda-10, the resulting program produces outputs with errors larger than 1e-5. When I switch back to nvcc from cuda-9, the results seem to be correct, but execution reports error #13 from the cublasGemmEx call.
While I can't speak to what's going on under the hood (I certainly didn't look at the generated PTX), I proposed a fix in #23: Turing is actually sm_75, and targeting that during compilation resolves the issue.
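To make the arch point concrete, here is a minimal host-side sketch of how a compute capability maps to the `-arch` value passed to nvcc. The helper name `arch_flag` is hypothetical and not part of the sample; in a real program the major/minor pair would typically come from the `major` and `minor` fields of `cudaDeviceProp` (filled in by `cudaGetDeviceProperties`), which is left out here so the snippet builds without the CUDA toolkit.

```cpp
#include <string>

// Hypothetical helper (not from the sample): build the nvcc -arch flag
// for a given compute capability, e.g. 7.5 for Turing or 7.0 for Volta.
std::string arch_flag(int major, int minor) {
    return "-arch=sm_" + std::to_string(major) + std::to_string(minor);
}
```

For an RTX 2080 Ti (compute capability 7.5) this gives `-arch=sm_75`, whereas the `sm_70` mentioned in the original report corresponds to Volta.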
simpleTensorCoreGEMM produces errors in its output (beyond the additive tolerance of 1e-5 and the multiplicative tolerance of 1.01) when compiled with CUDA 10 for a Turing GPU (arch=sm_70, RTX 2080 Ti).
I did not modify any datatypes for the run, and both the wmma-based explicit GEMM implementation and the cublasGemmEx call use the Tensor Cores.
I am wondering what might be causing the errors beyond the specified tolerance limits?
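For anyone following along, here is a rough sketch of the kind of check being described: comparing two result buffers element by element against an additive tolerance of 1e-5 and a multiplicative tolerance of 1.01. This is an assumption about how the two tolerances combine (an element counts as an error if it fails either one), not a copy of the sample's code, and the names `within_tolerance` and `count_mismatches` are made up for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: true when a and b agree within BOTH tolerances.
// abs_tol is the additive bound on |a - b|; rel_tol is the multiplicative
// bound on how much larger one magnitude may be than the other.
bool within_tolerance(float a, float b,
                      float abs_tol = 1e-5f, float rel_tol = 1.01f) {
    if (std::fabs(a - b) > abs_tol) return false;
    // Written without division so that zeros do not produce inf/NaN.
    if (std::fabs(a) > rel_tol * std::fabs(b)) return false;
    if (std::fabs(b) > rel_tol * std::fabs(a)) return false;
    return true;
}

// Hypothetical helper: count positions where the two buffers disagree.
// Only the overlapping prefix is compared if the sizes differ.
std::size_t count_mismatches(const std::vector<float>& x,
                             const std::vector<float>& y) {
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();
    std::size_t errors = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (!within_tolerance(x[i], y[i])) ++errors;
    }
    return errors;
}
```

In the sample's setting, `x` and `y` would be the host copies of the wmma result and the cublasGemmEx result. Note that under this reading the additive bound is independent of magnitude, so two large values that differ only slightly in relative terms are still counted as errors.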