Are these failures reproducible?
And don't just dump error output in an issue; that's rude. Instead, add some details about your system (which version of Windows, how you installed Julia), the errors (e.g., are they reproducible, can you isolate them, ...), format your post, etc.
System: Windows 10
Julia version: 1.5.2
CUDA version: 11.1.1
NVIDIA driver: 457.9
I encountered a very similar problem, and the failures are consistent and repeatable.
System: Windows 10
CPU: Intel(R) Core(TM) i7
RAM: 64 GB
GPU: RTX 2060
Julia version: 1.5.3/1.5.2
```
Test Summary: |  Pass  Fail  Broken  Total
  Overall     | 10682     8       5  10695
  cublas      |  1914     6           1920
  exceptions  |    15     2             17
```
```
┌ Info: System information:
│ CUDA toolkit 11.1.1, artifact installation
│ CUDA driver 11.2.0
│ NVIDIA driver 460.20.0
│
│ Libraries:
│ - CUBLAS: 11.3.0
│ - CURAND: 10.2.2
│ - CUFFT: 10.3.0
│ - CUSOLVER: 11.0.1
│ - CUSPARSE: 11.3.0
│ - CUPTI: 14.0.0
│ - NVML: 11.0.0+460.20
│ - CUDNN: 8.0.4 (for CUDA 11.1.0)
│ - CUTENSOR: 1.2.1 (for CUDA 11.1.0)
│
│ Toolchain:
│ - Julia: 1.5.3
│ - LLVM: 9.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
│ - Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│
│ 1 device:
└ 0: GeForce RTX 2060 (sm_75, 5.052 GiB / 6.000 GiB available)
[ Info: Testing using 1 device(s): 1. GeForce RTX 2060
```
```
┌ Info: System information:
│ CUDA toolkit 11.1.1, artifact installation
│ CUDA driver 11.1.0
│ NVIDIA driver 457.9.0
│
│ Libraries:
│ - CUBLAS: 11.3.0
│ - CURAND: 10.2.2
│ - CUFFT: 10.3.0
│ - CUSOLVER: 11.0.1
│ - CUSPARSE: 11.3.0
│ - CUPTI: 14.0.0
│ - NVML: 11.0.0+457.9
│ - CUDNN: 8.0.4 (for CUDA 11.1.0)
│ - CUTENSOR: 1.2.1 (for CUDA 11.1.0)
│
│ Toolchain:
│ - Julia: 1.5.3
│ - LLVM: 9.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
│ - Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│
│ 1 device:
└ 0: Quadro P2000 (sm_61, 3.526 GiB / 4.000 GiB available)
[ Info: Testing using 1 device(s): 1. Quadro P2000 (UUID 9b0b39dd-2ad4-66d0-d456-01bb0741d565)
[ Info: Skipping the following tests: cutensor\base, cutensor\contractions, cutensor\elementwise_binary, cutensor\elementwise_trinary, cutensor\permutations, cutensor\reductions, device\wmma
```
system: Windows 10 Enterprise (20H2, 19042.630)
Visual Studio 2019: 16.8.2
NVIDIA Nsight Compute 2020.2.1
NVIDIA Nsight Visual Studio Edition 2020.2.1.20303
NVIDIA CUDA Runtime 11.1
NVIDIA graphics driver: 457.09
Problems persist after upgrading CUDA.jl to version 2.3.0.
```
Test Summary:    | Pass  Fail  Broken  Total
  Overall        | 8468    13       5   8486
  cublas         | 1911     9           1920
  exceptions     |   13     4             17
  nvtx           |                    No tests
  texture        |   38     4              42
  threading      |                    No tests
  cudadrv\memory |   49     1              50
FAILURE
```
(Note: please use triple backticks to denote code listings)
I haven't been able to reproduce any of these failures. So it would be useful if somebody who can reproduce this failure could reduce it to a single test, reduce it further, add some additional specifics, etc.
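One way to start reducing is to run only the failing suites. CUDA.jl's test runner forwards `test_args` to select suites (a sketch; the exact argument handling may differ between CUDA.jl versions, so check `test/runtests.jl` or its `--help` output):

```julia
using Pkg

# Run only the suites that fail above; test_args is forwarded to
# test/runtests.jl, which uses the names to select test files.
Pkg.test("CUDA"; test_args=["cublas", "exceptions"])
```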
Here are a couple of these I was able to reproduce by copying and modifying some of the code from the test file:
```julia
using Revise, CUDA, LinearAlgebra
using CUDA.CUBLAS
using CUDA.CUBLAS: band, bandex

m = 20
n = 35
k = 13
elty = Float32
alpha = rand(elty)
beta = rand(elty)
A = rand(elty, m, k)
B = rand(elty, k, n)
dA = CuArray(A)   # device copies (these definitions were missing from the snippet as posted)
dB = CuArray(B)

try
    C = alpha * (A \ B)
    dC = copy(dB)
    CUBLAS.xt_trsm!('L', 'U', 'N', 'N', alpha, dA, dC)
    # move to host and compare
    h_C = Array(dC)
    @assert C ≈ h_C
catch err
    @warn "xt_trsm! gpu failed!! error: $err"
end

try
    C = alpha * (A \ B)
    h_C = CUBLAS.xt_trsm('L', 'U', 'N', 'N', alpha, Array(dA), Array(dB))
    @assert C ≈ h_C
catch err
    @warn "xt_trsm cpu failed!! error: $err"
end
```
Output:
```
┌ Warning: xt_trsm! gpu failed!! error: DimensionMismatch("Both inputs should have the same number of rows")
└ @ Main C:\Users\Clinton\Dropbox\Projects\Tutorial\cudatest.jl:29
┌ Warning: xt_trsm cpu failed!! error: DimensionMismatch("Both inputs should have the same number of rows")
└ @ Main C:\Users\Clinton\Dropbox\Projects\Tutorial\cudatest.jl:37
```
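For context: `trsm` solves a triangular system `op(A) * X = alpha * B`, so `A` must be a square triangular matrix whose order matches the row count of `B`; the 20×13 `A` above cannot satisfy that, which is what the `DimensionMismatch` reports. The shapes the routine expects look like this (matching the corrected MWE further down):

```julia
using LinearAlgebra

A = triu(rand(Float32, 20, 20))  # square upper-triangular factor
B = rand(Float32, 20, 35)        # right-hand side with the same row count as A
```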
I get the same output:
```
┌ Warning: xt_trsm! gpu failed!! error: DimensionMismatch("Both inputs should have the same number of rows")
└ @ Main F:\Code\Julia_projects\testCuda.jl:23
┌ Warning: xt_trsm cpu failed!! error: DimensionMismatch("Both inputs should have the same number of rows")
└ @ Main F:\Code\Julia_projects\testCuda.jl:30
```
OK, so only cublasXt tests fail? We might be doing something legitimately wrong then, because I remember `cuda-memcheck` complaining about how we pin our host memory there, which might behave differently on Windows. That said, if you don't actively use those `xt_` functions, the failures are harmless.
> We might be doing something legitimately wrong then, because I remember `cuda-memcheck` complaining about how we pin our host memory there, which might behave differently on Windows.

I verified, and that doesn't hold true anymore with CUDA 11.1.
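For background: cublasXt operates on host arrays and pins (page-locks) them so the device can stream data in and out directly. A rough sketch of that registration pattern using CUDA.jl's low-level memory API (the exact `Mem.register` signature here is an assumption and may differ between CUDA.jl versions):

```julia
using CUDA

# Hypothetical sketch of host-memory pinning as used around the xt_ wrappers.
# Mem.register page-locks an existing host allocation (wrapping cuMemHostRegister);
# the exact API may differ across CUDA.jl versions.
A = rand(Float32, 1024, 1024)
buf = CUDA.Mem.register(CUDA.Mem.Host, pointer(A), sizeof(A))
try
    # ... call a cublasXt routine on A here ...
finally
    CUDA.Mem.unregister(buf)  # always release the pinned registration
end
```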
> OK, so only cublasXt tests fail? We might be doing something legitimately wrong then, because I remember `cuda-memcheck` complaining about how we pin our host memory there, which might behave differently on Windows. That said, if you don't actively use those `xt_` functions, the failures are harmless.
Good to know. Speaking for myself I don't use these, but happy to help test if a Windows box is needed.
> ┌ Warning: xt_trsm! gpu failed!! error: DimensionMismatch("Both inputs should have the same number of rows")

Looks like you invalidly reduced the test to a failure that isn't like what happened originally, though. Anyway, if you can still reproduce the original failure, could you try adding a call to `synchronize()` after every call to `CUBLAS.xt_*` in `test/cublas.jl` and see if that fixes the problems?
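Concretely, the suggested experiment would look something like this (a sketch mimicking the test file's trsm pattern; only the added `synchronize()` call is the point):

```julia
using Test, LinearAlgebra, CUDA
using CUDA.CUBLAS

# Same setup as the test file's trsm case, with a synchronize() added
# after the xt_ call and before results are read back on the host.
alpha = rand(Float32)
A = triu(rand(Float32, 20, 20))
B = rand(Float32, 20, 35)
dA, dB = CuArray(A), CuArray(B)

C = alpha * (A \ B)
dC = copy(dB)
CUBLAS.xt_trsm!('L', 'U', 'N', 'N', alpha, dA, dC)
synchronize()          # <- the added call: drain outstanding GPU work
h_C = Array(dC)
@test C ≈ h_C
```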
If you are saying it is a coincidence, that's fine, but the above code was pulled together by cross-referencing the OP's original stack-trace line numbers with the test failures on my system.
I checked using the above example and this didn't work. I'll try it on the master branch tests once I make the jump to 1.6.
> If you are saying it is a coincidence, that's fine, but the above code was pulled together by cross-referencing the OP's original stack-trace line numbers with the test failures on my system.
The tests are very stateful, so you probably copied the wrong definitions for some of the inputs: the tests there never throw a DimensionMismatch, nor does one appear in your original error report. So adding a synchronization point there is also not expected to do anything.
I've implemented the above suggestion here: https://github.com/JuliaGPU/CUDA.jl/pull/572
Do note this branch needs Julia#master, and the Windows nightlies are lagging, so you need an up-to-date build like https://julialangnightlies-s3.julialang.org/pretesting/winnt/x64/1.6/julia-377aa809eb-win64.exe. Also note the required GPUCompiler dependency isn't tagged yet, so you need to launch with `julia --project` from CUDA.jl's checkout.
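For anyone following along, the steps would look roughly like this (a sketch; the branch name and fetch refspec below are the usual GitHub conventions, not taken from the thread):

```julia
# From a shell, clone CUDA.jl and check out the PR branch, e.g.:
#   git clone https://github.com/JuliaGPU/CUDA.jl
#   cd CUDA.jl
#   git fetch origin pull/572/head:pr-572 && git checkout pr-572
# then start the nightly Julia build with `julia --project` and run:
using Pkg
Pkg.instantiate()   # pulls in the untagged GPUCompiler dependency
Pkg.test()          # runs the test suite of the active project (CUDA.jl)
```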
Ah I see what you mean. Here is hopefully a valid MWE:
```julia
using Revise, CUDA, LinearAlgebra, Random
using CUDA.CUBLAS
using CUDA.CUBLAS: band, bandex

Random.seed!(11)

function mwe()
    local m = 20
    local n = 35
    local k = 13
    elty = Float32
    local alpha = rand(elty)
    local beta = rand(elty)
    local A = triu(rand(elty, m, m))
    local B = rand(elty, m, n)
    local C = zeros(elty, m, n)
    local dA = CuArray(A)
    local dB = CuArray(B)
    local dC = CuArray(C)
    local failed = false
    try
        C = alpha * (A \ B)
        dC = copy(dB)
        CUBLAS.xt_trsm!('L', 'U', 'N', 'N', alpha, dA, dC)
        CUDA.synchronize()
        # move to host and compare
        h_C = Array(dC)
        @assert C ≈ h_C
    catch err
        @warn "xt_trsm! gpu failed!! error: $err"
        failed = true
    end
    try
        C = alpha * (A \ B)
        h_C = CUBLAS.xt_trsm('L', 'U', 'N', 'N', alpha, Array(dA), Array(dB))
        CUDA.synchronize()
        @assert C ≈ h_C
    catch err
        @warn "xt_trsm cpu failed!! error: $err"
        failed = true
    end
    return failed
end

for i ∈ 1:10^3
    if mwe()
        @info "Failed on iteration $i"
        break
    end
end
```
Output:
```
┌ Warning: xt_trsm! gpu failed!! error: AssertionError("C ≈ h_C")
└ @ Main C:\Users\Clinton\Dropbox\Projects\Tutorial\cudatest.jl:39
[ Info: Failed on iteration 7
```
I'll give the PR a shot.
If that snippet failed with the `synchronize` in there, the PR won't help. Still, I'd appreciate it if you could test it. Even with multiple iterations, I can't reproduce it (haven't tried on Windows, though).
Is there anything special about your setup? Are you using the driver in WDDM or TCC mode? Hardware-accelerated GPU scheduling? Any other special settings related to the GPU or CUDA? (A quick way to check the driver model is shown below.)
EDIT: ah, I can finally reproduce this on my Windows system by doing multiple iterations.
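For reference, the WDDM/TCC driver model can be queried with `nvidia-smi`; a quick check from Julia (assuming `nvidia-smi` is on your PATH and your driver supports this query field — verify against `nvidia-smi --help-query-gpu`):

```julia
# Print each GPU's name and its current driver model (WDDM vs. TCC).
run(`nvidia-smi --query-gpu=name,driver_model.current --format=csv`)
```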
Ah OK, I tried it and, as you say, it didn't work: basically the same results as the OP. Nothing particularly special about my system, just a Razer laptop with an RTX 2070 running in WDDM with a bunch of monitors hooked up.
Edit: Note I tried it on the original #572 (67dfd4a)
I just tried it on #577 and everything worked!