JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Recurrence of integer overflow bug (#1880) for a large matrix #2427

Closed: rexyang624 closed this issue 4 days ago

rexyang624 commented 5 days ago

I originally encountered this issue in an eigenvalue problem:

using LinearAlgebra, CUDA
a = CUDA.rand(Float64, 10000, 10000)
b = a + a'
eigen(b)

where I got the error:

InexactError: trunc(Int32, 2410452488)
Stacktrace:
  [1] checked_trunc_sint
    @ ./boot.jl:656 [inlined]
  [2] toInt32
    @ ./boot.jl:693 [inlined]
  [3] Int32
    @ ./boot.jl:783 [inlined]
  [4] convert
    @ ./number.jl:7 [inlined]
  [5] cconvert
    @ ./essentials.jl:543 [inlined]
  [6] macro expansion
    @ ~/.julia/packages/CUDA/75aiI/lib/utils/call.jl:226 [inlined]
  [7] macro expansion
    @ ~/.julia/packages/CUDA/75aiI/lib/cusolver/libcusolver.jl:3054 [inlined]
  [8] #508
    @ ~/.julia/packages/CUDA/75aiI/lib/utils/call.jl:35 [inlined]
  [9] retry_reclaim
    @ ~/.julia/packages/CUDA/75aiI/src/memory.jl:434 [inlined]
 [10] check
    @ ~/.julia/packages/CUDA/75aiI/lib/cusolver/libcusolver.jl:24 [inlined]
 [11] cusolverDnDsyevd
    @ ~/.julia/packages/CUDA/75aiI/lib/utils/call.jl:34 [inlined]
 [12] (::CUDA.CUSOLVER.var"#1365#1367"{Char, Char, CuArray{Float64, 2, CUDA.DeviceMemory}, CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{Float64, 1, CUDA.DeviceMemory}, Int64, Int64})(buffer::CuArray{UInt8, 1, CUDA.DeviceMemory})
    @ CUDA.CUSOLVER ~/.julia/packages/CUDA/75aiI/lib/cusolver/dense.jl:640
 [13] with_workspaces(f::CUDA.CUSOLVER.var"#1365#1367"{Char, Char, CuArray{Float64, 2, CUDA.DeviceMemory}, CuArray{Int32, 1, CUDA.DeviceMemory}, CuArray{Float64, 1, CUDA.DeviceMemory}, Int64, Int64}, cache_gpu::Nothing, cache_cpu::Nothing, size_gpu::CUDA.CUSOLVER.var"#bufferSize#1366"{Char, Char, CuArray{Float64, 2, CUDA.DeviceMemory}, CuArray{Float64, 1, CUDA.DeviceMemory}, Int64, Int64}, size_cpu::Int64)
    @ CUDA.APIUtils ~/.julia/packages/CUDA/75aiI/lib/utils/call.jl:131
 [14] with_workspace
    @ ~/.julia/packages/CUDA/75aiI/lib/utils/call.jl:67 [inlined]
 [15] syevd!(jobz::Char, uplo::Char, A::CuArray{Float64, 2, CUDA.DeviceMemory})
    @ CUDA.CUSOLVER ~/.julia/packages/CUDA/75aiI/lib/cusolver/dense.jl:639
 [16] eigen(A::CuArray{Float64, 2, CUDA.DeviceMemory})
    @ CUDA.CUSOLVER ~/.julia/packages/CUDA/75aiI/lib/cusolver/linalg.jl:129
 [17] top-level scope
    @ REPL[12]:1

I ran a similar calculation successfully half a year ago on the same machine, but with an older version of CUDA.jl.

Later I found out it is the same issue as #1880, where I used the exact same code:

using LinearAlgebra, CUDA
W = CUDA.rand(20000, 20000)
svd(W)

to reproduce a similar (though not identical) error message:

InexactError: trunc(Int32, 3211456640)
Stacktrace:
  [1] checked_trunc_sint
    @ ./boot.jl:656 [inlined]
  [2] toInt32
    @ ./boot.jl:693 [inlined]
  [3] Int32
    @ ./boot.jl:783 [inlined]
  [4] convert
    @ ./number.jl:7 [inlined]
  [5] cconvert
    @ ./essentials.jl:543 [inlined]
  [6] macro expansion
    @ ~/.julia/packages/CUDA/75aiI/lib/utils/call.jl:226 [inlined]
  [7] macro expansion
    @ ~/.julia/packages/CUDA/75aiI/lib/cusolver/libcusolver.jl:4040 [inlined]
  [8] #662
    @ ~/.julia/packages/CUDA/75aiI/lib/utils/call.jl:35 [inlined]
  [9] retry_reclaim
    @ ~/.julia/packages/CUDA/75aiI/src/memory.jl:434 [inlined]
 [10] check
    @ ~/.julia/packages/CUDA/75aiI/lib/cusolver/libcusolver.jl:24 [inlined]
 [11] cusolverDnSgesvdj
    @ ~/.julia/packages/CUDA/75aiI/lib/utils/call.jl:34 [inlined]
 [12] (::CUDA.CUSOLVER.var"#1315#1317"{Char, Int64, CuArray{Float32, 2, CUDA.DeviceMemory}, CuArray{Int32, 1, CUDA.DeviceMemory}, Base.RefValue{Ptr{CUDA.CUSOLVER.gesvdjInfo}}, Int64, CuArray{Float32, 2, CUDA.DeviceMemory}, CuArray{Float32, 1, CUDA.DeviceMemory}, Int64, CuArray{Float32, 2, CUDA.DeviceMemory}, Int64, Int64, Int64})(work::CuArray{UInt8, 1, CUDA.DeviceMemory})
    @ CUDA.CUSOLVER ~/.julia/packages/CUDA/75aiI/lib/cusolver/dense.jl:490
 [13] with_workspaces(f::CUDA.CUSOLVER.var"#1315#1317"{Char, Int64, CuArray{Float32, 2, CUDA.DeviceMemory}, CuArray{Int32, 1, CUDA.DeviceMemory}, Base.RefValue{Ptr{CUDA.CUSOLVER.gesvdjInfo}}, Int64, CuArray{Float32, 2, CUDA.DeviceMemory}, CuArray{Float32, 1, CUDA.DeviceMemory}, Int64, CuArray{Float32, 2, CUDA.DeviceMemory}, Int64, Int64, Int64}, cache_gpu::Nothing, cache_cpu::Nothing, size_gpu::CUDA.CUSOLVER.var"#bufferSize#1316"{Char, Int64, CuArray{Float32, 2, CUDA.DeviceMemory}, Base.RefValue{Ptr{CUDA.CUSOLVER.gesvdjInfo}}, Int64, CuArray{Float32, 2, CUDA.DeviceMemory}, CuArray{Float32, 1, CUDA.DeviceMemory}, Int64, CuArray{Float32, 2, CUDA.DeviceMemory}, Int64, Int64, Int64}, size_cpu::Int64)
    @ CUDA.APIUtils ~/.julia/packages/CUDA/75aiI/lib/utils/call.jl:131
 [14] with_workspace
    @ ~/.julia/packages/CUDA/75aiI/lib/utils/call.jl:67 [inlined]
 [15] gesvdj!(jobz::Char, econ::Int64, A::CuArray{Float32, 2, CUDA.DeviceMemory}; tol::Float32, max_sweeps::Int64)
    @ CUDA.CUSOLVER ~/.julia/packages/CUDA/75aiI/lib/cusolver/dense.jl:489
 [16] gesvdj!
    @ ~/.julia/packages/CUDA/75aiI/lib/cusolver/dense.jl:450 [inlined]
 [17] _svd!
    @ ~/.julia/packages/CUDA/75aiI/lib/cusolver/linalg.jl:262 [inlined]
 [18] svd(A::CuArray{Float32, 2, CUDA.DeviceMemory}; full::Bool, alg::CUDA.CUSOLVER.JacobiAlgorithm)
    @ CUDA.CUSOLVER ~/.julia/packages/CUDA/75aiI/lib/cusolver/linalg.jl:252
 [19] top-level scope
    @ REPL[5]:1

I noticed that issue was fixed by changing sizeof(work) to length(work) in dense.jl, but many changes have been made since then, and I'm not sure why the issue has reappeared.
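
A side note on that distinction (my own toy illustration, not code from the package): for a byte-typed workspace the two agree, but sizeof scales with the element size, so an element count that fits in Int32 can become a byte count that does not.

using CUDA

buf8  = CuArray{UInt8}(undef, 16)    # raw byte workspace
buf64 = CuArray{Float64}(undef, 16)  # typed workspace

length(buf8), sizeof(buf8)     # (16, 16): identical for byte buffers
length(buf64), sizeof(buf64)   # (16, 128): sizeof counts bytes, 8x the length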

I also checked that everything is up to date:

CUDA runtime 12.5, artifact installation
CUDA driver 12.4
NVIDIA driver 550.90.7

CUDA libraries: 
- CUBLAS: 12.5.2
- CURAND: 10.3.6
- CUFFT: 11.2.3
- CUSOLVER: 11.6.2
- CUSPARSE: 12.4.1
- CUPTI: 23.0.0
- NVML: 12.0.0+550.90.7

Julia packages: 
- CUDA: 5.4.2
- CUDA_Driver_jll: 0.9.0+0
- CUDA_Runtime_jll: 0.14.0+1

Toolchain:
- Julia: 1.10.4
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 4090 (sm_89, 21.996 GiB / 23.988 GiB available)
rexyang624 commented 5 days ago

I downgraded CUDA from v5.4.2 to v5.2.0 (which I believe predates the update to lib/cusolver/dense.jl), and indeed the calculations above run smoothly.
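
For anyone who needs the same workaround, the downgrade is a standard version pin (a sketch using the stock Pkg API):

using Pkg
Pkg.add(name="CUDA", version="5.2.0")  # install the pre-regression release
Pkg.pin("CUDA")                        # keep it from being upgraded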

rexyang624 commented 5 days ago

It looks like this is a duplicate of #2413, associated with the implementation of the new 64-bit API, but I also dug into the changes made to dense.jl between v5.2.0 and v5.3.0:

function bufferSize()
    out = Ref{Cint}(0)
    $bname(dense_handle(), jobz, uplo, n, A, lda, W, out)
-   return out[]
+   return out[] * sizeof($elty)
end

Is the issue caused by the buffer size being multiplied by sizeof($elty), pushing it beyond the range of Int32?
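
The numbers in the first stack trace seem consistent with that (my own back-of-the-envelope check, assuming the value in the InexactError is the scaled byte count):

nbytes = 2410452488               # value from the InexactError above
nbytes > typemax(Int32)           # true: overflows when converted to Cint
nelems = nbytes ÷ sizeof(Float64) # 301306561
nelems <= typemax(Int32)          # true: the unscaled element count still fits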

rexyang624 commented 4 days ago

Calling the explicit 64-bit API CUDA.CUSOLVER.Xsyevd! resolved my issue.
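
For reference, the call looked roughly like this (a sketch assuming the wrapper keeps the (jobz, uplo, A) argument order and returns eigenvalues and vectors for jobz = 'V'; check the wrapper in your installed version):

using LinearAlgebra, CUDA

a = CUDA.rand(Float64, 10000, 10000)
b = a + a'
# 64-bit cuSOLVER API: workspace sizes are passed as Int64, avoiding the
# Int32 truncation hit by the legacy syevd! path
w, v = CUDA.CUSOLVER.Xsyevd!('V', 'U', b)  # assumed return: (values, vectors)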

maleadt commented 4 days ago

The high-level APIs should take care of this though. In any case, https://github.com/JuliaGPU/CUDA.jl/issues/2413 is similar, so I added your MWE there.