mashu opened 1 year ago
Some additional information:
OpenGL renderer string: NVIDIA GeForce RTX 3080/PCIe/SSE2
OpenGL core profile version string: 4.6.0 NVIDIA 510.85.02
OpenGL core profile shading language version string: 4.60 NVIDIA
mateusz@debian:~/model-zoo/vision/diffusion_mnist$ dpkg -l | grep nvidia
ii glx-alternative-nvidia 1.2.1 amd64 allows the selection of NVIDIA as GLX provider
ii libegl-nvidia0:amd64 510.85.02-2 amd64 NVIDIA binary EGL library
ii libgl1-nvidia-glvnd-glx:amd64 510.85.02-2 amd64 NVIDIA binary OpenGL/GLX library (GLVND variant)
ii libgles-nvidia1:amd64 510.85.02-2 amd64 NVIDIA binary OpenGL|ES 1.x library
ii libgles-nvidia2:amd64 510.85.02-2 amd64 NVIDIA binary OpenGL|ES 2.x library
ii libglx-nvidia0:amd64 510.85.02-2 amd64 NVIDIA binary GLX library
ii libnvidia-allocator1:amd64 510.85.02-2 amd64 NVIDIA allocator runtime library
ii libnvidia-cfg1:amd64 510.85.02-2 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-egl-gbm1:amd64 1.1.0-1 amd64 GBM EGL external platform library for NVIDIA
ii libnvidia-egl-wayland1:amd64 1:1.1.10-1 amd64 Wayland EGL External Platform library -- shared library
ii libnvidia-eglcore:amd64 510.85.02-2 amd64 NVIDIA binary EGL core libraries
ii libnvidia-encode1:amd64 510.85.02-2 amd64 NVENC Video Encoding runtime library
ii libnvidia-glcore:amd64 510.85.02-2 amd64 NVIDIA binary OpenGL/GLX core libraries
ii libnvidia-glvkspirv:amd64 510.85.02-2 amd64 NVIDIA binary Vulkan Spir-V compiler library
ii libnvidia-ml1:amd64 510.85.02-2 amd64 NVIDIA Management Library (NVML) runtime library
ii libnvidia-ptxjitcompiler1:amd64 510.85.02-2 amd64 NVIDIA PTX JIT Compiler library
ii libnvidia-rtcore:amd64 510.85.02-2 amd64 NVIDIA binary Vulkan ray tracing (rtcore) library
ii nvidia-alternative 510.85.02-2 amd64 allows the selection of NVIDIA as GLX provider
ii nvidia-driver 510.85.02-2 amd64 NVIDIA metapackage
ii nvidia-driver-bin 510.85.02-2 amd64 NVIDIA driver support binaries
ii nvidia-driver-libs:amd64 510.85.02-2 amd64 NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries)
ii nvidia-egl-common 510.85.02-2 amd64 NVIDIA binary EGL driver - common files
ii nvidia-egl-icd:amd64 510.85.02-2 amd64 NVIDIA EGL installable client driver (ICD)
ii nvidia-installer-cleanup 20220217+1 amd64 cleanup after driver installation with the nvidia-installer
ii nvidia-kernel-common 20220217+1 amd64 NVIDIA binary kernel module support files
ii nvidia-kernel-dkms 510.85.02-2 amd64 NVIDIA binary kernel module DKMS source
ii nvidia-kernel-support 510.85.02-2 amd64 NVIDIA binary kernel module support files
ii nvidia-legacy-check 510.85.02-2 amd64 check for NVIDIA GPUs requiring a legacy driver
ii nvidia-modprobe 515.48.07-1 amd64 utility to load NVIDIA kernel modules and create device nodes
ii nvidia-persistenced 470.129.06-1 amd64 daemon to maintain persistent software state in the NVIDIA driver
ii nvidia-powerd 510.85.02-1 amd64 NVIDIA Dynamic Boost (daemon)
ii nvidia-settings 510.85.02-1 amd64 tool for configuring the NVIDIA graphics driver
ii nvidia-smi 510.85.02-2 amd64 NVIDIA System Management Interface
ii nvidia-support 20220217+1 amd64 NVIDIA binary graphics driver support files
ii nvidia-vaapi-driver:amd64 0.0.5-1+b1 amd64 VA-API implementation that uses NVDEC as a backend
ii nvidia-vdpau-driver:amd64 510.85.02-2 amd64 Video Decode and Presentation API for Unix - NVIDIA driver
ii nvidia-vulkan-common 510.85.02-2 amd64 NVIDIA Vulkan driver - common files
ii nvidia-vulkan-icd:amd64 510.85.02-2 amd64 NVIDIA Vulkan installable client driver (ICD)
ii xserver-xorg-video-nvidia 510.85.02-2 amd64 NVIDIA binary Xorg driver
(diffusion_mnist) pkg> st
Status `~/model-zoo/vision/diffusion_mnist/Project.toml`
[fbb218c0] BSON v0.3.5
[052768ef] CUDA v3.12.0
[0c46a032] DifferentialEquations v7.5.0
[634d3b9d] DrWatson v2.10.0
[5789e2e9] FileIO v1.15.0
[587475ba] Flux v0.13.6
[916415d5] Images v0.25.2
⌅ [eb30cadb] MLDatasets v0.6.0
[d96e819e] Parameters v0.12.3
[91a5bcdd] Plots v1.35.2
[92933f4c] ProgressMeter v1.7.2
[899adc3e] TensorBoardLogger v0.1.19
[56ddb016] Logging
[9a3f8284] Random
Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated`
(diffusion_mnist) pkg> status --outdated
Status `~/model-zoo/vision/diffusion_mnist/Project.toml`
⌅ [eb30cadb] MLDatasets v0.6.0 (<v0.7.5) [compat]
julia> using CUDA
julia> CUDA.version()
v"11.6.0"
julia> CUDA.CUDNN.version()
v"8.30.2"
julia> using Flux
julia> x = rand(1000) |> gpu
1000-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.12327004
0.77609754
0.97980183
...
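As an extra sanity check (my own addition, not part of the original report): a bare CuArray broadcast like the one above never touches cuDNN, but a small Conv layer does, so a forward pass through one exercises the path that later fails:

using Flux, CUDA

# A plain Conv forward pass dispatches to cuDNN, unlike rand(1000) |> gpu.
m = Conv((3, 3), 1 => 4) |> gpu
x = randn(Float32, 28, 28, 1, 8) |> gpu
size(m(x))  # expected: (26, 26, 4, 8)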
Follow-up on this. I installed the system-wide CUDA toolkit and cuDNN:
apt install cuda libcudnn8
julia> CUDA.versioninfo()
Downloaded artifact: CUDA
CUDA toolkit 11.7, artifact installation
NVIDIA driver 520.61.5, for CUDA 11.8
CUDA driver 11.8
Libraries:
Toolchain:
1 device: 0: NVIDIA GeForce RTX 3080 (sm_86, 9.198 GiB / 10.000 GiB available)
I was still getting the error and suspected a mismatch between the 11.7 artifacts (the most recent provided by CUDA.jl) and the system's driver (although this shouldn't cause issues).
2. **Disabled CUDA.jl's artifacts**
I disabled the binary artifacts by starting Julia with
JULIA_CUDA_USE_BINARYBUILDER=false julia
Now I am using the system-wide CUDA toolkit and cuDNN libraries; both match according to
julia> CUDA.versioninfo()
CUDA toolkit 11.8, local installation
NVIDIA driver 520.61.5, for CUDA 11.8
CUDA driver 11.8
Libraries:
Toolchain:
Environment:
1 device: 0: NVIDIA GeForce RTX 3080 (sm_86, 9.194 GiB / 10.000 GiB available)
Still the same error.
3. **CPU training works, therefore debug CUDNN**
With some help on Slack we concluded that the error is actually not an artifacts/API mismatch. Given that the model trains just fine on the CPU, some wrapper in CUDA.jl is suspect.
I tried training with
JULIA_DEBUG=CUDNN JULIA_CUDA_USE_BINARYBUILDER=false julia
which produced the following trace:
julia> train()
[ Info: Training on GPU
[ Info: Start Training, total 50 epochs
[ Info: Epoch 1
┌ Debug: CuDNN (v8600) function cudnnGetVersion() called:
│ Time: 2022-10-06T08:51:43.469044 (0d+0h+0m+57s since start)
│ Process=10202; Thread=10202; GPU=NULL; Handle=NULL; StreamId=NULL.
└ @ CUDA.CUDNN ~/.julia/packages/CUDA/DfvRa/lib/cudnn/CUDNN.jl:136
┌ Debug: CuDNN (v8600) function cudnnCreateConvolutionDescriptor() called:
│ convDesc: location=host; addr=0x7f0fdf75d340;
│ Time: 2022-10-06T08:51:43.625100 (0d+0h+0m+57s since start)
│ Process=10202; Thread=10202; GPU=NULL; Handle=NULL; StreamId=NULL.
└ @ CUDA.CUDNN ~/.julia/packages/CUDA/DfvRa/lib/cudnn/CUDNN.jl:136
ERROR:
┌ Debug: CuDNN (v8600) function cudnnSetConvolutionNdDescriptor() called:
│ convDesc: location=host; addr=0x4b73bbf0;
│ arrayLength: type=int; val=2;
│ padA: type=int; val=[0,0];
│ strideA: type=int; val=[1,1];
│ dilationA: type=int; val=[1,1];
│ mode: type=cudnnConvolutionMode_t; val=CUDNN_CONVOLUTION (0);
│ dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
│ Time: 2022-10-06T08:51:43.687233 (0d+0h+0m+57s since start)
│ Process=10202; Thread=10202; GPU=NULL; Handle=NULL; StreamId=NULL.
└ @ CUDA.CUDNN ~/.julia/packages/CUDA/DfvRa/lib/cudnn/CUDNN.jl:136
CUDNNError:
┌ Debug: CuDNN (v8600) function cudnnSetConvolutionMathType() called:
│ convDesc: location=host; addr=0x4b73bbf0;
│ mathType: type=cudnnMathType_t; val=CUDNN_TENSOR_OP_MATH (1);
│ Time: 2022-10-06T08:51:43.687308 (0d+0h+0m+57s since start)
│ Process=10202; Thread=10202; GPU=NULL; Handle=NULL; StreamId=NULL.
└ @ CUDA.CUDNN ~/.julia/packages/CUDA/DfvRa/lib/cudnn/CUDNN.jl:136
CUDNN_STATUS_BAD_PARAM (code 3)
Stacktrace:
  [1] throw_api_error(res::CUDA.CUDNN.cudnnStatus_t)
    @ CUDA.CUDNN ~/.julia/packages/CUDA/DfvRa/lib/cudnn/error.jl:22
  [2] macro expansion
    @ ~/.julia/packages/CUDA/DfvRa/lib/cudnn/error.jl:35 [inlined]
  [3] cudnnSetConvolutionNdDescriptor(convDesc::Ptr{Nothing}, arrayLength::Int32, padA::Vector{Int32}, filterStrideA::Vector{Int32}, dilationA::Vector{Int32}, mode::CUDA.CUDNN.cudnnConvolutionMode_t, computeType::CUDA.CUDNN.cudnnDataType_t)
    @ CUDA.CUDNN ~/.julia/packages/CUDA/DfvRa/lib/utils/call.jl:26
  [4] cudnnSetConvolutionDescriptor(ptr::Ptr{Nothing}, padding::Vector{Int32}, stride::Vector{Int32}, dilation::Vector{Int32}, mode::CUDA.CUDNN.cudnnConvolutionMode_t, dataType::CUDA.CUDNN.cudnnDataType_t, mathType::CUDA.CUDNN.cudnnMathType_t, reorderType::CUDA.CUDNN.cudnnReorderType_t, groupCount::Int32)
    @ CUDA.CUDNN ~/.julia/packages/CUDA/DfvRa/lib/cudnn/convolution.jl:135
  [5] CUDA.CUDNN.cudnnConvolutionDescriptor(::Vector{Int32}, ::Vararg{Any})
    @ CUDA.CUDNN ~/.julia/packages/CUDA/DfvRa/lib/cudnn/descriptors.jl:39
  [6] CUDA.CUDNN.cudnnConvolutionDescriptor(cdims::DenseConvDims{2, 2, 2, 4, 2}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, pad::Tuple{Int64, Int64})
    @ NNlibCUDA ~/.julia/packages/NNlibCUDA/kCpTE/src/cudnn/conv.jl:48
  [7] cudnnConvolutionDescriptorAndPaddedInput
    @ ~/.julia/packages/NNlibCUDA/kCpTE/src/cudnn/conv.jl:43 [inlined]
  [8] ∇conv_data!(dx::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, dy::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, cdims::DenseConvDims{2, 2, 2, 4, 2}; alpha::Int64, beta::Int64, algo::Int64)
    @ NNlibCUDA ~/.julia/packages/NNlibCUDA/kCpTE/src/cudnn/conv.jl:98
  [9] ∇conv_data!
    @ ~/.julia/packages/NNlibCUDA/kCpTE/src/cudnn/conv.jl:89 [inlined]
 [10] #∇conv_data#198
    @ ~/.julia/packages/NNlib/0QnJJ/src/conv.jl:99 [inlined]
 [11] ∇conv_data
    @ ~/.julia/packages/NNlib/0QnJJ/src/conv.jl:95 [inlined]
 [12] #rrule#318
    @ ~/.julia/packages/NNlib/0QnJJ/src/conv.jl:326 [inlined]
 [13] rrule
    @ ~/.julia/packages/NNlib/0QnJJ/src/conv.jl:316 [inlined]
 [14] rrule
    @ ~/.julia/packages/ChainRulesCore/C73ay/src/rules.jl:134 [inlined]
 [15] chain_rrule
    @ ~/.julia/packages/Zygote/dABKa/src/compiler/chainrules.jl:218 [inlined]
 [16] macro expansion
    @ ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0 [inlined]
 [17] _pullback
    @ ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:9 [inlined]
 [18] _pullback
    @ ~/.julia/packages/Flux/4k0Ls/src/layers/conv.jl:333 [inlined]
 [19] _pullback(ctx::Zygote.Context{true}, f::ConvTranspose{2, 4, typeof(identity), CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Bool}, args::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [20] _pullback
    @ ~/model-zoo/vision/diffusion_mnist/diffusion_mnist.jl:140 [inlined]
 [21] _pullback(::Zygote.Context{true}, ::UNet, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [22] _pullback
    @ ~/model-zoo/vision/diffusion_mnist/diffusion_mnist.jl:185 [inlined]
 [23] _pullback(::Zygote.Context{true}, ::typeof(model_loss), ::UNet, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::Float32)
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [24] _pullback
    @ ~/model-zoo/vision/diffusion_mnist/diffusion_mnist.jl:176 [inlined]
 [25] _pullback(::Zygote.Context{true}, ::typeof(model_loss), ::UNet, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [26] _pullback
    @ ~/model-zoo/vision/diffusion_mnist/diffusion_mnist.jl:265 [inlined]
 [27] _pullback(::Zygote.Context{true}, ::var"#12#14"{UNet})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface2.jl:0
 [28] pullback(f::Function, ps::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:373
 [29] withgradient(f::Function, args::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}})
    @ Zygote ~/.julia/packages/Zygote/dABKa/src/compiler/interface.jl:123
 [30] train(; kws::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Main ~/model-zoo/vision/diffusion_mnist/diffusion_mnist.jl:264
 [31] train()
    @ Main ~/model-zoo/vision/diffusion_mnist/diffusion_mnist.jl:222
 [32] top-level scope
    @ REPL[3]:1
 [33] top-level scope
    @ ~/.julia/packages/CUDA/DfvRa/src/initialization.jl:52
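For reference, here is a sketch of my own for setting the same switches from inside Julia rather than on the shell command line; since both are consulted around package load time, set them before CUDA.jl is first loaded:

# Set both before `using CUDA` so they take effect at load time:
ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"  # prefer the local CUDA/cuDNN install
ENV["JULIA_DEBUG"] = "CUDNN"                   # enable the per-call cuDNN traces above
using CUDA, Flux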
I did a few more tests:
model | trains | CUDA driver | cuDNN | kernel driver | binary builder |
---|---|---|---|---|---|
diffusion_mnist | no | 11.8 | 8.30.2 (for CUDA 11.5.0) | 520.61.5 (for CUDA 11.8) | yes |
diffusion_mnist | no | 11.8 | 8.60.0 (for CUDA 11.8.0) | 520.61.5 (for CUDA 11.8) | no |
vae_mnist | yes | 11.8 | 8.30.2 (for CUDA 11.5.0) | 520.61.5 (for CUDA 11.8) | yes |
vae_mnist | yes | 11.8 | 8.60.0 (for CUDA 11.8.0) | 520.61.5 (for CUDA 11.8) | no |
vgg_cifar10 | yes/no* | 11.8 | 8.30.2 (for CUDA 11.5.0) | 520.61.5 (for CUDA 11.8) | yes |
vgg_cifar10 | yes | 11.8 | 8.60.0 (for CUDA 11.8.0) | 520.61.5 (for CUDA 11.8) | no |
conv_mnist | yes | 11.8 | 8.30.2 (for CUDA 11.5.0) | 520.61.5 (for CUDA 11.8) | yes |
conv_mnist | yes | 11.8 | 8.60.0 (for CUDA 11.8.0) | 520.61.5 (for CUDA 11.8) | no |
Regarding vgg_cifar10: the first time I started a Julia session and updated packages, I got CUDNN_STATUS_INTERNAL_ERROR (code 4), but after restarting the Julia session and trying again, the model trains just fine. This is not the case with diffusion_mnist, which never works and always throws CUDNN_STATUS_BAD_PARAM (code 3).
diffusion_mnist also trains fine on the CPU.
If you're able to narrow it down to the particular conv layer which is throwing the error, we can try to replicate those conditions in an MWE.
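One way to do that narrowing (a sketch of my own, assuming Flux.modules can enumerate the model's layers and that 2D ConvTranspose weights are laid out (k, k, out_ch, in_ch)):

using Flux, CUDA

# Try every ConvTranspose of the model in isolation on the GPU with a random
# input of matching channel count, and report each one that throws.
function find_failing_tconv(model; hw = 16, batch = 2)
    for (i, l) in enumerate(Flux.modules(model))
        l isa ConvTranspose || continue
        cin = size(l.weight, 4)  # assumed (k, k, out_ch, in_ch) layout
        x = CUDA.randn(Float32, hw, hw, cin, batch)
        try
            l(x)
        catch err
            @warn "ConvTranspose #$i failed" l err
        end
    end
end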
The model is broken because it uses negative padding, which is not supported on the GPU. MWE below:
using Flux, CUDA

tconv3 = ConvTranspose((3, 3), 64 => 32; stride=2, pad=(0, -1, 0, -1), bias=false) |> gpu
X = randn(Float32, 28, 28, 64, 32) |> gpu
tconv3(X)
ERROR: CUDNNError: CUDNN_STATUS_BAD_PARAM (code 3)
Stacktrace:
[1] throw_api_error(res::CUDA.CUDNN.cudnnStatus_t)
But the CPU version works as the author intended (X here is the equivalent CPU array, randn(Float32, 28, 28, 64, 32)):
[size(ConvTranspose((3,3), 64=>32, pad=i)(X)) for i in [1,0,-1,-2]]
(28, 28, 32, 32)
(30, 30, 32, 32)
(32, 32, 32, 32)
(34, 34, 32, 32)
[size(Conv((3,3), 64=>32, pad=i)(X)) for i in [1,0,-1,-2]]
(28, 28, 32, 32)
(26, 26, 32, 32)
(24, 24, 32, 32)
(22, 22, 32, 32)
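The printed sizes match the usual output-size formulas; a quick check of my own confirms this:

# Spatial output size along one dim (input I, kernel K, stride S, pad P per side).
conv_transpose_out(I, K, S, P) = (I - 1) * S + K - 2P   # ConvTranspose
conv_out(I, K, S, P) = div(I + 2P - K, S) + 1           # Conv

@assert [conv_transpose_out(28, 3, 1, p) for p in (1, 0, -1, -2)] == [28, 30, 32, 34]
@assert [conv_out(28, 3, 1, p) for p in (1, 0, -1, -2)] == [28, 26, 24, 22]

So negative padding enlarges a ConvTranspose output, which is exactly the case cuDNN refuses to express.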
I don't know if this is the same in the original model, though I doubt it, because PyTorch does not support negative padding (although there have been discussions about supporting it):
torch.nn.Conv2d(64,32,(3,3),stride=2,padding=(-1,-1)).to("cuda")(X)
RuntimeError: negative padding is not supported
It is also not supported by NVIDIA cuDNN: https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnSetConvolutionNdDescriptor
> CUDNN_STATUS_BAD_PARAM
> - One of the elements of padA is strictly negative.
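If one wanted to keep the intended shapes on the GPU anyway, a possible workaround (a sketch of my own, not the upstream fix; it assumes bias=false and NNlib's pad_zeros API) is to run the transposed convolution with pad=0 and append the extra zero border that pad=(0, -1, 0, -1) would have produced, since no input contributes to those positions:

using Flux, CUDA
using NNlib: pad_zeros

# pad=(0,-1,0,-1) enlarges the pad=0 output by one zero row/column at the
# high end of each spatial dim, so we can reproduce it explicitly:
tconv = ConvTranspose((3, 3), 64 => 32; stride=2, bias=false) |> gpu
emulate_negpad(x) = pad_zeros(tconv(x), (0, 1, 0, 1); dims=(1, 2))

X = randn(Float32, 28, 28, 64, 32) |> gpu
size(emulate_negpad(X))  # (58, 58, 32, 32), matching pad=(0, -1, 0, -1) on the CPU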