JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

OOM when evaluating a small resnet (with both Flux and Knet) #714

Closed: jonathan-laurent closed this issue 3 years ago

jonathan-laurent commented 3 years ago

I have been encountering what I interpret as OOM errors when training small resnets.

The problem happens with both Flux and Knet and with both types of memory pools (binned or split).

I am providing replication instructions below. Replicating the errors takes about a minute.

I have observed several different errors; see the backtraces at the end of this message.

Replication instructions

export JULIA_CUDA_MEMORY_POOL=binned #split
export ALPHAZERO_DEFAULT_DL_FRAMEWORK=FLUX #KNET

git clone --branch cuda-oom https://github.com/jonathan-laurent/AlphaZero.jl.git
cd AlphaZero.jl
julia --project -e "import Pkg; Pkg.instantiate()"

NUM_FILTERS=64 julia --project scripts/profile/debug_oom.jl

Configuration

Julia:

Julia Version 1.6.0-rc1
Commit a58bdd9010 (2021-02-06 15:49 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_CUDA_MEMORY_POOL = split
  JULIA_NUM_THREADS = 6

CUDA.jl package version: v2.6.1

CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.1.0
NVIDIA driver 455.23.5

Libraries: 
- CUBLAS: 11.2.1
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: 11.0.0+455.23.5
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.6.0-rc1
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Environment:
- JULIA_CUDA_MEMORY_POOL: split

1 device:
  0: GeForce RTX 2070 (sm_75, 7.462 GiB / 7.793 GiB available)

Backtraces

When using Flux:

ERROR: LoadError: CUDNNError: CUDNN_STATUS_EXECUTION_FAILED (code 8)
Stacktrace:
  [1] throw_api_error(res::CUDA.CUDNN.cudnnStatus_t)
    @ CUDA.CUDNN ~/.julia/packages/CUDA/Zmd60/lib/cudnn/error.jl:19
  [2] macro expansion
    @ ~/.julia/packages/CUDA/Zmd60/lib/cudnn/error.jl:30 [inlined]
  [3] cudnnBatchNormalizationForwardTraining(handle::Ptr{Nothing}, mode::CUDA.CUDNN.cudnnBatchNormMode_t, alpha::Base.RefValue{Float32}, beta::Base.RefValue{Float32}, xDesc::CUDA.CUDNN.TensorDesc, x::CUDA.CuArray{Float32, 4}, yDesc::CUDA.CUDNN.TensorDesc, y::CUDA.CuArray{Float32, 4}, bnScaleBiasMeanVarDesc::CUDA.CUDNN.TensorDesc, bnScale::CUDA.CuArray{Float32, 1}, bnBias::CUDA.CuArray{Float32, 1}, exponentialAverageFactor::Float32, resultRunningMean::CUDA.CuArray{Float32, 1}, resultRunningVariance::CUDA.CuArray{Float32, 1}, epsilon::Float32, resultSaveMean::CUDA.CuPtr{Nothing}, resultSaveInvVariance::CUDA.CuPtr{Nothing})
    @ CUDA.CUDNN ~/.julia/packages/CUDA/Zmd60/lib/utils/call.jl:26
  [4] cudnnBNForward!(y::CUDA.CuArray{Float32, 4}, g::CUDA.CuArray{Float32, 1}, b::CUDA.CuArray{Float32, 1}, x::CUDA.CuArray{Float32, 4}, running_mean::CUDA.CuArray{Float32, 1}, running_var::CUDA.CuArray{Float32, 1}, momentum::Float32; cache::Nothing, alpha::Int64, beta::Int64, eps::Float32, training::Bool)
    @ CUDA.CUDNN ~/.julia/packages/CUDA/Zmd60/lib/cudnn/batchnorm.jl:53
  [5] #batchnorm#42
    @ ~/.julia/packages/CUDA/Zmd60/lib/cudnn/batchnorm.jl:25 [inlined]
  [6] #adjoint#17
    @ ~/.julia/packages/Flux/goUGu/src/cuda/cudnn.jl:6 [inlined]
  [7] _pullback(__context__::Zygote.Context, #unused#::CUDA.CUDNN.var"#batchnorm##kw", kw::NamedTuple{(:cache, :alpha, :beta, :eps, :training), Tuple{Nothing, Int64, Int64, Float32, Bool}}, 267::typeof(CUDA.CUDNN.batchnorm), g::CUDA.CuArray{Float32, 1}, b::CUDA.CuArray{Float32, 1}, x::CUDA.CuArray{Float32, 4}, running_mean::CUDA.CuArray{Float32, 1}, running_var::CUDA.CuArray{Float32, 1}, momentum::Float32)
    @ Flux.CUDAint ~/.julia/packages/ZygoteRules/OjfTt/src/adjoint.jl:63
  [8] _pullback
    @ ~/.julia/packages/Flux/goUGu/src/cuda/cudnn.jl:3 [inlined]
  [9] _pullback(::Zygote.Context, ::Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, CUDA.CuArray{Float32, 1}, Float32}, ::CUDA.CuArray{Float32, 4}, ::Nothing)
    @ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface2.jl:0
 [10] _pullback
    @ ~/.julia/packages/Flux/goUGu/src/cuda/cudnn.jl:3 [inlined]
 [11] _pullback(ctx::Zygote.Context, f::Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, CUDA.CuArray{Float32, 1}, Float32}, args::CUDA.CuArray{Float32, 4})
    @ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface2.jl:0
 [12] _pullback
    @ ~/.julia/packages/Flux/goUGu/src/layers/basic.jl:36 [inlined]
 [13] _pullback(::Zygote.Context, ::typeof(Flux.applychain), ::Tuple{Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, CUDA.CuArray{Float32, 1}, Float32}, Flux.Chain{Tuple{Flux.SkipConnection, AlphaZero.FluxLib.var"#17#18"}}, Flux.Chain{Tuple{Flux.SkipConnection, AlphaZero.FluxLib.var"#17#18"}}, Flux.Chain{Tuple{Flux.SkipConnection, AlphaZero.FluxLib.var"#17#18"}}, Flux.Chain{Tuple{Flux.SkipConnection, AlphaZero.FluxLib.var"#17#18"}}, Flux.Chain{Tuple{Flux.SkipConnection, AlphaZero.FluxLib.var"#17#18"}}}, ::CUDA.CuArray{Float32, 4})
    @ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface2.jl:0
 [14] _pullback
    @ ~/.julia/packages/Flux/goUGu/src/layers/basic.jl:36 [inlined]
 [15] _pullback(::Zygote.Context, ::typeof(Flux.applychain), ::Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, CUDA.CuArray{Float32, 1}, Float32}, Flux.Chain{Tuple{Flux.SkipConnection, AlphaZero.FluxLib.var"#17#18"}}, Flux.Chain{Tuple{Flux.SkipConnection, AlphaZero.FluxLib.var"#17#18"}}, Flux.Chain{Tuple{Flux.SkipConnection, AlphaZero.FluxLib.var"#17#18"}}, Flux.Chain{Tuple{Flux.SkipConnection, AlphaZero.FluxLib.var"#17#18"}}, Flux.Chain{Tuple{Flux.SkipConnection, AlphaZero.FluxLib.var"#17#18"}}}, ::CUDA.CuArray{Float32, 4})
    @ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface2.jl:0
 [16] _pullback
    @ ~/.julia/packages/Flux/goUGu/src/layers/basic.jl:38 [inlined]
 [17] _pullback(ctx::Zygote.Context, f::Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, CUDA.CuArray{Float32, 1}, Float32}, Flux.Chain{Tuple{Flux.SkipConnection, AlphaZero.FluxLib.var"#17#18"}}, Flux.Chain{Tuple{Flux.SkipConnection, AlphaZero.FluxLib.var"#17#18"}}, Flux.Chain{Tuple{Flux.SkipConnection, AlphaZero.FluxLib.var"#17#18"}}, Flux.Chain{Tuple{Flux.SkipConnection, AlphaZero.FluxLib.var"#17#18"}}, Flux.Chain{Tuple{Flux.SkipConnection, AlphaZero.FluxLib.var"#17#18"}}}}, args::CUDA.CuArray{Float32, 4})
    @ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface2.jl:0
 [18] _pullback
    @ ~/AlphaZero.jl/src/networks/flux.jl:160 [inlined]
 [19] _pullback(::Zygote.Context, ::typeof(AlphaZero.Network.forward), ::ResNet, ::CUDA.CuArray{Float32, 4})
    @ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface2.jl:0
 [20] _pullback
    @ ~/AlphaZero.jl/src/networks/network.jl:260 [inlined]
 [21] _pullback(::Zygote.Context, ::typeof(AlphaZero.Network.forward_normalized), ::ResNet, ::CUDA.CuArray{Float32, 4}, ::CUDA.CuArray{Float32, 2})
    @ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface2.jl:0
 [22] _pullback
    @ ~/AlphaZero.jl/src/learning.jl:70 [inlined]
 [23] _pullback(::Zygote.Context, ::typeof(AlphaZero.losses), ::ResNet, ::LearningParams, ::Float32, ::Float32, ::Tuple{CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 2}})
    @ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface2.jl:0
 [24] _pullback
    @ ~/AlphaZero.jl/src/learning.jl:122 [inlined]
 [25] _pullback(::Zygote.Context, ::AlphaZero.var"#L#110"{AlphaZero.Trainer}, ::CUDA.CuArray{Float32, 2}, ::CUDA.CuArray{Float32, 4}, ::CUDA.CuArray{Float32, 2}, ::CUDA.CuArray{Float32, 2}, ::CUDA.CuArray{Float32, 2})
    @ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface2.jl:0
 [26] adjoint
    @ ~/.julia/packages/Zygote/KpME9/src/lib/lib.jl:188 [inlined]
 [27] _pullback
    @ ~/.julia/packages/ZygoteRules/OjfTt/src/adjoint.jl:57 [inlined]
 [28] _pullback
    @ ~/AlphaZero.jl/src/networks/flux.jl:82 [inlined]
 [29] _pullback(::Zygote.Context, ::AlphaZero.FluxLib.var"#1#2"{AlphaZero.var"#L#110"{AlphaZero.Trainer}, Tuple{CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 2}}})
    @ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface2.jl:0
 [30] pullback(f::Function, ps::Zygote.Params)
    @ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface.jl:167
 [31] lossgrads(f::Function, args::Zygote.Params)
    @ AlphaZero.FluxLib ~/AlphaZero.jl/src/networks/flux.jl:72
 [32] train!(callback::AlphaZero.var"#109#111"{Vector{Float32}}, nn::ResNet, opt::Adam, loss::Function, data::Base.Iterators.Take{Base.Iterators.Stateful{Base.Iterators.Flatten{Base.Generator{Base.Iterators.Repeated{Nothing}, AlphaZero.Util.var"#12#13"{AlphaZero.var"#106#108"{ResNet}, Tuple{Matrix{Float32}, Array{Float32, 4}, Matrix{Float32}, Matrix{Float32}, Matrix{Float32}}, Int64, Bool}}}, Tuple{NTuple{5, Any}, Tuple{Nothing, Base.Generator{Vector{Tuple{Matrix{Float32}, Array{Float32, 4}, Matrix{Float32}, Matrix{Float32}, Matrix{Float32}}}, AlphaZero.Util.var"#9#11"{AlphaZero.var"#106#108"{ResNet}}}, Int64}}}}, n::Int64)
    @ AlphaZero.FluxLib ~/AlphaZero.jl/src/networks/flux.jl:81
 [33] batch_updates!(tr::AlphaZero.Trainer, n::Int64)
    @ AlphaZero ~/AlphaZero.jl/src/learning.jl:125
 [34] macro expansion
    @ ./timing.jl:356 [inlined]
 [35] learning_step!(env::Env{AlphaZero.Examples.ConnectFour.GameSpec, ResNet, NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}}, handler::Session{Env{AlphaZero.Examples.ConnectFour.GameSpec, ResNet, NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}}})
    @ AlphaZero ~/AlphaZero.jl/src/training.jl:223
 [36] top-level scope
    @ ~/AlphaZero.jl/scripts/profile/debug_oom.jl:39
in expression starting at /home/jonathan/AlphaZero.jl/scripts/profile/debug_oom.jl:39

When using Knet (1/2):

ERROR: CUDNNError: CUDNN_STATUS_INTERNAL_ERROR (code 4)
Stacktrace:
  [1] throw_api_error(res::CUDA.CUDNN.cudnnStatus_t)
    @ CUDA.CUDNN ~/.julia/packages/CUDA/Zmd60/lib/cudnn/error.jl:19
  [2] macro expansion
    @ ~/.julia/packages/CUDA/Zmd60/lib/cudnn/error.jl:30 [inlined]
  [3] cudnnFindConvolutionBackwardFilterAlgorithmEx(handle::Ptr{Nothing}, xDesc::Knet.Ops20_gpu.TD, x::Knet.KnetArrays.KnetArray{Float32, 4}, dyDesc::Knet.Ops20_gpu.TD, y::Knet.KnetArrays.KnetArray{Float32, 4}, convDesc::Knet.Ops20_gpu.CD, dwDesc::Knet.Ops20_gpu.FD, dw::Knet.KnetArrays.KnetArray{Float32, 4}, requestedAlgoCount::Int64, returnedAlgoCount::Vector{Int32}, perfResults::Vector{CUDA.CUDNN.cudnnConvolutionBwdFilterAlgoPerf_t}, workSpace::Knet.KnetArrays.KnetVector{Float32}, workSpaceSizeInBytes::Int64)
    @ CUDA.CUDNN ~/.julia/packages/CUDA/Zmd60/lib/utils/call.jl:26
  [4] conv4w_algo(w::Knet.KnetArrays.KnetArray{Float32, 4}, x::Knet.KnetArrays.KnetArray{Float32, 4}, dy::Knet.KnetArrays.KnetArray{Float32, 4}, dw::Knet.KnetArrays.KnetArray{Float32, 4}; handle::Ptr{Nothing}, o::Base.Iterators.Pairs{Symbol, Int64, Tuple{Symbol}, NamedTuple{(:padding,), Tuple{Int64}}})
    @ Knet.Ops20_gpu ~/.julia/packages/Knet/C0PoK/src/ops20_gpu/conv.jl:194
  [5] conv4w(w::Knet.KnetArrays.KnetArray{Float32, 4}, x::Knet.KnetArrays.KnetArray{Float32, 4}, dy::Knet.KnetArrays.KnetArray{Float32, 4}; handle::Ptr{Nothing}, alpha::Int64, o::Base.Iterators.Pairs{Symbol, Int64, Tuple{Symbol}, NamedTuple{(:padding,), Tuple{Int64}}})
    @ Knet.Ops20_gpu ~/.julia/packages/Knet/C0PoK/src/ops20_gpu/conv.jl:27
  [6] forw(::Function, ::AutoGrad.Param{Knet.KnetArrays.KnetArray{Float32, 4}}, ::Vararg{Any, N} where N; kwargs::Base.Iterators.Pairs{Symbol, Int64, Tuple{Symbol}, NamedTuple{(:padding,), Tuple{Int64}}})
    @ AutoGrad ~/.julia/packages/AutoGrad/TTpeo/src/core.jl:66
  [7] #conv4w#47
    @ ./none:0 [inlined]
  [8] #back#23
    @ ./none:0 [inlined]
  [9] differentiate(::Function; o::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ AutoGrad ~/.julia/packages/AutoGrad/TTpeo/src/core.jl:165
 [10] differentiate
    @ ~/.julia/packages/AutoGrad/TTpeo/src/core.jl:135 [inlined]
 [11] iterate
    @ ~/.julia/packages/Knet/C0PoK/src/train20/train.jl:26 [inlined]
 [12] iterate
    @ ./iterators.jl:159 [inlined]
 [13] iterate
    @ ./iterators.jl:158 [inlined]
 [14] train!(callback::AlphaZero.var"#109#111"{Vector{Float32}}, nn::ResNet, opt::Adam, loss::Function, data::Base.Iterators.Take{Base.Iterators.Stateful{Base.Iterators.Flatten{Base.Generator{Base.Iterators.Repeated{Nothing}, AlphaZero.Util.var"#12#13"{AlphaZero.var"#106#108"{ResNet}, Tuple{Matrix{Float32}, Array{Float32, 4}, Matrix{Float32}, Matrix{Float32}, Matrix{Float32}}, Int64, Bool}}}, Tuple{Tuple{Union{Matrix{Float32}, Knet.KnetArrays.KnetMatrix{Float32}}, Union{Array{Float32, 4}, Knet.KnetArrays.KnetArray{Float32, 4}}, Union{Matrix{Float32}, Knet.KnetArrays.KnetMatrix{Float32}}, Union{Matrix{Float32}, Knet.KnetArrays.KnetMatrix{Float32}}, Union{Matrix{Float32}, Knet.KnetArrays.KnetMatrix{Float32}}}, Tuple{Nothing, Base.Generator{Vector{Tuple{Matrix{Float32}, Array{Float32, 4}, Matrix{Float32}, Matrix{Float32}, Matrix{Float32}}}, AlphaZero.Util.var"#9#11"{AlphaZero.var"#106#108"{ResNet}}}, Int64}}}}, n::Int64)
    @ AlphaZero.KnetLib ~/AlphaZero.jl/src/networks/knet.jl:120
 [15] batch_updates!(tr::AlphaZero.Trainer, n::Int64)
    @ AlphaZero ~/AlphaZero.jl/src/learning.jl:125
 [16] macro expansion
    @ ./timing.jl:356 [inlined]
 [17] learning_step!(env::Env{AlphaZero.Examples.ConnectFour.GameSpec, ResNet, NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}}, handler::Session{Env{AlphaZero.Examples.ConnectFour.GameSpec, ResNet, NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}}})
    @ AlphaZero ~/AlphaZero.jl/src/training.jl:223
 [18] macro expansion
    @ ./timing.jl:356 [inlined]
 [19] macro expansion
    @ ~/AlphaZero.jl/src/report.jl:267 [inlined]
 [20] train!(env::Env{AlphaZero.Examples.ConnectFour.GameSpec, ResNet, NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}}, handler::Session{Env{AlphaZero.Examples.ConnectFour.GameSpec, ResNet, NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}}})
    @ AlphaZero ~/AlphaZero.jl/src/training.jl:326
 [21] resume!(session::Session{Env{AlphaZero.Examples.ConnectFour.GameSpec, ResNet, NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}}})
    @ AlphaZero.UserInterface ~/AlphaZero.jl/src/ui/session.jl:316
 [22] train(e::Experiment; args::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ AlphaZero.Scripts ~/AlphaZero.jl/src/scripts/scripts.jl:26
 [23] train
    @ ~/AlphaZero.jl/src/scripts/scripts.jl:26 [inlined]
 [24] #train#15
    @ ~/AlphaZero.jl/src/scripts/scripts.jl:28 [inlined]
 [25] train(s::String)
    @ AlphaZero.Scripts ~/AlphaZero.jl/src/scripts/scripts.jl:28
 [26] top-level scope
    @ none:1

When using Knet (2/2):

ERROR: MethodError: no method matching LinearIndices(::Knet.KnetArrays.KnetVector{Float32})
Closest candidates are:
  LinearIndices(::Tuple{}) at indices.jl:451
  LinearIndices(::R) where {N, R<:Tuple{Vararg{AbstractUnitRange{Int64}, N}}} at indices.jl:448
  LinearIndices(::Tuple{Vararg{AbstractUnitRange{var"#s77"} where var"#s77"<:Integer, N}}) where N at indices.jl:452
  ...
Stacktrace:
  [1] compute_linindex
    @ ./subarray.jl:395 [inlined]
  [2] compute_offset1
    @ ./subarray.jl:387 [inlined]
  [3] compute_offset1
    @ ./subarray.jl:385 [inlined]
  [4] SubArray
    @ ./subarray.jl:38 [inlined]
  [5] SubArray
    @ ~/.julia/packages/Knet/C0PoK/src/knetarrays/dotview.jl:37 [inlined]
  [6] unsafe_view
    @ ~/.julia/packages/Knet/C0PoK/src/knetarrays/dotview.jl:21 [inlined]
  [7] view
    @ ~/.julia/packages/Knet/C0PoK/src/knetarrays/dotview.jl:16 [inlined]
  [8] dotview(A::Knet.KnetArrays.KnetMatrix{Float32}, I::Function)
    @ Knet.KnetArrays ~/.julia/packages/Knet/C0PoK/src/knetarrays/dotview.jl:10
  [9] fill!(a::Knet.KnetArrays.KnetMatrix{Float32}, x::Float32)
    @ Knet.KnetArrays ~/.julia/packages/Knet/C0PoK/src/knetarrays/abstractarray.jl:13
 [10] sum(x::Knet.KnetArrays.KnetMatrix{Float32}; dims::Vector{Any})
    @ Knet.KnetArrays ~/.julia/packages/Knet/C0PoK/src/knetarrays/reduction.jl:41
 [11] unbroadcast(x::AutoGrad.Param{Knet.KnetArrays.KnetVector{Float32}}, dx::Knet.KnetArrays.KnetMatrix{Float32})
    @ AutoGrad ~/.julia/packages/AutoGrad/TTpeo/src/unbroadcast.jl:24
 [12] back(#unused#::typeof(Base.Broadcast.broadcasted), #unused#::Type{AutoGrad.Arg{3}}, dy::Knet.KnetArrays.KnetMatrix{Float32}, 269::AutoGrad.Result{Knet.KnetArrays.KnetMatrix{Float32}}, #unused#::typeof(+), x1::AutoGrad.Result{Knet.KnetArrays.KnetMatrix{Float32}}, x2::AutoGrad.Param{Knet.KnetArrays.KnetVector{Float32}})
    @ AutoGrad ./none:0
 [13] differentiate(::Function; o::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ AutoGrad ~/.julia/packages/AutoGrad/TTpeo/src/core.jl:165
 [14] differentiate
    @ ~/.julia/packages/AutoGrad/TTpeo/src/core.jl:135 [inlined]
 [15] iterate
    @ ~/.julia/packages/Knet/C0PoK/src/train20/train.jl:26 [inlined]
 [16] iterate
    @ ./iterators.jl:159 [inlined]
 [17] iterate
    @ ./iterators.jl:158 [inlined]
 [18] train!(callback::AlphaZero.var"#109#111"{Vector{Float32}}, nn::ResNet, opt::Adam, loss::Function, data::Base.Iterators.Take{Base.Iterators.Stateful{Base.Iterators.Flatten{Base.Generator{Base.Iterators.Repeated{Nothing}, AlphaZero.Util.var"#12#13"{AlphaZero.var"#106#108"{ResNet}, Tuple{Matrix{Float32}, Array{Float32, 4}, Matrix{Float32}, Matrix{Float32}, Matrix{Float32}}, Int64, Bool}}}, Tuple{Tuple{Union{Matrix{Float32}, Knet.KnetArrays.KnetMatrix{Float32}}, Union{Array{Float32, 4}, Knet.KnetArrays.KnetArray{Float32, 4}}, Union{Matrix{Float32}, Knet.KnetArrays.KnetMatrix{Float32}}, Union{Matrix{Float32}, Knet.KnetArrays.KnetMatrix{Float32}}, Union{Matrix{Float32}, Knet.KnetArrays.KnetMatrix{Float32}}}, Tuple{Nothing, Base.Generator{Vector{Tuple{Matrix{Float32}, Array{Float32, 4}, Matrix{Float32}, Matrix{Float32}, Matrix{Float32}}}, AlphaZero.Util.var"#9#11"{AlphaZero.var"#106#108"{ResNet}}}, Int64}}}}, n::Int64)
    @ AlphaZero.KnetLib ~/AlphaZero.jl/src/networks/knet.jl:119
 [19] batch_updates!(tr::AlphaZero.Trainer, n::Int64)
    @ AlphaZero ~/AlphaZero.jl/src/learning.jl:125
 [20] macro expansion
    @ ./timing.jl:356 [inlined]
 [21] learning_step!(env::Env{AlphaZero.Examples.ConnectFour.GameSpec, ResNet, NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}}, handler::Session{Env{AlphaZero.Examples.ConnectFour.GameSpec, ResNet, NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}}})
    @ AlphaZero ~/AlphaZero.jl/src/training.jl:223
 [22] macro expansion
    @ ./timing.jl:356 [inlined]
 [23] macro expansion
    @ ~/AlphaZero.jl/src/report.jl:267 [inlined]
 [24] train!(env::Env{AlphaZero.Examples.ConnectFour.GameSpec, ResNet, NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}}, handler::Session{Env{AlphaZero.Examples.ConnectFour.GameSpec, ResNet, NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}}})
    @ AlphaZero ~/AlphaZero.jl/src/training.jl:326
 [25] resume!(session::Session{Env{AlphaZero.Examples.ConnectFour.GameSpec, ResNet, NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}}})
    @ AlphaZero.UserInterface ~/AlphaZero.jl/src/ui/session.jl:316
 [26] train(e::Experiment; args::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ AlphaZero.Scripts ~/AlphaZero.jl/src/scripts/scripts.jl:26
 [27] train
    @ ~/AlphaZero.jl/src/scripts/scripts.jl:26 [inlined]
 [28] #train#15
    @ ~/AlphaZero.jl/src/scripts/scripts.jl:28 [inlined]
 [29] train(s::String)
    @ AlphaZero.Scripts ~/AlphaZero.jl/src/scripts/scripts.jl:28
 [30] top-level scope
    @ none:1
DrChainsaw commented 3 years ago

NVIDIA driver 455.23.5

Shot in the dark: Update your graphics driver.

I also ran into all kinds of OOM-like errors just now after updating from CUDA 2.1.6 to 2.4.1; after updating to 461.40, I have yet to see one.

I'm on Windows 10, by the way.

jonathan-laurent commented 3 years ago

I updated my GPU driver to 460.32.3, which is the version included in the latest CUDA Toolkit, and I am still encountering the same problem. Is anyone else able to replicate the errors I am getting on their setup?

Also, I wanted to try and use CUDA#master but I got the following runtime error:

julia: symbol lookup error: /home/jonathan/.julia/artifacts/e99dab5d7bdf5b60da265bae5e949189d907a56b/lib/libcublas.so.11: undefined symbol: cublasLtSSSMatmulAlgoGetHeuristic, version libcublasLt.so.11

Updated CUDA.versioninfo()

CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.32.3

Libraries: 
- CUBLAS: 11.2.1
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: 11.0.0+460.32.3
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.6.0-rc1
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Environment:
- JULIA_CUDA_MEMORY_POOL: split

1 device:
  0: GeForce RTX 2070 (sm_75, 7.469 GiB / 7.793 GiB available)
maleadt commented 3 years ago

julia: symbol lookup error: /home/jonathan/.julia/artifacts/e99dab5d7bdf5b60da265bae5e949189d907a56b/lib/libcublas.so.11: undefined symbol: cublasLtSSSMatmulAlgoGetHeuristic, version libcublasLt.so.11

That's weird, haven't seen that. Are you on latest master? It should select CUDA 11.2, based on your driver. But I don't think it should affect the OOM reported here. I can have a look later, but if you run with julia -g2 you should get a listing of outstanding allocations when you go OOM. Not very user-friendly, but might help.
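For example, reusing the reproduction command from the top of this issue:

NUM_FILTERS=64 julia -g2 --project scripts/profile/debug_oom.jl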

jonathan-laurent commented 3 years ago

Are you on latest master?

I just typed add CUDA#master in the REPL.

Also, the versioninfo() above is with 2.6.1.

I can have a look later, but if you run with julia -g2 you should get a listing of outstanding allocations when you go OOM.

I tried this on 2.6.1 and it had no effect. Note that the errors I am getting here are not explicitly OOM errors. In my past experience, such errors were often caused by OOM issues but something different may be at play here (especially since the problem happens with tiny networks).

jonathan-laurent commented 3 years ago

Update: Knet's MethodError may be a separate issue, as a friend of mine managed to replicate it on their computer but did not manage to replicate the CUDNN errors (they use a configuration similar to mine, except on Windows with the 461.40 driver).

Thus, I filed a separate issue.

maleadt commented 3 years ago

Trying with the latest CUDA, Flux and NNlib, I'm not seeing this specific issue. It did reproduce using the committed Manifest, though, so are you sure those master branches are up to date?

I tried this on 2.6.1 and it had no effect. Note that the errors I am getting here are not explicitly OOM errors. In my past experience, such errors were often caused by OOM issues but something different may be at play here (especially since the problem happens with tiny networks).

You can try the following:

diff --git a/lib/cudnn/error.jl b/lib/cudnn/error.jl
index 5981bb7c..a0501f5b 100644
--- a/lib/cudnn/error.jl
+++ b/lib/cudnn/error.jl
@@ -25,7 +25,10 @@ end

 macro check(ex)
     quote
-        res = @retry_reclaim isequal(CUDNN_STATUS_ALLOC_FAILED) $(esc(ex))
+        res = @retry_reclaim(err->isequal(err, CUDNN_STATUS_ALLOC_FAILED) ||
+                                  isequal(err, CUDNN_STATUS_INTERNAL_ERROR) ||
+                                  isequal(err, CUDNN_STATUS_EXECUTION_FAILED),
+                             $(esc(ex)))
         if res != CUDNN_STATUS_SUCCESS
             throw_api_error(res)
         end

I don't think it is valid to put this in CUDA.jl, but it might be good for testing. In short, it'll retry API calls that fail with the statuses you've seen here, freeing more and more memory on each retry.

And just to elaborate, it's not that these small networks require a lot of memory, but we just don't get to reuse past allocations if the Julia GC doesn't kick in (which it doesn't, because there's no CPU memory pressure and the Julia GC doesn't know about GPU memory pressure). As a result, we inch closer to OOM, and libraries like CUDNN don't like that. But instead of failing with a nice CUDNN_STATUS_ALLOC_FAILED, it fails with various statuses which we don't properly detect as an OOM situation.

There are multiple solutions. We could retry API calls when they fail, as the above snippet does, but that may be invalid (did the call abort in the middle of a stateful computation?). Alternatively, we could ensure a certain amount of memory is always kept available; IIUC that's what other frameworks do. Finally, we could do away with caching memory altogether, which will be an option on CUDA 11.2+ (see https://github.com/JuliaGPU/CUDA.jl/pull/679). We don't do that yet because it breaks device_reset!, but maybe I should make it available as a hidden option.
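In the meantime, a user-side stopgap consistent with the explanation above is to periodically force a collection and return cached pool memory to the driver from inside the training loop. A rough sketch (train_step! and batches are placeholder names supplied by the caller, not AlphaZero.jl code):

using CUDA

# Rough workaround sketch: every `every` batches, collect unreferenced CuArrays
# and hand freed pool blocks back to CUDA so the pool does not creep toward OOM.
function train_with_reclaim!(train_step!, batches; every=50)
    for (i, batch) in enumerate(batches)
        train_step!(batch)
        if i % every == 0
            GC.gc(false)      # incremental collection of unreferenced CuArrays
            CUDA.reclaim()    # return freed pool memory to the driver
        end
    end
end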

jonathan-laurent commented 3 years ago

Trying with the latest CUDA, Flux and NNlib

Do you mean using the master versions of all these libraries? Because I believe I am already using the latest published releases.

And just to elaborate, it's not that these small networks require a lot of memory, but we just don't get to reuse past allocations if the Julia GC doesn't kick in

This is what I was thinking, but then how do you explain that not everyone gets bitten by this all the time? There is really nothing special about the resnet training example above, so I would expect many people to be running into the same issue.

maleadt commented 3 years ago

Do you mean using the master versions of all these libraries?

Yes, master branches.

how do you explain that not everyone gets bitten by this all the time

API calls that fail with an explicit allocation failure are caught and retried after collecting garbage. In your case, for whatever reason, you're mostly triggering other failures that are related to but not classified as an OOM. Not sure why that is, though.
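Schematically, that handling amounts to something like the following (a simplified sketch of the idea only; the actual logic is CUDA.jl's @retry_reclaim macro and differs in detail):

using CUDA

# Simplified sketch: retry an API call a few times, collecting garbage and
# reclaiming pool memory between attempts, when it reports an allocation failure.
function retry_reclaim(f, is_alloc_failure; attempts=3)
    res = f()
    for i in 1:attempts
        is_alloc_failure(res) || return res   # success, or a failure we don't handle
        GC.gc(i == attempts)                  # escalate to a full collection on the last retry
        CUDA.reclaim()                        # return freed pool memory to the driver
        res = f()                             # retry the API call
    end
    return res                                # caller turns a still-failing status into an error
end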

maleadt commented 3 years ago

I think this is the logic TensorFlow uses:
https://github.com/tensorflow/tensorflow/blob/8a998b32138ade2da1c14c7e52279133d3fddf55/tensorflow/core/common_runtime/gpu/gpu_device.cc#L910-L950
Not sure how to reuse that, though, since we don't allocate memory upfront. And only ensuring that the memory pool doesn't exceed TOTAL_MEMORY - RESERVED_MEMORY doesn't work on a system with multiple users. But maybe that's already better than what we do now...
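In code, the reservation idea would amount to something like this (an illustrative sketch only, not an existing CUDA.jl option; the 500 MiB reserve and the pool-growth check are made up for the example):

using CUDA

# Illustrative sketch: keep a fixed reserve free so libraries like CUDNN can
# still allocate workspaces, and only let the pool grow while below that bound.
const RESERVED_BYTES = 500 * 2^20                       # arbitrary headroom (500 MiB)
pool_limit() = CUDA.total_memory() - RESERVED_BYTES     # bound the pool should respect

can_grow_pool(current_pool_bytes, request_bytes) =
    current_pool_bytes + request_bytes <= pool_limit()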

maleadt commented 3 years ago

https://github.com/JuliaGPU/CUDA.jl/pull/718

jonathan-laurent commented 3 years ago

I tried again with the #master versions of CUDA, Flux, and NNlib. I now get the following error:

ERROR: LoadError: CUBLASError: an absent device architectural feature is required (code 8, CUBLAS_STATUS_ARCH_MISMATCH)
Stacktrace:
  [1] throw_api_error(res::CUDA.CUBLAS.cublasStatus_t)
    @ CUDA.CUBLAS ~/.julia/packages/CUDA/kU5rX/lib/cublas/error.jl:47
  [2] macro expansion
    @ ~/.julia/packages/CUDA/kU5rX/lib/cublas/error.jl:58 [inlined]
  [3] cublasGemmEx(handle::Ptr{Nothing}, transa::Char, transb::Char, m::Int64, n::Int64, k::Int64, alpha::Base.RefValue{Float32}, A::CUDA.CuArray{Float32, 2}, Atype::Type, lda::Int64, B::CUDA.CuArray{Float32, 2}, Btype::Type, ldb::Int64, beta::Base.RefValue{Float32}, C::CUDA.CuArray{Float32, 2}, Ctype::Type, ldc::Int64, computeType::CUDA.CUBLAS.cublasComputeType_t, algo::CUDA.CUBLAS.cublasGemmAlgo_t)
    @ CUDA.CUBLAS ~/.julia/packages/CUDA/kU5rX/lib/utils/call.jl:26
  [4] gemmEx!(transA::Char, transB::Char, alpha::Number, A::Union{CUDA.CuVecOrMat{T}, CUDA.DenseCuVecOrMat{T}} where T, B::Union{CUDA.CuVecOrMat{T}, CUDA.DenseCuVecOrMat{T}} where T, beta::Number, C::Union{CUDA.CuVecOrMat{T}, CUDA.DenseCuVecOrMat{T}} where T; algo::CUDA.CUBLAS.cublasGemmAlgo_t)
    @ CUDA.CUBLAS ~/.julia/packages/CUDA/kU5rX/lib/cublas/wrappers.jl:837
  [5] gemmEx!
    @ ~/.julia/packages/CUDA/kU5rX/lib/cublas/wrappers.jl:819 [inlined]
  [6] gemm_dispatch!(C::CUDA.CuArray{Float32, 2}, A::CUDA.CuArray{Float32, 2}, B::CUDA.CuArray{Float32, 2}, alpha::Bool, beta::Bool)
    @ CUDA.CUBLAS ~/.julia/packages/CUDA/kU5rX/lib/cublas/linalg.jl:222
  [7] mul!
    @ ~/.julia/packages/CUDA/kU5rX/lib/cublas/linalg.jl:233 [inlined]
  [8] mul!
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:275 [inlined]
  [9] *
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:160 [inlined]
 [10] (::Flux.Dense{typeof(NNlib.relu), CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 1}})(x::CUDA.CuArray{Float32, 2})
    @ Flux ~/.julia/packages/Flux/fUe2N/src/layers/basic.jl:126
 [11] applychain
    @ ~/.julia/packages/Flux/fUe2N/src/layers/basic.jl:36 [inlined]
 [12] (::Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1}, Float32, CUDA.CuArray{Float32, 1}}, typeof(Flux.flatten), Flux.Dense{typeof(NNlib.relu), CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 1}}, Flux.Dense{typeof(tanh), CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 1}}}})(x::CUDA.CuArray{Float32, 4})
    @ Flux ~/.julia/packages/Flux/fUe2N/src/layers/basic.jl:38
 [13] forward(nn::ResNet, state::CUDA.CuArray{Float32, 4})
    @ AlphaZero.FluxLib ~/AlphaZero.jl/src/networks/flux.jl:161
 [14] forward_normalized(nn::ResNet, state::CUDA.CuArray{Float32, 4}, actions_mask::CUDA.CuArray{Float32, 2})
    @ AlphaZero.Network ~/AlphaZero.jl/src/networks/network.jl:260
 [15] losses(nn::ResNet, params::LearningParams, Wmean::Float32, Hp::Float32, ::Tuple{CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 2}})
    @ AlphaZero ~/AlphaZero.jl/src/learning.jl:70
 [16] learning_status(tr::AlphaZero.Trainer, samples::Tuple{CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 2}})
    @ AlphaZero ~/AlphaZero.jl/src/learning.jl:151
 [17] (::AlphaZero.var"#127#130"{AlphaZero.Trainer})(batch::Tuple{CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 2}})
    @ AlphaZero ./none:0
 [18] iterate
    @ ./generator.jl:47 [inlined]
 [19] collect(itr::Base.Generator{Base.Generator{Vector{Tuple{Matrix{Float32}, Array{Float32, 4}, Matrix{Float32}, Matrix{Float32}, Matrix{Float32}}}, AlphaZero.Util.var"#9#11"{AlphaZero.var"#126#129"{AlphaZero.Trainer}}}, AlphaZero.var"#127#130"{AlphaZero.Trainer}})
    @ Base ./array.jl:678
 [20] learning_status(tr::AlphaZero.Trainer)
    @ AlphaZero ~/AlphaZero.jl/src/learning.jl:166
 [21] learning_step!(env::Env{AlphaZero.Examples.ConnectFour.GameSpec, ResNet, NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}}, handler::Session{Env{AlphaZero.Examples.ConnectFour.GameSpec, ResNet, NamedTuple{(:board, :curplayer), Tuple{StaticArrays.SMatrix{7, 6, UInt8, 42}, UInt8}}}})
    @ AlphaZero ~/AlphaZero.jl/src/training.jl:208
 [22] top-level scope
    @ ~/AlphaZero.jl/scripts/profile/debug_oom.jl:39
in expression starting at /home/jonathan/AlphaZero.jl/scripts/profile/debug_oom.jl:39

CUDA.versioninfo():

CUDA toolkit 11.2.0, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.32.3

Libraries: 
- CUBLAS: 11.2.1
- CURAND: 10.2.3
- CUFFT: 10.4.0
- CUSOLVER: 11.0.2
- CUSPARSE: 11.3.1
- CUPTI: 14.0.0
- NVML: 11.0.0+460.32.3
- CUDNN: 8.10.0 (for CUDA 11.2.0)
- CUTENSOR: 1.2.2 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.6.0-rc1
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Environment:
- JULIA_CUDA_MEMORY_POOL: split

1 device:
  0: GeForce RTX 2070 (sm_75, 7.473 GiB / 7.793 GiB available)
maleadt commented 3 years ago

Add some details about that to https://github.com/JuliaGPU/CUDA.jl/issues/609; it's probably a bug in CUDA.

jonathan-laurent commented 3 years ago

Add some details about that to #609; it's probably a bug in CUDA.

I ran the CUBLAS tests on my configuration:

JULIA_DEBUG=CUBLAS julia --project -e 'using Pkg; Pkg.test("CUDA"; test_args=`cublas`)'

and got the following error:

     Testing Running tests...
I! cuBLAS (v11.2) function cublasStatus_t cublasCreate_v2(cublasContext**) called:
i!  handle: type=cublasHandle_t; val=POINTER (IN HEX:0x0x7ffe242b6310)
i! Time: 2021-02-17T12:23:12 elapsed from start 0.016667 minutes or 1.000000 seconds
i!Process=77213; Thread=140676450244416; GPU=0; Handle=POINTER (IN HEX:0x(nil))
i! COMPILED WITH: GNU GCC/G++ / 5.3.1 20160406 (Red Hat 5.3.1-6)
I! cuBLAS (v11.2) function cublasStatus_t cublasSetStream_v2(cublasHandle_t, cudaStream_t) called:
i!  handle: type=cublasHandle_t; val=POINTER (IN HEX:0x0xc989360)
i!  streamId: type=SOME TYPE; val=POINTER (IN HEX:0x0x53dfab0)
i! Time: 2021-02-17T12:23:13 elapsed from start 0.033333 minutes or 2.000000 seconds
i!Process=77213; Thread=140676450244416; GPU=0; Handle=POINTER (IN HEX:0x0xc989360); StreamId=POINTER (IN HEX:0x(nil)) (defaultStream); MathMode=CUBLAS_DEFAULT_MATH
i! COMPILED WITH: GNU GCC/G++ / 5.3.1 20160406 (Red Hat 5.3.1-6)
I! cuBLAS (v11.2) function cublasStatus_t cublasSetMathMode(cublasHandle_t, cublasMath_t) called:
i!  handle: type=cublasHandle_t; val=POINTER (IN HEX:0x0xc989360)
i!  mode: type=cublasMath_t; val=CUBLAS_DEFAULT_MATH | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION(16)
i! Time: 2021-02-17T12:23:13 elapsed from start 0.033333 minutes or 2.000000 seconds
i!Process=77213; Thread=140676450244416; GPU=0; Handle=POINTER (IN HEX:0x0xc989360); StreamId=POINTER (IN HEX:0x0x53dfab0); MathMode=CUBLAS_DEFAULT_MATH
i! COMPILED WITH: GNU GCC/G++ / 5.3.1 20160406 (Red Hat 5.3.1-6)
┌ Info: System information:
│ CUDA toolkit 11.2.0, artifact installation
│ CUDA driver 11.2.0
│ NVIDIA driver 460.32.3
│ 
│ Libraries: 
│ - CUBLAS: 11.2.1
│ - CURAND: 10.2.3
│ - CUFFT: 10.4.0
│ - CUSOLVER: 11.0.2
│ - CUSPARSE: 11.3.1
│ - CUPTI: 14.0.0
│ - NVML: 11.0.0+460.32.3
│ - CUDNN: 8.10.0 (for CUDA 11.2.0)
│ - CUTENSOR: 1.2.2 (for CUDA 11.1.0)
│ 
│ Toolchain:
│ - Julia: 1.6.0-rc1
│ - LLVM: 11.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
│ - Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
│ 
│ Environment:
│ - JULIA_CUDA_MEMORY_POOL: split
│ 
│ 1 device:
└   0: GeForce RTX 2070 (sm_75, 7.469 GiB / 7.793 GiB available)
[ Info: Testing using 1 device(s): 1. GeForce RTX 2070 (UUID f937dba2-c8c0-a9a2-be35-ab00ac6ae658)
               |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test  (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
      From worker 2:    /home/jonathan/Software/julia-1.6.0-rc1/bin/julia: symbol lookup error: /home/jonathan/.julia/artifacts/e99dab5d7bdf5b60da265bae5e949189d907a56b/lib/libcublas.so.11: undefined symbol: cublasLtSSSMatmulAlgoGetHeuristic, version libcublasLt.so.11
Worker 2 terminated.
cublas     (2) |         failed at 2021-02-17T12:23:49.261
cublas: Error During Test at none:1
  Test threw exception
  Expression: cublas
  ProcessExitedException(2)

Test Summary: | Error  Total
  Overall     |     1      1
    cublas    |     1      1
    FAILURE

Error in testset cublas:
Error During Test at none:1
  Test threw exception
  Expression: cublas
  ProcessExitedException(2)
ERROR: LoadError: Test run finished with errors
in expression starting at /home/jonathan/.julia/packages/CUDA/kU5rX/test/runtests.jl:487
ERROR: Package CUDA errored during testing
Stacktrace:
 [1] pkgerror(msg::String)
   @ Pkg.Types /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/Types.jl:55
 [2] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, julia_args::Cmd, test_args::Cmd, test_fn::Nothing)
   @ Pkg.Operations /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/Operations.jl:1687
 [3] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, test_fn::Nothing, julia_args::Cmd, test_args::Cmd, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Pkg.API /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:336
 [4] #test#62
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:73 [inlined]
 [5] #test#61
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:70 [inlined]
 [6] test(pkg::String; kwargs::Base.Iterators.Pairs{Symbol, Cmd, Tuple{Symbol}, NamedTuple{(:test_args,), Tuple{Cmd}}})
   @ Pkg.API /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Pkg/src/API.jl:69
 [7] top-level scope
   @ none:1
maleadt commented 3 years ago

Can you try with the latest master? CUDA has been updated to v11.2, and I can't find anything about that symbol (I haven't seen that issue before).

jonathan-laurent commented 3 years ago

I got the error above using #master. I just updated my environment again and got the same result.

maleadt commented 3 years ago

You'll have to help reduce this then, as I can't reproduce it or find any references that would help me.

jonathan-laurent commented 3 years ago

Are you referring to this error that happens when training a resnet on #644?

I also reported a failure when running the CUBLAS tests, which might be an easier place to start. Is there anything else I can do to help diagnose why this test fails?

maleadt commented 3 years ago

Are you referring to this error that happens when training a resnet on #644?

Yes.

I also reported a failure when running the CUBLAS tests, which might be an easier place to start. Is there anything else I can do to help diagnose why this test fails?

That's a different issue. And both of these are unrelated to the original OOM... So please open different issues with relevant reproducers, as it's not clear anymore which Julia version/CUDA.jl commit/debug settings you used for each of these, making it hard to help.

jonathan-laurent commented 3 years ago

Sorry for this! I had interpreted the following message in the thread as an invitation to run the CUBLAS tests myself and report on the results:

Add some details about that to #609; it's probably a bug in CUDA.

I am going to run additional experiments and open separate issues if needed. Thanks for all your help and efforts!

jonathan-laurent commented 3 years ago

I no longer have problems on the tb/dlopen_cublaslt branch, so I am closing this for now.

maleadt commented 3 years ago

So the OOM is fixed on master? The branch you mention only fixes the CUBLAS errors, not the general OOM. Or did you mean that you tested the stream-allocator branch? That branch does change the allocator and could affect whether OOMs happen.

jonathan-laurent commented 3 years ago

The bug happens on 9bd25c215881a01e3adc91234eb2bda6c24be59f (the commit on master right before tb/dlopen_cublaslt was merged) but not on 99d8c4c69cd04a6281f429d67107e6b190c7c210 (after the merge). Therefore, my guess is that this bug wasn't actually an OOM bug (the error message did not say OOM explicitly but I had somewhat got used to receiving internal CUDNN errors in place of OOM errors).

I haven't tried the stream-allocator branch yet. I would be happy to do it now if you apply the tb/dlopen_cublaslt fix to this branch too or merge it on master.

maleadt commented 3 years ago

No, these are definitely OOMs and the cublasLt bug was unrelated. That's what I said in https://github.com/JuliaGPU/CUDA.jl/issues/714#issuecomment-782094035.

jonathan-laurent commented 3 years ago

OK, so I probably got confused by the fact that a CUDNN error with status 8 can apparently mean many different things.

I tried to replicate the OOM bug on many different versions.

In summary, the OOM bug no longer happens on master, but this does not seem to be due to https://github.com/JuliaGPU/CUDA.jl/pull/718 or to the stream allocator (which I haven't tested yet).

Is there a particular point in history at which you want me to try and replicate the OOM bug?

maleadt commented 3 years ago

No, that's OK. No point in figuring out when it got fixed, as long as you'll file new issues if and when it comes back :-) I'd be interested in performance changes with the stream-ordered allocator though.