dsb-lab / CellBasedModels.jl

Julia package for multicellular modeling
MIT License

CUDA error: too many resources requested for launch #45

Closed. dirypan closed this issue 3 weeks ago.

dirypan commented 1 year ago

Hi, I am trying to simulate a model on a server GPU. Here is the simulation code:

model_2D3D = ABM(3,

    model = Dict(
        :DL => Float64,
        :rL => Float64,
    ),

    medium = Dict(
        :L => Float64,
    ),

    mediumODE = quote
        if @mediumInside()
            dt(L) = DL*(@∂2(1,L)+@∂2(2,L)+@∂2(3,L)) - rL*L
        elseif @mediumBorder(1,-1)
            L = L[2,i2_,i3_] # Neumann (zero-flux) boundaries: copy the adjacent interior value
        elseif @mediumBorder(1,1)
            L = L[NMedium[1]-1,i2_,i3_]
        elseif @mediumBorder(2,-1)
            L = L[i1_,2,i3_]
        elseif @mediumBorder(2,1)
            L = L[i1_,NMedium[2]-1,i3_]
        elseif @mediumBorder(3,-1)
            L = L[i1_,i2_,2]
        elseif @mediumBorder(3,1)
            L = L[i1_,i2_,NMedium[3]-1]
        end
    end,
    platform=GPU(),
); 

com = Community(
    model_2D3D,
    dt=0.01,
    simBox=[-3 3;-3 3;-0.3 3],
    NMedium=[10,5,6],
)
com.DL = 0.1
com.rL = 5.
evolve!(com,steps=400,saveEach=10,
        progressMessage=(com)->println("Step t: $(round(com.t,digits=2))"))

Here is the error:

CUDA error: too many resources requested for launch (code 701, ERROR_LAUNCH_OUT_OF_RESOURCES)

Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/p5OVK/lib/cudadrv/libcuda.jl:27
  [2] macro expansion
    @ ~/.julia/packages/CUDA/p5OVK/lib/cudadrv/libcuda.jl:35 [inlined]
  [3] cuLaunchKernel
    @ ~/.julia/packages/CUDA/p5OVK/lib/utils/call.jl:26 [inlined]
  [4] (::CUDA.var"#35#36"{Bool, Int64, CuStream, CuFunction, CuDim3, CuDim3})(kernelParams::Vector{Ptr{Nothing}})
    @ CUDA ~/.julia/packages/CUDA/p5OVK/lib/cudadrv/execution.jl:69
  [5] macro expansion
    @ ~/.julia/packages/CUDA/p5OVK/lib/cudadrv/execution.jl:33 [inlined]
  [6] macro expansion
    @ ./none:0 [inlined]
  [7] pack_arguments(::CUDA.var"#35#36"{Bool, Int64, CuStream, CuFunction, CuDim3, CuDim3}, ::CUDA.KernelState, ::CuDeviceArray{Float32, 4, 1}, ::CuDeviceArray{Float32, 4, 1}, ::Float64, ::Float64, ::Int64, …)
    @ CUDA ./none:0
  [8] #launch#34
    @ ~/.julia/packages/CUDA/p5OVK/lib/cudadrv/execution.jl:62 [inlined]
  [9] #40
    @ ~/.julia/packages/CUDA/p5OVK/lib/cudadrv/execution.jl:136 [inlined]
 [10] macro expansion
    @ ~/.julia/packages/CUDA/p5OVK/lib/cudadrv/execution.jl:95 [inlined]
 [11] macro expansion
    @ ./none:0 [inlined]
 [12] convert_arguments
    @ ./none:0 [inlined]
 [13] #cudacall#39
    @ ~/.julia/packages/CUDA/p5OVK/lib/cudadrv/execution.jl:135 [inlined]
 [14] cudacall
    @ ~/.julia/packages/CUDA/p5OVK/lib/cudadrv/execution.jl:134 [inlined]
 [15] macro expansion
    @ ~/.julia/packages/CUDA/p5OVK/src/compiler/execution.jl:212 [inlined]
 [16] macro expansion
    @ ./none:0 [inlined]
 [17] call(::CUDA.HostKernel{var"#kernel#150", Tuple{CuDeviceArray{Float32, 4, 1}, …}}, ::CuDeviceArray{Float32, 4, 1}, …; call_kwargs::Base.Pairs{…, NamedTuple{(:threads, :blocks), Tuple{Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}}}})
    @ CUDA ./none:0
 [18] (::CUDA.HostKernel{var"#kernel#150", Tuple{CuDeviceArray{Float32, 4, 1}, …}})(::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::Vararg{Any}; threads::Tuple{Int64, Int64, Int64}, blocks::Tuple{Int64, Int64, Int64}, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/p5OVK/src/compiler/execution.jl:333
 [19] macro expansion
    @ ~/.julia/packages/CUDA/p5OVK/src/compiler/execution.jl:106 [inlined]
 [20] macro expansion
    @ ~/.julia/packages/CUDA/p5OVK/src/utilities.jl:25 [inlined]
 [21] (::var"#148#149")(dVar_::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, var_::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, p_::Vector{Any}, t_::Float64)
    @ Main ~/.julia/packages/CellBasedModels/uJNkT/src/AgentStructure/functionDE.jl:101
 [22] ODEFunction
    @ ~/.julia/packages/SciMLBase/VdcHg/src/scimlfunctions.jl:2126 [inlined]
 [23] perform_step!(integrator::OrdinaryDiffEq.ODEIntegrator{CompositeAlgorithm{Tuple{Tsit5{…}, Rosenbrock23{…}}, OrdinaryDiffEq.AutoSwitchCache{…}}, true, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, …}, cache::OrdinaryDiffEq.Tsit5Cache{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, …}, repeat_step::Bool)
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/gjQVg/src/perform_step/low_order_rk_perform_step.jl:779
 [24] perform_step!
    @ ~/.julia/packages/OrdinaryDiffEq/gjQVg/src/perform_step/composite_perform_step.jl:71 [inlined]
 [25] perform_step!
    @ ~/.julia/packages/OrdinaryDiffEq/gjQVg/src/perform_step/composite_perform_step.jl:70 [inlined]
 [26] step!
    @ ~/.julia/packages/OrdinaryDiffEq/gjQVg/src/iterator_interface.jl:14 [inlined]
 [27] step!(integ::OrdinaryDiffEq.ODEIntegrator{CompositeAlgorithm{Tuple{Tsit5{…}, Rosenbrock23{…}}, OrdinaryDiffEq.AutoSwitchCache{…}}, true, CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, …}, dt::Float64, stop_at_tdt::Bool)
    @ SciMLBase ~/.julia/packages/SciMLBase/VdcHg/src/integrator_interface.jl:856
 [28] mediumStepDE!(community::Community)
    @ CellBasedModels ~/.julia/packages/CellBasedModels/uJNkT/src/CommunityStructure/step.jl:85
 [29] step!(community::Community)
    @ CellBasedModels ~/.julia/packages/CellBasedModels/uJNkT/src/CommunityStructure/step.jl:101
 [30] evolve!(community::Community; steps::Int64, saveEach::Int64, saveToFile::Bool, fileName::Nothing, overwrite::Bool, saveCurrentState::Bool, preallocateAgents::Int64, progressMessage::var"#159#160")
    @ CellBasedModels ~/.julia/packages/CellBasedModels/uJNkT/src/CommunityStructure/step.jl:144
 [31] top-level scope
    @ In[58]:1

If I change NMedium=[10,5,6] to NMedium=[10,4,6], the code runs without problems. (I didn't add any agents here because I found that changing the number of agents has no effect on this error.)

Here is the CUDA.versioninfo()

CUDA runtime 11.7, artifact installation
CUDA driver 12.2
NVIDIA driver 535.54.3

Libraries: 
- CUBLAS: 11.10.3
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.4.0
- CUSPARSE: 11.7.4
- CUPTI: 17.0.0
- NVML: 12.0.0+535.54.3

Toolchain:
- Julia: 1.9.3
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA A100-SXM4-80GB (sm_80, 75.559 GiB / 80.000 GiB available)

Interestingly, the code works well on my own computer, whose CUDA.versioninfo() is:

CUDA runtime 12.1, artifact installation
CUDA driver 12.2
NVIDIA driver 536.99.0

CUDA libraries: 
- CUBLAS: 12.2.4
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.5
- CUSPARSE: 12.1.0
- CUPTI: 18.0.0
- NVML: 12.0.0+535.98.1

Julia packages: 
- CUDA: 4.4.1
- CUDA_Driver_jll: 0.5.0+1
- CUDA_Runtime_jll: 0.6.0+0

Toolchain:
- Julia: 1.9.3
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce RTX 3090 Ti (sm_86, 5.121 GiB / 22.488 GiB available)

Can you help with this? What could be wrong with the server or with the code?

gatocor commented 1 year ago

Do you have any limit on the resources that can be requested from the server GPU when running a program?

Clearly it looks like you cannot use that much memory, although a medium of 10*5*6 = 300 cells shouldn't be too big.
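
For reference, the raw array at that size is tiny (a quick sanity check, assuming one Float64 per grid cell):

# Back-of-the-envelope: 10*5*6 = 300 cells of Float64 is only ~2.4 KiB
println(10 * 5 * 6 * sizeof(Float64))   # prints 2400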

gatocor commented 1 year ago

Can you print the memory usage after loading the code?

Instead of evolving directly, you could just upload the Community to the platform and see how many resources it takes.

# Code ...
# ...
# com = Community(....)
println(CUDA.memory_status()) # check memory; memory_status() prints its report and returns nothing
loadToPlatform!(com)
println(CUDA.memory_status()) # check memory again after loading

dirypan commented 1 year ago

Thank you for the help. Here are the results from the server when I print CUDA.memory_status():

Effective GPU memory usage: 0.53% (428.625 MiB/79.151 GiB)
Memory pool usage: 0 bytes (0 bytes reserved)
nothing
Effective GPU memory usage: 0.67% (544.625 MiB/79.151 GiB)
Memory pool usage: 1.446 MiB (32.000 MiB reserved)
nothing

And it raises exactly the same error if I then try evolve!. On my own computer, it prints out:

Effective GPU memory usage: 50.89% (11.444 GiB/22.488 GiB)
Memory pool usage: 2.803 GiB (10.188 GiB reserved)
nothing
Effective GPU memory usage: 50.89% (11.444 GiB/22.488 GiB)
Memory pool usage: 2.803 GiB (10.188 GiB reserved)
nothing

And it runs smoothly. All other CUDA info is the same as before. I guess it is not a memory constraint, since on the server CUDA.versioninfo() now prints

1 device:
  0: NVIDIA A100-SXM4-80GB (sm_80, 78.619 GiB / 80.000 GiB available)

So I should have a total of 80 GB of memory available on the server.

gatocor commented 1 year ago

The only thing that comes to mind now is that the server limits the memory you can send to it.

To make sure it is not anything related to the package, could you try creating CUDA arrays of different sizes and see whether the server accepts them?

println(CUDA.memory_status()) #Check memory
CUDA.zeros(100,6,5)
println(CUDA.memory_status()) #Check memory

and check at what allocation size the error is raised. I would advise doing this on the server in the REPL, so the second print has time to show the actual info: as you can see, the memory did not increase after uploading the Community object. That is weird; maybe the info simply had not been updated by the time the second memory-usage call executed.

dirypan commented 1 year ago

The reason memory usage seems not to have increased might just be that CUDA.zeros(100,6,5) is too small. This is the printout when I consecutively run three different sizes: the server accepts all of them, still way below the memory limit. I think the error might be related to some low-level CUDA detail in the step! function.

println(CUDA.memory_status()) #Check memory
CUDA.zeros(100,6,5)
println(CUDA.memory_status()) #Check memory

Effective GPU memory usage: 1.23% (996.625 MiB/79.151 GiB)
Memory pool usage: 11.719 KiB (32.000 MiB reserved)
nothing
Effective GPU memory usage: 1.23% (996.625 MiB/79.151 GiB)
Memory pool usage: 23.438 KiB (32.000 MiB reserved)
nothing

println(CUDA.memory_status()) #Check memory
CUDA.zeros(100,60,50)
println(CUDA.memory_status()) #Check memory

Effective GPU memory usage: 1.23% (996.625 MiB/79.151 GiB)
Memory pool usage: 23.438 KiB (32.000 MiB reserved)
nothing
Effective GPU memory usage: 1.23% (996.625 MiB/79.151 GiB)
Memory pool usage: 1.167 MiB (32.000 MiB reserved)
nothing

println(CUDA.memory_status()) #Check memory
CUDA.zeros(100,600,500)
println(CUDA.memory_status()) #Check memory

Effective GPU memory usage: 1.23% (996.625 MiB/79.151 GiB)
Memory pool usage: 1.167 MiB (32.000 MiB reserved)
nothing
Effective GPU memory usage: 1.35% (1.067 GiB/79.151 GiB)
Memory pool usage: 115.608 MiB (128.000 MiB reserved)
nothing

gatocor commented 1 year ago

Mmmm,

can you try changing the integrator of the medium to something very basic, like an Euler integrator? This will not solve the problem, since Euler is not a good integrator, but it may give clues about where the problem is.

It seems to crash during mediumStepDE!(community::Community), which for the default medium integrator calls the external DifferentialEquations.jl package.

It could be that the solver is asking for a lot of resources in order to integrate appropriately. If that is the case, we may have to look carefully at the declaration.
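
For reference, outside of the package wrapper a fixed-step Euler solve in plain DifferentialEquations.jl looks like this (a minimal sketch; the toy ODE, initial state, and dt are placeholders, not the internal medium problem):

using OrdinaryDiffEq

f!(du, u, p, t) = (du .= -p .* u)        # toy decay ODE standing in for the medium
prob = ODEProblem(f!, ones(10), (0.0, 1.0), 0.5)
sol = solve(prob, Euler(), dt = 0.01)    # Euler() requires an explicit fixed dt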

dirypan commented 1 year ago

I tried a few other solvers from DifferentialEquations.jl, including Euler, but they all raise the same error. I also found that the threshold for the error seems to be NMedium[1]*NMedium[2]*NMedium[3] > 256: NMedium=[3,3,28] runs (3*3*28 = 252), but NMedium=[3,3,29] does not (3*3*29 = 261).
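
Incidentally, 256 is a common threads-per-block launch size, so one hypothesis is that per-thread register usage times the block size exceeds the per-block register limit once the kernel grows. One way to check is to compile a kernel without launching it and query its resources (a sketch with a made-up kernel, not the one the package generates):

using CUDA

# Toy kernel standing in for the generated medium kernel.
function dummy!(A)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(A)
        @inbounds A[i] += 1.0f0
    end
    return nothing
end

A = CUDA.zeros(Float32, 256)
k = @cuda launch=false dummy!(A)
println("registers per thread: ", CUDA.registers(k))
println("max threads per block: ", CUDA.maxthreads(k))   # requesting more raises code 701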

dirypan commented 1 year ago

What's more, I found that if I decrease the dimension to 2, there is no problem even with much larger medium grid sizes (say 2000 by 2000), so the problem might be related to the declaration of the third dimension.

dirypan commented 1 year ago

Specifically, as long as I remove the diffusion term for any one dimension (say @∂2(3,L)) from the mediumODE, everything works perfectly. But if I keep all three terms together, it fails.

gatocor commented 1 year ago

Okay, so the problem seems to be the @∂2(3,L) operator. This operator is simply a wrapper for code that is substituted in to produce the diffusion term, so you can write your own discretization by accessing the positions in the matrix yourself. That is, intrinsically the operator expands to the following code:

@∂2(1,L) = (L[i1_+1,i2_,i3_]-2*L[i1_,i2_,i3_]+L[i1_-1,i2_,i3_])/(dx^2)

So maybe there is a bug in this part. Can you try writing your own discretization and check whether that solves the problem? If it does, there may be a problem in the operator declaration.

It is still weird that the problem appears only above a certain system size and not every time.

gatocor commented 1 year ago

Also, have you tried with the CPU instead? Does it work there?

dirypan commented 1 year ago

So I replaced the wrapper with the following code:

dt(L) = DL*( (L[i1_+1,i2_,i3_]-2*L[i1_,i2_,i3_]+L[i1_-1,i2_,i3_])/(dx^2) + 
                (L[i1_,i2_+1,i3_]-2*L[i1_,i2_,i3_]+L[i1_,i2_-1,i3_])/(dy^2)+ 
                (L[i1_,i2_,i3_+1]-2*L[i1_,i2_,i3_]+L[i1_,i2_,i3_-1])/(dz^2)) - rL*L

It raises the same problem above the mentioned threshold, so the wrapper is not the problem.

It works well on the CPU; it's just that the speed is too slow when I want to apply it to real simulations.

gatocor commented 1 year ago

What happens when you remove the diffusion in the z axis?

gatocor commented 1 year ago

And what happens if you apply square conditions (the same grid size in every dimension)?

dirypan commented 1 year ago

If I remove the diffusion on any one axis, it works.

The simBox extents in x, y, z do not affect the problem; I varied them. If Nx=Ny=Nz=N, it works for N=6 but not for N=7, I guess because 6^3 = 216 < 256 < 343 = 7^3.

gatocor commented 1 year ago

But do you still have the problem if you keep the diffusion in the x and y axes and only remove it in the z axis?

Or does it work as long as you remove the diffusion in any one specific axis?

dirypan commented 1 year ago

As long as I keep any two axes it works; it does not matter which one I remove. It only has a problem if I keep all three.

gatocor commented 1 year ago

Okay, this is very weird, because it is a bug that does not happen on every platform, nor on every GPU, nor in every situation.

I will need some days to try to figure it out. I will keep you posted.

dirypan commented 1 year ago

I have found a workaround: I declared all parameters and variables as Float32 and assigned them Float32 values. This eliminates the error. I have a few questions about this:

  1. If I declare the variables and parameters in the model definition as Float64 but run on the GPU platform, will they be converted to Float32 automatically for GPU acceleration? If not, how much performance can I gain from declaring everything in Float32 (is that just the FP32 vs FP64 throughput in the GPU specs)?
  2. Is the medium grid size limited by system/GPU memory?
gatocor commented 1 year ago

Most GPUs work with Float32 only and usually give an error otherwise. Only Quadro and other high-performance GPUs work with Float64.

  1. In theory, when loading to the GPU, it should have transformed everything to Float32 by default. Maybe during the conversion some array was not converted. I will look into this; it narrows down the bug.
  2. Yes, it is. When you bring the medium to the GPU/CPU it is an array of size K*Nx*Ny*Nz, where K is a positive integer that depends on the number of additional integration steps the integrator algorithm needs to save. So this will typically be the heaviest object of the simulation if you use a medium (see the sketch below).
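
As a rough illustration of that scaling (K = 7 and Float32 storage are assumptions here, just to show the arithmetic):

K, Nx, Ny, Nz = 7, 100, 100, 100            # K depends on the integrator algorithm
bytes = K * Nx * Ny * Nz * sizeof(Float32)
println(bytes / 2^20, " MiB")               # ≈ 26.7 MiB for a 100^3 grid
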
dirypan commented 1 year ago

I guess this is the problem, since the A100 can do Float64 and my 3090 Ti can't.

Only Quadro and other high-performance GPUs work with Float64.