Minimal working EnsembleGPUArray() example

mantasjk commented 3 years ago

Hello, can not get the EnsembleGPUArray() to work on any of my machines. Simple solve() function on CUDA array works fine, however, when asked to do ensemble on gpu, it gives error "unsupported call to the Julia runtime (call to jl_f_tuple)".

Code:

using DifferentialEquations, CUDA, LinearAlgebra, DiffEqGPU
u0 = cu(rand(1000))
A  = cu(randn(1000,1000))
function f(du,u,p,t) 
    mul!(du,A,u)
    return nothing
end
prob = ODEProblem(f,u0,(0.0f0,1.0f0))
sol = solve(prob, Tsit5()) # this works
monteprob = EnsembleProblem(prob, safetycopy=false)
sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10000,saveat=1.0f0)

Output:

RROR: LoadError: InvalidIRError: compiling kernel gpu_gpu_kernel(Cassette.Context{nametype(CUDACtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_gpu_kernel), typeof(f), CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{SciMLBase.NullParameters, 1}, Float32) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_f_tuple)
...
Stacktrace:
  [1] ntuple(::Adapt.var"
    @ ntuple.jl:19
  [2] overdub
    @ ntuple.jl:19
  [3] overdub
    @ ~/.julia/packages/Adapt/orWD2/src/base.jl:16
  [4] adapt(::KernelAbstractions.ConstAdaptor, ::typeof(f))
    @ ~/.julia/packages/Adapt/orWD2/src/Adapt.jl:40
  [5] overdub
    @ ~/.julia/packages/Adapt/orWD2/src/Adapt.jl:40
  [6] constify(::typeof(f))
    @ ~/.julia/packages/KernelAbstractions/fGTLM/src/KernelAbstractions.jl:313
  [7] overdub
    @ ~/.julia/packages/KernelAbstractions/fGTLM/src/KernelAbstractions.jl:313
  [8] gpu_gpu_kernel(::typeof(f), ::CuDeviceMatrix{Float32, 1}, ::CuDeviceMatrix{Float32, 1}, ::CuDeviceMatrix{SciMLBase.NullParameters, 1}, ::Float32)
    @ none:0
  [9] overdub
    @ none:0
 [10] overdub
    @ ~/.julia/packages/Cassette/jxIEh/src/overdub.jl:0
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(Cassette.overdub), Tuple{Cassette.Context{nametype(CUDACtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_gpu_kernel), typeof(f), CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{SciMLBase.NullParameters, 1}, Float32}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XwWPj/src/validation.jl:123
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/XwWPj/src/driver.jl:288 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/4QAIk/src/TimerOutput.jl:206 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/XwWPj/src/driver.jl:286 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module, kernel::LLVM.Function; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XwWPj/src/utils.jl:62
  [6] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/qEV3Y/src/compiler/execution.jl:306
  [7] check_cache
    @ ~/.julia/packages/GPUCompiler/XwWPj/src/cache.jl:44 [inlined]
  [8] cached_compilation
    @ ./none:0 [inlined]
  [9] cufunction(f::typeof(Cassette.overdub), tt::Type{Tuple{Cassette.Context{nametype(CUDACtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_gpu_kernel), typeof(f), CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{SciMLBase.NullParameters, 1}, Float32}}; name::String, kwargs::Base.Iterators.Pairs{Symbol, Nothing, Tuple{Symbol}, NamedTuple{(:maxthreads,), Tuple{Nothing}}})
    @ CUDA ~/.julia/packages/CUDA/qEV3Y/src/compiler/execution.jl:294
 [10] macro expansion
    @ ~/.julia/packages/CUDA/qEV3Y/src/compiler/execution.jl:102 [inlined]
 [11] (::KernelAbstractions.Kernel{KernelAbstractions.CUDADevice, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(DiffEqGPU.gpu_gpu_kernel)})(::Function, ::Vararg{Any, N} where N; ndrange::Int64, dependencies::KernelAbstractions.CudaEvent, workgroupsize::Int64, progress::Function)
    @ KernelAbstractions ~/.julia/packages/KernelAbstractions/fGTLM/src/backends/cuda.jl:187
 [12] (::DiffEqGPU.var"#55#59"{typeof(f)})(du::CuArray{Float32, 2}, u::CuArray{Float32, 2}, p::CuArray{SciMLBase.NullParameters, 2}, t::Float32)
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/eOLep/src/DiffEqGPU.jl:356
 [13] ODEFunction
    @ ~/.julia/packages/SciMLBase/zGIUN/src/scimlfunctions.jl:334 [inlined]
 [14] initialize!(integrator::OrdinaryDiffEq.ODEIntegrator{Tsit5, true, CuArray{Float32, 2}, Nothing, Float32, CuArray{SciMLBase.NullParameters, 2}, Float32, Float32, Float32, Vector{CuArray{Float32, 2}}, ODESolution{Float32, 3, Vector{CuArray{Float32, 2}}, Nothing, Nothing, Vector{Float32}, Vector{Vector{CuArray{Float32, 2}}}, ODEProblem{CuArray{Float32, 2}, Tuple{Float32, Float32}, true, CuArray{SciMLBase.NullParameters, 2}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(f)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, Tsit5, OrdinaryDiffEq.InterpolationData{ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(f)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Vector{CuArray{Float32, 2}}, Vector{Float32}, Vector{Vector{CuArray{Float32, 2}}}, OrdinaryDiffEq.Tsit5Cache{CuArray{Float32, 2}, CuArray{Float32, 2}, CuArray{Float32, 2}, OrdinaryDiffEq.Tsit5ConstantCache{Float32, Float32}}}, DiffEqBase.DEStats}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(f)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, OrdinaryDiffEq.Tsit5Cache{CuArray{Float32, 2}, CuArray{Float32, 2}, CuArray{Float32, 2}, OrdinaryDiffEq.Tsit5ConstantCache{Float32, Float32}}, OrdinaryDiffEq.DEOptions{Float32, Float32, Float32, Float32, typeof(DiffEqGPU.diffeqgpunorm), typeof(opnorm), Nothing, CallbackSet{Tuple{}, Tuple{}}, typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), DiffEqGPU.var"#10#16", DataStructures.BinaryMinHeap{Float32}, DataStructures.BinaryMinHeap{Float32}, Nothing, Nothing, Int64, Tuple{}, Float32, Tuple{}}, CuArray{Float32, 2}, Float32, Nothing, OrdinaryDiffEq.DefaultInit}, cache::OrdinaryDiffEq.Tsit5Cache{CuArray{Float32, 2}, CuArray{Float32, 2}, CuArray{Float32, 2}, OrdinaryDiffEq.Tsit5ConstantCache{Float32, Float32}})
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/5egkj/src/perform_step/low_order_rk_perform_step.jl:623
 [15] __init(prob::ODEProblem{CuArray{Float32, 2}, Tuple{Float32, Float32}, true, CuArray{SciMLBase.NullParameters, 2}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(f)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, alg::Tsit5, timeseries_init::Tuple{}, ts_init::Tuple{}, ks_init::Tuple{}, recompile::Type{Val{true}}; saveat::Float32, tstops::Tuple{}, d_discontinuities::Tuple{}, save_idxs::Nothing, save_everystep::Bool, save_on::Bool, save_start::Bool, save_end::Nothing, callback::Nothing, dense::Bool, calck::Bool, dt::Float32, dtmin::Nothing, dtmax::Float32, force_dtmin::Bool, adaptive::Bool, gamma::Rational{Int64}, abstol::Nothing, reltol::Nothing, qmin::Rational{Int64}, qmax::Int64, qsteady_min::Int64, qsteady_max::Int64, qoldinit::Rational{Int64}, fullnormalize::Bool, failfactor::Int64, beta1::Nothing, beta2::Nothing, maxiters::Int64, internalnorm::typeof(DiffEqGPU.diffeqgpunorm), internalopnorm::typeof(opnorm), isoutofdomain::typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), unstable_check::DiffEqGPU.var"#10#16", verbose::Bool, timeseries_errors::Bool, dense_errors::Bool, advance_to_tstop::Bool, stop_at_next_tstop::Bool, initialize_save::Bool, progress::Bool, progress_steps::Int64, progress_name::String, progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), userdata::Nothing, allow_extrapolation::Bool, initialize_integrator::Bool, alias_u0::Bool, alias_du0::Bool, initializealg::OrdinaryDiffEq.DefaultInit, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/5egkj/src/solve.jl:433
 [16] #__solve#404
    @ ~/.julia/packages/OrdinaryDiffEq/5egkj/src/solve.jl:4 [inlined]
 [17] #solve_call#56
    @ ~/.julia/packages/DiffEqBase/rN9Px/src/solve.jl:61 [inlined]
 [18] solve_up(prob::ODEProblem{CuArray{Float32, 2}, Tuple{Float32, Float32}, true, CuArray{SciMLBase.NullParameters, 2}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(f)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, sensealg::Nothing, u0::CuArray{Float32, 2}, p::CuArray{SciMLBase.NullParameters, 2}, args::Tsit5; kwargs::Base.Iterators.Pairs{Symbol, Any, NTuple{5, Symbol}, NamedTuple{(:unstable_check, :saveat, :callback, :merge_callbacks, :internalnorm), Tuple{DiffEqGPU.var"#10#16", Float32, Nothing, Bool, typeof(DiffEqGPU.diffeqgpunorm)}}})
    @ DiffEqBase ~/.julia/packages/DiffEqBase/rN9Px/src/solve.jl:82
 [19] #solve#57
    @ ~/.julia/packages/DiffEqBase/rN9Px/src/solve.jl:70 [inlined]
 [20] batch_solve_up(ensembleprob::EnsembleProblem{ODEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, SciMLBase.NullParameters, ODEFunction{true, typeof(f), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(SciMLBase.DEFAULT_PROB_FUNC), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, probs::Vector{ODEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, SciMLBase.NullParameters, ODEFunction{true, typeof(f), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}}, alg::Tsit5, ensemblealg::EnsembleGPUArray, I::UnitRange{Int64}, u0::Matrix{Float32}, p::Matrix{SciMLBase.NullParameters}; kwargs::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:unstable_check, :saveat), Tuple{DiffEqGPU.var"#10#16", Float32}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/eOLep/src/DiffEqGPU.jl:261
 [21] batch_solve(ensembleprob::EnsembleProblem{ODEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, SciMLBase.NullParameters, ODEFunction{true, typeof(f), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(SciMLBase.DEFAULT_PROB_FUNC), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::Tsit5, ensemblealg::EnsembleGPUArray, I::UnitRange{Int64}; kwargs::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:unstable_check, :saveat), Tuple{DiffEqGPU.var"#10#16", Float32}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/eOLep/src/DiffEqGPU.jl:226
 [22] macro expansion
    @ ./timing.jl:287 [inlined]
 [23] __solve(ensembleprob::EnsembleProblem{ODEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, SciMLBase.NullParameters, ODEFunction{true, typeof(f), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(SciMLBase.DEFAULT_PROB_FUNC), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::Tsit5, ensemblealg::EnsembleGPUArray; trajectories::Int64, batch_size::Int64, unstable_check::Function, kwargs::Base.Iterators.Pairs{Symbol, Float32, Tuple{Symbol}, NamedTuple{(:saveat,), Tuple{Float32}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/eOLep/src/DiffEqGPU.jl:143
 [24] #solve#59
    @ ~/.julia/packages/DiffEqBase/rN9Px/src/solve.jl:96 [inlined]
 [25] top-level scope
    @ ./timing.jl:210
in expression starting at /home/mantas/mantas/d2julia/gpu_test.jl:10

ChrisRackauckas commented 3 years ago

This example doesn't make any sense. If you have large ODEs you just use CuArrays, not EnsembleGPUArray. Did you see the README?

https://github.com/SciML/DiffEqGPU.jl#within-method-gpu-parallelism-with-direct-cuarray-usage

The native Julia libraries, including (but not limited to) OrdinaryDiffEq, StochasticDiffEq, and DelayDiffEq, are compatible with u0 being a CuArray. When this occurs, all array operations take place on the GPU, including any implicit solves. This is independent of the DiffEqGPU library. These speedup the solution of a differential equation which is sufficiently large or expensive. This does not require DiffEqGPU.jl.

Parameter-parallel GPU methods are provided for the case where a single solve is too cheap to benefit from within-method parallelism, but the solution of the same structure (same f) is required for very many different choices of u0 or p.

You would not solve that problem with EnsembleGPUArray. Let me know if that needs to be clarified more.

mantasjk commented 3 years ago

I was hoping to utilize EnsembleGPUArray() to compute ensemble over different initial conditions, but as ODE system is hard and difficult, then this is not the best approach. Thanks for the clarification.

ChrisRackauckas commented 3 years ago

Yes it's only for small equations, like 200 ODEs or less. It fills your GPU by running them all simultaneously. So if you have a big PDE with a matmul operator kind of thing, it would just overflow the GPU memory. PDE discretizations = solve one at a time, parallelizing its operations on the GPU. Small ODEs = solve many at the same time, parallelizing across solves on the GPU.

SciML / DiffEqGPU.jl

Minimal working EnsembleGPUArray() example #101