SciML / DiffEqGPU.jl

GPU-acceleration routines for DifferentialEquations.jl and the broader SciML scientific machine learning ecosystem
https://docs.sciml.ai/DiffEqGPU/stable/
MIT License

EnsembleGPUArray performance vs EnsembleSerial #147

Closed RXGottlieb closed 1 year ago

RXGottlieb commented 2 years ago

Hi! Using the Lorenz example in the README, EnsembleGPUArray seems to be running quite a bit slower than all other methods, including EnsembleSerial. On my machine I get:

using DiffEqGPU, OrdinaryDiffEq
function lorenz(du,u,p,t)
    du[1] = p[1]*(u[2]-u[1])
    du[2] = u[1]*(p[2]-u[3]) - u[2]
    du[3] = u[1]*u[2] - p[3]*u[3]
end

u0 = Float32[1.0;0.0;0.0]
tspan = (0.0f0,100.0f0)
p = [10.0f0,28.0f0,8/3f0]
prob = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,p=rand(Float32,3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy=false)

@time sol = solve(monteprob,Tsit5(),EnsembleSerial(),trajectories=10_000,saveat=1.0f0)
# 8.197300 seconds (21.42 M allocations: 1.551 GiB, 5.59% gc time)

@time sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10_000,saveat=1.0f0)
# 45.863792 seconds (118.46 M allocations: 7.534 GiB, 4.07% gc time, 8.85% compilation time)

Currently on DiffEqGPU v1.16.0 and OrdinaryDiffEq v6.6.6. The GPU is an NVIDIA Quadro T2000, with CUDA 11.6.

pcjentsch commented 2 years ago

It looks like you might be hitting a lot of compilation time? I am not sure if @time counts GPUCompiler compilation time in its reporting of "8.85% compilation time".

Exchanging @time for @btime, I get:

@btime sol = solve(monteprob,Tsit5(),EnsembleSerial(),trajectories=10_000,saveat=1.0f0)
  688.809 ms (1524808 allocations: 149.89 MiB)

@btime sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10_000,saveat=1.0f0)
  434.094 ms (1304064 allocations: 880.95 MiB)

My GPU is a 2060.

I am not sure why @time is used in the README.
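For reference, @btime comes from BenchmarkTools.jl, which runs the expression repeatedly and so excludes one-time compilation from the reported minimum. A warm-up run before @time is another way to get a compilation-free measurement; a minimal sketch, reusing monteprob from the snippet above:

```julia
using BenchmarkTools  # provides @btime; times repeated runs, so compilation is amortized away

# Alternative without BenchmarkTools: call solve once as a warm-up so all
# compilation (including GPUCompiler kernel compilation) happens first,
# then time the second, fully compiled call.
solve(monteprob, Tsit5(), EnsembleGPUArray(), trajectories = 10_000, saveat = 1.0f0)
@time sol = solve(monteprob, Tsit5(), EnsembleGPUArray(), trajectories = 10_000, saveat = 1.0f0)
```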

RXGottlieb commented 2 years ago

Looks like that accounts for a lot of it, though on my machine EnsembleGPUArray is still slower:

@btime sol = solve(monteprob,Tsit5(),EnsembleSerial(),trajectories=10_000,saveat=1.0f0)
    1.086 s (2544793 allocations: 201.55 MiB)
@btime sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10_000,saveat=1.0f0)
    1.449 s (1559235 allocations: 895.31 MiB)

For another comparison, I tried running the example multi-GPU script (with CUDA replacing CuArrays) on a machine with two GV100s and got the same kind of performance difference:

using DiffEqGPU, CUDA, OrdinaryDiffEq, Test, BenchmarkTools
CUDA.device!(0)

using Distributed
addprocs(2)
@everywhere using DiffEqGPU, CUDA, OrdinaryDiffEq, Test, Random

@everywhere begin
    function lorenz_distributed(du,u,p,t)
        du[1] = p[1]*(u[2]-u[1])
        du[2] = u[1]*(p[2]-u[3]) - u[2]
        du[3] = u[1]*u[2] - p[3]*u[3]
    end
    CUDA.allowscalar(false)
    u0 = Float32[1.0;0.0;0.0]
    tspan = (0.0f0,100.0f0)
    p = [10.0f0,28.0f0,8/3f0]
    Random.seed!(1)
    pre_p_distributed = [rand(Float32,3) for i in 1:100_000]
    function prob_func_distributed(prob,i,repeat)
        remake(prob,p=pre_p_distributed[i].*p)
    end
end

@sync begin
    @spawnat 2 begin
        CUDA.allowscalar(false)
        CUDA.device!(0)
    end
    @spawnat 3 begin
        CUDA.allowscalar(false)
        CUDA.device!(1)
    end
end

CUDA.allowscalar(false)
prob = ODEProblem(lorenz_distributed,u0,tspan,p)
monteprob = EnsembleProblem(prob, prob_func = prob_func_distributed)

@btime sol = solve(monteprob,Tsit5(),EnsembleSerial(),trajectories=100_000,batch_size=50_000,saveat=1.0f0)
#   14.605 s (26457532 allocations: 2.08 GiB)

@btime sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=100_000,batch_size=50_000,saveat=1.0f0)
#   104.737 s (189837890 allocations: 38.78 GiB)

ChrisRackauckas commented 2 years ago

It looks like you might be hitting a lot of compilation time? I am not sure if @time counts GPUCompiler compilation time in its reporting of "8.85% compilation time".

It doesn't.

But note that the current setup isn't great, so we're building a new one that is better for non-stiff ODEs.

RXGottlieb commented 2 years ago

Got it, and thanks for the responses!

Also, I noticed that both EnsembleSerial and EnsembleGPUArray cause my GPU to jump to ~13% utilization. Is that normal? I would expect the utilization to be much higher for EnsembleGPUArray.

ChrisRackauckas commented 2 years ago

EnsembleSerial doesn't use the GPU unless you're using GPU arrays. Utilization depends on how well-packed the kernels are: with the current version you want something like 100,000 trajectories and a big enough ODE to fill the GPU. That's why we're building a different one that's a lot less limited.
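An illustrative sketch of that suggestion, reusing monteprob from the first snippet (the 100,000-trajectory figure is the one mentioned above; actual utilization will still depend on the GPU and the size of the ODE system):

```julia
# With only 3 states per trajectory, 10_000 trajectories leaves most of a
# modern GPU idle. Launching more trajectories packs the kernels more densely:
@time sol = solve(monteprob, Tsit5(), EnsembleGPUArray(),
                  trajectories = 100_000, saveat = 1.0f0)
```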

ChrisRackauckas commented 1 year ago

EnsembleGPUKernel is a lot faster, so that's the one to make use of.
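For later readers, a minimal EnsembleGPUKernel sketch of the same Lorenz ensemble, assuming the v2-era DiffEqGPU API in which EnsembleGPUKernel takes a backend argument, the ODE is written out-of-place on StaticArrays, and the kernel-compatible GPUTsit5 solver is used:

```julia
using DiffEqGPU, OrdinaryDiffEq, StaticArrays, CUDA

# Out-of-place Lorenz on SVectors, as required by the kernel-generating path.
function lorenz(u, p, t)
    σ, ρ, β = p
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector Float32[1.0, 0.0, 0.0]
tspan = (0.0f0, 100.0f0)
p = @SVector Float32[10.0, 28.0, 8/3]
prob = ODEProblem{false}(lorenz, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = SVector{3}(rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

# The entire solve runs as a single GPU kernel per batch, avoiding the
# per-step kernel launches that limit EnsembleGPUArray.
sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend()),
            trajectories = 10_000, saveat = 1.0f0)
```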