Closed. RXGottlieb closed this issue 1 year ago.
It looks like you might be hitting a lot of compilation time? I am not sure if `@time` counts GPUCompiler compilation time in its reporting of "8.85% compilation time". Exchanging `@time` for `@btime` I get:

```julia
@btime sol = solve(monteprob, Tsit5(), EnsembleSerial(), trajectories = 10_000, saveat = 1.0f0)
# 688.809 ms (1524808 allocations: 149.89 MiB)
@btime sol = solve(monteprob, Tsit5(), EnsembleGPUArray(), trajectories = 10_000, saveat = 1.0f0)
# 434.094 ms (1304064 allocations: 880.95 MiB)
```
My GPU is a 2060. I am not sure why `@time` is used in the README.
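For readers following along, the distinction matters because `@time` measures a single call, including any JIT compilation that call triggers, while BenchmarkTools' `@btime` re-runs the expression and reports the minimum time, so one-time compilation cost drops out. A minimal CPU-only sketch with a toy function `f` (not from this thread):

```julia
using BenchmarkTools

# Toy function for illustration: the first call to a Julia function
# triggers JIT compilation, which @time reports together with runtime.
f(x) = sum(abs2, x)

x = rand(Float32, 1_000)

@time f(x)    # first call: includes compilation
@time f(x)    # second call: compilation already cached

# @btime evaluates the expression many times and reports the minimum,
# so the one-time compilation cost is excluded from the figure it prints.
@btime f($x)
```

The `$x` interpolation is BenchmarkTools' way of avoiding global-variable overhead in the benchmarked expression.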
Looks like that accounts for a lot of it, though on my machine `EnsembleGPUArray` is still slower:

```julia
@btime sol = solve(monteprob, Tsit5(), EnsembleSerial(), trajectories = 10_000, saveat = 1.0f0)
# 1.086 s (2544793 allocations: 201.55 MiB)
@btime sol = solve(monteprob, Tsit5(), EnsembleGPUArray(), trajectories = 10_000, saveat = 1.0f0)
# 1.449 s (1559235 allocations: 895.31 MiB)
```
For another comparison, I tried running the example multi-GPU script (with CUDA replacing CuArrays) on a machine with two GV100's and got the same kind of performance difference:
```julia
using DiffEqGPU, CUDA, OrdinaryDiffEq, Test, BenchmarkTools
CUDA.device!(0)
using Distributed
addprocs(2)
@everywhere using DiffEqGPU, CUDA, OrdinaryDiffEq, Test, Random

@everywhere begin
    function lorenz_distributed(du, u, p, t)
        du[1] = p[1] * (u[2] - u[1])
        du[2] = u[1] * (p[2] - u[3]) - u[2]
        du[3] = u[1] * u[2] - p[3] * u[3]
    end
    CUDA.allowscalar(false)
    u0 = Float32[1.0; 0.0; 0.0]
    tspan = (0.0f0, 100.0f0)
    p = [10.0f0, 28.0f0, 8 / 3f0]
    Random.seed!(1)
    pre_p_distributed = [rand(Float32, 3) for i in 1:100_000]
    function prob_func_distributed(prob, i, repeat)
        remake(prob, p = pre_p_distributed[i] .* p)
    end
end

# Pin each worker process to its own GPU.
@sync begin
    @spawnat 2 begin
        CUDA.allowscalar(false)
        CUDA.device!(0)
    end
    @spawnat 3 begin
        CUDA.allowscalar(false)
        CUDA.device!(1)
    end
end

CUDA.allowscalar(false)
prob = ODEProblem(lorenz_distributed, u0, tspan, p)
monteprob = EnsembleProblem(prob, prob_func = prob_func_distributed)

@btime sol = solve(monteprob, Tsit5(), EnsembleSerial(), trajectories = 100_000, batch_size = 50_000, saveat = 1.0f0)
# 14.605 s (26457532 allocations: 2.08 GiB)
@btime sol = solve(monteprob, Tsit5(), EnsembleGPUArray(), trajectories = 100_000, batch_size = 50_000, saveat = 1.0f0)
# 104.737 s (189837890 allocations: 38.78 GiB)
```
> It looks like you might be hitting a lot of compilation time? I am not sure if `@time` counts GPUCompiler compilation time in its reporting of "8.85% compilation time".
It doesn't.
But note that the current setup isn't great, so we're building a new one that is better for non-stiff ODEs.
Got it, and thanks for the responses!
Also, I noticed that both `EnsembleSerial` and `EnsembleGPUArray` cause my GPU to jump to ~13% utilization. Is that normal? I would expect the utilization to be much higher for `EnsembleGPUArray`.
`EnsembleSerial` doesn't use the GPU unless you're using GPU arrays. The utilization depends on how well the kernels are packed. With the current version you want something like 100,000 trajectories and a big enough ODE to pack the kernels. That's why we're building a different one that's a lot less limited.
`EnsembleGPUKernel` is a lot faster, so that's the one to make use of.
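For anyone landing here later, a sketch of the `EnsembleGPUKernel` path, adapted from the DiffEqGPU documentation rather than from this thread: it requires an out-of-place, `SVector`-based problem definition, a CUDA-capable GPU, and the `GPUTsit5` solver from DiffEqGPU (names and signatures here are assumptions based on that documentation):

```julia
using DiffEqGPU, OrdinaryDiffEq, CUDA, StaticArrays

# Out-of-place Lorenz definition returning an SVector, as EnsembleGPUKernel
# compiles one fused GPU kernel per trajectory and needs stack-allocated state.
function lorenz(u, p, t)
    du1 = p[1] * (u[2] - u[1])
    du2 = u[1] * (p[2] - u[3]) - u[2]
    du3 = u[1] * u[2] - p[3] * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector Float32[1.0, 0.0, 0.0]
tspan = (0.0f0, 100.0f0)
p = @SVector Float32[10.0, 28.0, 8 / 3]

prob = ODEProblem{false}(lorenz, u0, tspan, p)
monteprob = EnsembleProblem(prob, safetycopy = false)

# Each trajectory runs entirely inside a GPU kernel, avoiding the
# per-step host/device round trips that EnsembleGPUArray incurs.
sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend()),
            trajectories = 10_000, saveat = 1.0f0)
```

This pattern is what the DiffEqGPU docs recommend for non-stiff ODEs; the trade-off is that the right-hand side must be GPU-compilable (no dynamic allocation, `Float32` throughout).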
Hi! Using the Lorenz example in the README, `EnsembleGPUArray` seems to be running quite a bit slower than all other methods, including `EnsembleSerial`. On my machine I get:

Currently on DiffEqGPU v1.16.0 and OrdinaryDiffEq v6.6.6. GPU is an NVIDIA Quadro T2000, CUDA version 11.6.