Open roflmaostc opened 1 year ago
Your measurements look really weird. Are you sure that's what was actually run? Do it in the REPL, not in a Pluto notebook:
using DiffEqGPU, OrdinaryDiffEq, StaticArrays
function lorenz(u, p, t)
σ = p[1]
ρ = p[2]
β = p[3]
du1 = σ * (u[2] - u[1])
du2 = u[1] * (ρ - u[3]) - u[2]
du3 = u[1] * u[2] - β * u[3]
return SVector{3}(du1, du2, du3)
end
u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem{false}(lorenz, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
@time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
adaptive = false, dt = 0.1f0)
# 9.721686 seconds (13.30 M allocations: 1.775 GiB, 17.76% gc time)
DiffEqGPU.CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 10_000,
adaptive = false, dt = 0.1f0)
# 0.022129 seconds (255.12 k CPU allocations: 33.413 MiB) (3 GPU allocations: 12.604 MiB, 0.09% memmgmt time)
#=
julia> versioninfo()
Julia Version 1.9.0-beta3
Commit 24204a7344 (2023-01-18 07:20 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
Threads: 1 on 32 virtual cores
Environment:
JULIA_IMAGE_THREADS = 1
JULIA_EDITOR = code
JULIA_NUM_THREADS =
=#
with a 2080 Super. How your Ryzen 9 5900X is an order of magnitude faster and your 3060 is an order of magnitude slower is... weird, to say the least.
@time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
DiffEqGPU.CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 10_000,
did you measure with different numbers of trajectories?
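For an apples-to-apples comparison, something like this runs both backends at the same trajectory count (a sketch reusing `monteprob` from the snippet above; the warm-up calls are there so compilation time doesn't pollute the measurements):

```julia
# Sketch: time CPU vs GPU at the *same* trajectory count, reusing
# `monteprob` from the code above. Warm up both paths first so the
# measured runs exclude compilation time.
ntraj = 100_000

solve(monteprob, Tsit5(), trajectories = 100, adaptive = false, dt = 0.1f0)
solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 100,
      adaptive = false, dt = 0.1f0)

@time solve(monteprob, Tsit5(), trajectories = ntraj,
            adaptive = false, dt = 0.1f0)
DiffEqGPU.CUDA.@time solve(monteprob, GPUTsit5(), EnsembleGPUKernel(),
                           trajectories = ntraj, adaptive = false, dt = 0.1f0)
```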
I've been using Pluto, but below are my REPL results.
As far as I can see, it uses 12 threads. Is my processor on steroids?
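(For reference: the CPU ensemble path parallelizes across however many Julia threads are available, which is where the 12 threads come in. Passing the ensemble algorithm explicitly makes this controllable; a sketch, with `EnsembleThreads`/`EnsembleSerial` being the standard SciML ensemble algorithms:)

```julia
# Sketch: pin the CPU ensemble strategy explicitly instead of relying
# on the default. EnsembleSerial() gives a single-core baseline.
@time solve(monteprob, Tsit5(), EnsembleThreads(), trajectories = 100_000,
            adaptive = false, dt = 0.1f0)   # uses all Julia threads
@time solve(monteprob, Tsit5(), EnsembleSerial(), trajectories = 100_000,
            adaptive = false, dt = 0.1f0)   # single-threaded baseline
```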
julia> @time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
adaptive = false, dt = 0.1f0)
0.199818 seconds (13.10 M allocations: 1.772 GiB)
julia> DiffEqGPU.CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 100_000,
adaptive = false, dt = 0.1f0)
0.189582 seconds (2.44 M CPU allocations: 330.496 MiB) (3 GPU allocations: 126.038 MiB, 0.16% memmgmt time)
EnsembleSolution Solution of length 100000 with uType:
ODESolution{Float32, 2, uType, Nothing, Nothing, tType, Nothing, P, A, IType, Nothing, Nothing} where {uType, tType, P, A, IType}
julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
Threads: 12 on 24 virtual cores
This could be a case where copying the GPU arrays back to the CPU to build the EnsembleSolution is more expensive than the GPU solve itself. Try the lower-level API: https://docs.sciml.ai/DiffEqGPU/dev/tutorials/lower_level_api/
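Based on the linked tutorial, the lower-level path looks roughly like this (a sketch; `vectorized_solve` and its keyword arguments are taken from the docs, so double-check them against your DiffEqGPU version):

```julia
# Sketch of the lower-level API: build the per-trajectory problems up
# front, move the batch to the GPU, and solve without assembling an
# EnsembleSolution on the CPU.
using DiffEqGPU, OrdinaryDiffEq, StaticArrays, CUDA

trajectories = 100_000
probs = map(1:trajectories) do i
    remake(prob, p = (@SVector rand(Float32, 3)) .* p)  # `prob` as defined above
end
gpu_probs = cu(probs)  # transfer the problem batch to GPU memory

# Returns raw time/state arrays; the data stays on the GPU until fetched.
ts, us = DiffEqGPU.vectorized_solve(gpu_probs, prob, GPUTsit5();
                                    save_everystep = false, dt = 0.1f0)
```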
Hi,
just copy-pasting the example, I don't see an improvement with the GPU version. My system info is in the versioninfo output above. Is there anything wrong, or is this expected?
Best,
Felix