GPU-acceleration routines for DifferentialEquations.jl and the broader SciML scientific machine learning ecosystem
Performance example Parameter-Parallelism with GPU Ensemble Methods

Open roflmaostc opened 1 year ago

roflmaostc commented 1 year ago


I just copy-pasted the example:

using DiffEqGPU, OrdinaryDiffEq, StaticArrays, CUDA

function lorenz(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]

tspan = (0.0f0, 10.0f0)

p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]

prob = ODEProblem{false}(lorenz, u0, tspan, p)

prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)

monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

@time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
                  adaptive = false, dt = 0.1f0)

CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 100_000,
                  adaptive = false, dt = 0.1f0)

I don't see an improvement with the GPU version.

  0.182350 seconds (14.10 M allocations: 1.822 GiB)
 0.249538 seconds (3.44 M CPU allocations: 382.376 MiB) (3 GPU allocations: 126.038 MiB, 0.01% memmgmt time)


Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
  Threads: 12 on 24 virtual cores

NVIDIA GeForce RTX 3060 (GPU 0)

Is there anything wrong or is this expected?



ChrisRackauckas commented 1 year ago

Your measurements look really weird. Are you sure that's what was actually run? Do it in the REPL, not a Pluto notebook:

using DiffEqGPU, OrdinaryDiffEq, StaticArrays

function lorenz(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem{false}(lorenz, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

@time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
                    adaptive = false, dt = 0.1f0)
# 9.721686 seconds (13.30 M allocations: 1.775 GiB, 17.76% gc time)

DiffEqGPU.CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 10_000,
                  adaptive = false, dt = 0.1f0)

# 0.022129 seconds (255.12 k CPU allocations: 33.413 MiB) (3 GPU allocations: 12.604 MiB, 0.09% memmgmt time)

julia> versioninfo()
Julia Version 1.9.0-beta3
Commit 24204a7344 (2023-01-18 07:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 1 on 32 virtual cores

with a 2080 Super. How your Ryzen 9 5900 is an order of magnitude faster and your 3060 is an order of magnitude slower is... weird to say the least.

roflmaostc commented 1 year ago

@time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
DiffEqGPU.CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 10_000,

Did you measure with different numbers of trajectories?

I've been using Pluto, but below are my REPL results.

As far as I can see, it uses 12 threads. Is my processor on steroids?

julia> @time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
                           adaptive = false, dt = 0.1f0)
  0.199818 seconds (13.10 M allocations: 1.772 GiB)

julia> DiffEqGPU.CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 100_000,
                         adaptive = false, dt = 0.1f0)
  0.189582 seconds (2.44 M CPU allocations: 330.496 MiB) (3 GPU allocations: 126.038 MiB, 0.16% memmgmt time)
EnsembleSolution Solution of length 100000 with uType:
ODESolution{Float32, 2, uType, Nothing, Nothing, tType, Nothing, P, A, IType, Nothing, Nothing} where {uType, tType, P, A, IType}

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
  Threads: 12 on 24 virtual cores
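
One possible source of the CPU gap is the thread count: the two `versioninfo()` outputs show 1 thread on one machine and 12 on the other, and, to my understanding, `solve` on an `EnsembleProblem` defaults to `EnsembleThreads()` when no ensemble algorithm is passed. The first `@time` call in a fresh session also includes compilation. The following is only a minimal CPU-side sketch for checking both effects, assuming `EnsembleSerial()` and `EnsembleThreads()` are available through `OrdinaryDiffEq`; the trajectory count `n` and the timing variable names are hypothetical choices, not part of the original example.

```julia
using OrdinaryDiffEq, StaticArrays

# Same out-of-place Lorenz system as in the example above.
function lorenz(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem{false}(lorenz, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

# Hypothetical trajectory count; the point is to use the SAME value in every run.
n = 10_000

# Warm-up calls so that compilation time is not part of the measurement.
solve(monteprob, Tsit5(), EnsembleSerial(), trajectories = 2,
      adaptive = false, dt = 0.1f0)
solve(monteprob, Tsit5(), EnsembleThreads(), trajectories = 2,
      adaptive = false, dt = 0.1f0)

# Timed runs with identical settings; only the ensemble algorithm differs.
t_serial = @elapsed sol_serial = solve(monteprob, Tsit5(), EnsembleSerial(),
                                       trajectories = n, adaptive = false, dt = 0.1f0)
t_threads = @elapsed sol_threads = solve(monteprob, Tsit5(), EnsembleThreads(),
                                         trajectories = n, adaptive = false, dt = 0.1f0)

println("threads available: ", Threads.nthreads())
println("serial:   ", t_serial, " s")
println("threaded: ", t_threads, " s")
```

If Julia is started with a single thread, the two timings should come out roughly equal; with `julia -t 12` the threaded run would be expected to be faster, though the actual ratio depends on the machine. The same warm-up and equal-`trajectories` discipline applies when timing the GPU call.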

utkarsh530 commented 1 year ago

This could be a case where copying the GPU arrays back to the CPU to build the EnsembleSolution is more expensive than the GPU solve itself. Try using the lower-level API: