SciML / DiffEqGPU.jl

GPU-acceleration routines for DifferentialEquations.jl and the broader SciML scientific machine learning ecosystem
https://docs.sciml.ai/DiffEqGPU/stable/
MIT License

Performance example: Parameter-Parallelism with GPU Ensemble Methods #235

Open roflmaostc opened 1 year ago

roflmaostc commented 1 year ago

Hi,

I'm just copy-pasting the example:

using DiffEqGPU, OrdinaryDiffEq, StaticArrays, CUDA

function lorenz(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem{false}(lorenz, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

@time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
                   adaptive = false, dt = 0.1f0)

CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 100_000,
                       adaptive = false, dt = 0.1f0)

I don't see an improvement with the GPU version.

# CPU
  0.182350 seconds (14.10 M allocations: 1.822 GiB)
# GPU
 0.249538 seconds (3.44 M CPU allocations: 382.376 MiB) (3 GPU allocations: 126.038 MiB, 0.01% memmgmt time)

System:

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
  Threads: 12 on 24 virtual cores
Environment:
  JULIA_REVISE_WORKER_ONLY = 1

NVIDIA GeForce RTX 3060 (GPU 0)

Is there anything wrong or is this expected?

Best,

Felix

ChrisRackauckas commented 1 year ago

Your measurements look really weird. Are you sure that's what was actually run? Do it in the REPL, not a Pluto notebook:

using DiffEqGPU, OrdinaryDiffEq, StaticArrays

function lorenz(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem{false}(lorenz, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

@time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
                    adaptive = false, dt = 0.1f0)
# 9.721686 seconds (13.30 M allocations: 1.775 GiB, 17.76% gc time)

DiffEqGPU.CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 10_000,
                  adaptive = false, dt = 0.1f0)

# 0.022129 seconds (255.12 k CPU allocations: 33.413 MiB) (3 GPU allocations: 12.604 MiB, 0.09% memmgmt time)

#=
julia> versioninfo()
Julia Version 1.9.0-beta3
Commit 24204a7344 (2023-01-18 07:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 1 on 32 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =
=#

That's with a 2080 Super. How your Ryzen 9 5900X is an order of magnitude faster and your 3060 is an order of magnitude slower is... weird, to say the least.
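One way to rule out compilation time skewing either measurement is to warm both solvers up once and then time the same trajectory count on both backends. A minimal sketch, assuming the `monteprob` defined above and that `CUDA` is loaded:

# Warm-up calls so neither measurement includes compilation time
solve(monteprob, Tsit5(), trajectories = 10, adaptive = false, dt = 0.1f0)
solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 10,
      adaptive = false, dt = 0.1f0)

# Time both backends on the same number of trajectories
@time solve(monteprob, Tsit5(), trajectories = 100_000,
            adaptive = false, dt = 0.1f0)
CUDA.@time solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 100_000,
                 adaptive = false, dt = 0.1f0)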

roflmaostc commented 1 year ago

@time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
DiffEqGPU.CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 10_000,

Did you measure with different numbers of trajectories?

I've been using Pluto, but my REPL results are below.

As far as I can see, it uses 12 threads. Is my processor on steroids?

julia> @time sol2 = solve(monteprob, Tsit5(), trajectories = 100_000,
                           adaptive = false, dt = 0.1f0)
  0.199818 seconds (13.10 M allocations: 1.772 GiB)

julia> DiffEqGPU.CUDA.@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 100_000,
                         adaptive = false, dt = 0.1f0)
  0.189582 seconds (2.44 M CPU allocations: 330.496 MiB) (3 GPU allocations: 126.038 MiB, 0.16% memmgmt time)
EnsembleSolution Solution of length 100000 with uType:
ODESolution{Float32, 2, uType, Nothing, Nothing, tType, Nothing, P, A, IType, Nothing, Nothing} where {uType, tType, P, A, IType}

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 5900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
  Threads: 12 on 24 virtual cores

utkarsh530 commented 1 year ago

This could be a case where copying the GPU arrays back to the CPU to build the EnsembleSolution is more expensive than the GPU solve itself. Try using the lower-level API: https://docs.sciml.ai/DiffEqGPU/dev/tutorials/lower_level_api/
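A rough sketch of what that lower-level path can look like for this problem, adapted loosely from the linked tutorial (it assumes the `prob` and `p` defined above, and the exact `DiffEqGPU.vectorized_solve` keyword arguments may differ between DiffEqGPU versions):

trajectories = 100_000

# Build one problem per parameter set on the CPU
probs = map(1:trajectories) do i
    remake(prob, p = (@SVector rand(Float32, 3)) .* p)
end

# Move the batch of problems to the GPU
gpu_probs = cu(probs)

# Solve directly on the device; `ts` and `us` stay as GPU arrays instead of
# being gathered into an EnsembleSolution on the CPU
CUDA.@time ts, us = DiffEqGPU.vectorized_solve(gpu_probs, prob, GPUTsit5();
                                               save_everystep = false, dt = 0.1f0)

Keeping the results on the device avoids the final device-to-host copy and EnsembleSolution construction, which is where the time may be going here.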