JuliaGPU / CuArrays.jl

A Curious Cumulation of CUDA Cuisine
https://juliagpu.org/cuda/

ODE performance benchmark #566

Closed baggepinnen closed 4 years ago

baggepinnen commented 4 years ago

As discussed on Slack, here is a benchmark that solves an ODE where each operation is carried out on a CuArray:

# Installation
# using Pkg
# pkg"add MonteCarloMeasurements#gpu"
# pkg"add OrdinaryDiffEq ChangePrecision"

using CuArrays
using MonteCarloMeasurements, OrdinaryDiffEq, ChangePrecision, LinearAlgebra

function sim((±)::F, tspan) where F
    @changeprecision Float32 begin
        g = 9.79 ± 0.02; # Gravitational acceleration
        L = 1.00 ± 0.01; # Length of the pendulum
        u₀ = [0.0 ± 0.0, π / 3.0 ± 0.02] # Initial state (u[1] = θ, u[2] = dθ)
        gL = g/L
        function simplependulum(du,u,p,t)
            θ  = u[1]
            dθ = u[2]
            du[1] = dθ
            du[2] = -gL * sin(θ)
        end
        prob = ODEProblem(simplependulum, u₀, tspan)
        sol = solve(prob, Tsit5(), reltol = 1e-6, save_everystep=false, dense=false) # save_everystep=false and dense=false are required to avoid running out of GPU memory
    end
end

# Baseline: one scalar CPU simulation per sample, each drawing a fresh perturbation
function naive_mc(tspan, n)
    for i = 1:n
        sim((x,y)->x+y*randn(), tspan)
    end
end

tspan = (0.0f0, 100f0) # Duration of simulation

CuArrays.reclaim()
CuArrays.allowscalar(false) # This will lead to an error in OrdinaryDiffEq/initdt.jl line 73, uncomment that line and hotpatch the function
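# (A possible alternative to hot-patching, untested here: passing an explicit initial
# step size to `solve` should skip the automatic-dt heuristic in initdt.jl that does
# the scalar indexing, e.g.
#   solve(prob, Tsit5(), dt = 1f-2, reltol = 1e-6, save_everystep = false, dense = false)
# where dt = 1f-2 is just an illustrative guess.)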

# With 1e5 samples it goes about 2.5 times faster than the naive sim
@time sim((x,y)->MonteCarloMeasurements.CuParticles(fill(x,100_000) .+ y .* randn.()), tspan) # This takes around 19 seconds on my GPU, GTX 770
# 19.738116 seconds (30.17 M allocations: 33.190 GiB, 9.03% gc time)

# With 1e6 samples more than 50% of the time is spent in GC
@time sim((x,y)->MonteCarloMeasurements.CuParticles(fill(x,1_000_000) .+ y .* randn.()), tspan)
# 387.462061 seconds (30.24 M allocations: 321.369 GiB, 68.33% gc time)

@time naive_mc(tspan, 10000)
# 47.677040 seconds (46.87 M allocations: 3.223 GiB, 1.04% gc time)
maleadt commented 4 years ago

Are you not comparing CPU execution for n=10_000 (47 seconds) with GPU execution of n=100_000 (19 seconds)? Comparing both at the same N with the following:

function main(N=100_000, tspan=(0.0f0, 100f0))
    @time naive_mc(tspan, N)

    CuArrays.reclaim()
    CuArrays.@time sim((x,y)->MonteCarloMeasurements.CuParticles(fill(x,N) .+ y .* randn.()), tspan)

    return
end

I get:

julia> main(100_000)
 31.384719 seconds (36.60 M allocations: 2.816 GiB, 1.06% gc time)
  5.914546 seconds (18.54 M CPU allocations: 16.603 GiB, 12.21% gc time) (376.35 k GPU allocations: 273.562 GiB, 7.35% gc time of which 17.34% spent allocating)

julia> main(1_000_000)
319.073272 seconds (366.00 M allocations: 28.163 GiB, 0.97% gc time)
 48.553030 seconds (19.81 M CPU allocations: 160.754 GiB, 9.64% gc time) (376.59 k GPU allocations: 2.673 TiB, 1.00% gc time of which 7.64% spent allocating)

Of course, this is on an RTX 5000 with 16 GB of memory, while your GPU only has 2 GB. Trying to replicate that:

$ CUARRAYS_MEMORY_LIMIT="2000000000" julia

julia> main(100_000)
  5.675215 seconds (18.43 M CPU allocations: 16.617 GiB, 9.03% gc time) (376.59 k GPU allocations: 273.741 GiB, 18.33% gc time of which 7.87% spent allocating)

julia> main(1_000_000)
 51.856641 seconds (29.95 M CPU allocations: 161.215 GiB, 11.74% gc time) (376.59 k GPU allocations: 2.673 TiB, 6.12% gc time of which 1.17% spent allocating)

This still looks OK, 6x improvement across the board... Which version of Julia/CuArrays are you using?
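(For reference, a quick way to check both, assuming a Pkg version recent enough to accept a package name:)

using Pkg
println(VERSION)       # Julia version
Pkg.status("CuArrays") # prints the installed CuArrays version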

You can also try the split allocator, but it doesn't seem to perform better on this workload:

$ CUARRAYS_MEMORY_LIMIT="2000000000" CUARRAYS_MEMORY_POOL=split julia

julia> main(100_000)
  6.613166 seconds (21.05 M CPU allocations: 16.758 GiB, 9.77% gc time) (376.59 k GPU allocations: 273.741 GiB, 10.65% gc time of which 11.62% spent allocating)

julia> main(1_000_000)
 65.764215 seconds (32.53 M CPU allocations: 161.358 GiB, 29.77% gc time) (376.59 k GPU allocations: 2.673 TiB, 28.68% gc time of which 6.69% spent allocating)
maleadt commented 4 years ago

Finally, also try CuArrays 1.6.1 (release incoming) with Julia 1.4 or master; the GC interactions there are much less costly.
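(A sketch of how that upgrade could look once the release is tagged; the exact version pin is an assumption:)

using Pkg
Pkg.add(PackageSpec(name = "CuArrays", version = "1.6.1")) # the tagged release
# or, to try the development version instead:
Pkg.add(PackageSpec(name = "CuArrays", rev = "master"))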

baggepinnen commented 4 years ago

Wow, you're right, that's quite a bit better. My timings were actually correct and fair, but the code I posted had an incorrect number for naive_mc.

Trying CuArrays#master on Julia nightly, I actually get slightly slower performance:

main(10_000)
println()
main(100_000)
println()
main(1_000_000)

  9.387509 seconds (27.63 M CPU allocations: 4.926 GiB, 4.65% gc time) (367.88 k GPU allocations: 27.409 GiB, 9.10% gc time of which 16.30% spent allocating)
  4.415120 seconds (3.71 M allocations: 289.459 MiB, 0.80% gc time)

 17.178087 seconds (29.71 M CPU allocations: 33.135 GiB, 7.82% gc time) (376.35 k GPU allocations: 273.562 GiB, 5.09% gc time of which 12.56% spent allocating)
 43.949634 seconds (37.10 M allocations: 2.827 GiB, 0.85% gc time)

405.047600 seconds (29.89 M CPU allocations: 321.338 GiB, 69.88% gc time) (376.59 k GPU allocations: 2.673 TiB, 81.28% gc time of which 0.17% spent allocating)

I should note that I'm running a 4K monitor on the same GPU, so it actually only has about 1 GiB free for compute. In any case, a beefier GPU seems to provide an OK speedup, and I have some updating to do in my paper :)

maleadt commented 4 years ago
  9.387509 seconds (27.63 M CPU allocations: 4.926 GiB, 4.65% gc time) (367.88 k GPU allocations: 27.409 GiB, 9.10% gc time of which 16.30% spent allocating)
  4.415120 seconds (3.71 M allocations: 289.459 MiB, 0.80% gc time)

Is that after warm-up? But yeah, 1GB is not much so it's expected to put a lot of pressure on the GC.
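(When memory is that tight, it can also be worth forcing a collection between runs so buffers actually make it back to the pool; a minimal sketch:)

GC.gc()            # finalize dead CuArrays so their buffers return to the pool
CuArrays.reclaim() # hand cached pool memory back to the CUDA driver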

maleadt commented 4 years ago

Closing this, assuming it isn't an actual problem. Please reopen if you think this benchmark should perform well within this limited amount of memory.