SciML / DiffEqGPU.jl

GPU-acceleration routines for DifferentialEquations.jl and the broader SciML scientific machine learning ecosystem
https://docs.sciml.ai/DiffEqGPU/stable/
MIT License

Sizable performance regression from KernelAbstractions update #62

Closed: ChrisRackauckas closed this issue 4 years ago

ChrisRackauckas commented 4 years ago
using DiffEqGPU, OrdinaryDiffEq

# Lorenz system in the in-place, @inbounds form expected for GPU kernels
function lorenz(du,u,p,t)
    @inbounds begin
        du[1] = p[1]*(u[2]-u[1])
        du[2] = u[1]*(p[2]-u[3]) - u[2]
        du[3] = u[1]*u[2] - p[3]*u[3]
    end
    nothing
end

u0 = Float32[1.0;0.0;0.0]
tspan = (0.0f0,100.0f0)
p = [10.0f0,28.0f0,8/3f0]
prob = ODEProblem(lorenz,u0,tspan,p)
# Each trajectory randomly rescales the parameters
prob_func = (prob,i,repeat) -> remake(prob,p=rand(Float32,3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func)

using BenchmarkTools
@btime sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10_000,saveat=1.0f0)

# Before the KernelAbstractions update
1.875 s (3299926 allocations: 170.77 MiB)

# After the KernelAbstractions update
2.218 s (3388938 allocations: 166.80 MiB)

@vchuravy is this known?

vchuravy commented 4 years ago

So there shouldn't be a big performance drop like that; the launch overhead might be slightly higher.

In #60 I noticed that you never directly specified the number of threads to use. And as it turned out, @jpsamaroo sneakily turned off the launch-config calculation: https://github.com/SciML/DiffEqGPU.jl/pull/60/files#diff-90f5ad9f4eb9fd418f70216b94a00be1R38

So right now we are executing with 256 threads by default: https://github.com/JuliaGPU/KernelAbstractions.jl/blame/4ab11f29b615e72b5ec2112935593fb56309633a/src/backends/cuda.jl#L187

I should do a max operation there so that for small arrays we don't use a number of threads that is too big.
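
For illustration only, here is a minimal sketch (not DiffEqGPU or KernelAbstractions internals) of launching a KernelAbstractions kernel with an explicit workgroup size, together with the kind of cap being described so that small problems don't request more threads than they have work items. The kernel scale!, the helper pick_workgroupsize, and the sizes are made up for this example, and the launch API shown follows current KernelAbstractions documentation, which may differ from the version in use at the time of this issue.

using KernelAbstractions

# Toy kernel: scale each element in place. Purely illustrative.
@kernel function scale!(A, s)
    i = @index(Global)
    @inbounds A[i] *= s
end

# The kind of cap under discussion: never ask for more threads than work items.
pick_workgroupsize(ndrange; default = 256) = min(default, ndrange)

A = ones(Float32, 100)                       # deliberately small problem
backend = KernelAbstractions.get_backend(A)  # CPU backend for a plain Array
wgsize = pick_workgroupsize(length(A))       # 100 here, instead of 256
scale!(backend, wgsize)(A, 2f0; ndrange = length(A))
KernelAbstractions.synchronize(backend)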

ChrisRackauckas commented 4 years ago

After marking a few things constant:

0.650184 seconds (2.49 M allocations: 122.845 MiB)
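
For context, a minimal sketch of what "marking a few things constant" can look like in the script above, assuming the change was declaring the captured globals const so that closures like prob_func (which captures p) avoid untyped global accesses; the issue doesn't show the exact bindings that were changed.

# Hypothetical sketch, not the exact diff from the issue: make the globals
# const so prob_func's closure over p doesn't hit untyped global lookups.
const u0 = Float32[1.0; 0.0; 0.0]
const tspan = (0.0f0, 100.0f0)
const p = [10.0f0, 28.0f0, 8/3f0]
const prob = ODEProblem(lorenz, u0, tspan, p)
const prob_func = (prob, i, repeat) -> remake(prob, p = rand(Float32, 3) .* p)
const monteprob = EnsembleProblem(prob, prob_func = prob_func)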

gg KA is my new friend.