ChrisRackauckas closed this issue 4 years ago
So there shouldn't be a big performance drop like that, the launch overhead might be slightly higher.
In #60 I noticed that you never specified the number of threads to use directly. And as it turned out, @jpsamaroo sneakily turned off the launch config calculation: https://github.com/SciML/DiffEqGPU.jl/pull/60/files#diff-90f5ad9f4eb9fd418f70216b94a00be1R38
So right now we are executing with 256 threads by default: https://github.com/JuliaGPU/KernelAbstractions.jl/blame/4ab11f29b615e72b5ec2112935593fb56309633a/src/backends/cuda.jl#L187
I should add a max operation there so that for small arrays we don't use a number of threads that is too big.
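The cap described above is naturally expressed as a `min` against the problem size. A minimal sketch (the names `DEFAULT_THREADS` and `launch_threads` are hypothetical, not KernelAbstractions API):

```julia
# Hedged sketch: cap the default per-block thread count by the number
# of work items, so small arrays don't launch more threads than elements.
const DEFAULT_THREADS = 256  # assumed default, matching the linked cuda.jl line

# Hypothetical helper: pick a thread count for a kernel over `n` elements.
launch_threads(n::Integer) = min(DEFAULT_THREADS, n)

launch_threads(10_000)  # large array: stays at the 256 default
launch_threads(32)      # small array: only 32 threads launched
```

For a large array the default is used unchanged; for a small one the launch shrinks to the element count, avoiding idle threads and wasted launch overhead.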
After marking a few things constant:
0.650184 seconds (2.49 M allocations: 122.845 MiB)
gg, KA is my new friend.
@vchuravy is this known?