On the current state of branch Example02.04, SciML with GPUbackend is slower than with CPUbackend.
In examples/test_GPU.jl I was able to identify a problem with the type of $u(t)$ returned from the solver:
At each step, the solver returns a CPU array for $u(t_0+dt)$ even if $u(t_0)$ is a GPU array. It also converts $u(t_0)$ from Float32 to Float64 (Float32 is recommended for optimal GPU performance).
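
For anyone wanting to reproduce the check, here is a minimal sketch of how the two mismatches (container type and element type) can be detected. `state_type_report` is a hypothetical helper, not something in the repository, and the arrays are plain CPU stand-ins so the snippet runs without a GPU. In examples/test_GPU.jl one would presumably pass the initial state and a state taken from the solver output instead (for a SciML solution, e.g. `sol.u[end]`).

```julia
# Hypothetical helper (not part of the repository): compare the container type
# and the element type of the state before and after a solver step.
function state_type_report(u_before::AbstractArray, u_after::AbstractArray)
    return (
        # Compare the unparameterized array types (Array, CuArray, ...).
        same_container = Base.typename(typeof(u_before)).wrapper ===
                         Base.typename(typeof(u_after)).wrapper,
        # Compare the element types (Float32, Float64, ...).
        same_eltype = eltype(u_before) === eltype(u_after),
        before = typeof(u_before),
        after = typeof(u_after),
    )
end

# Stand-in data: a Float32 state and a Float64 copy, mimicking the
# Float32 -> Float64 promotion described above.
u0 = rand(Float32, 4)
u1 = Float64.(u0)

report = state_type_report(u0, u1)
println(report)
```

With a GPU array as `u_before` and a CPU array as `u_after`, `same_container` would be `false`, which is the first symptom reported here; `same_eltype == false` corresponds to the second.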