Solve the GPU bottleneck

In the current state of branch Example02.04, SciML with the GPU backend is slower than with the CPU backend.

In examples/test_GPU.jl I was able to identify a problem with the type of $u(t)$ returned from the solver: at each step, the solver returns a CPU array for $u(t_0+dt)$ even if $u(t_0)$ is a GPU array. It also converts $u(t_0)$ from Float32 to Float64 (Float32 is suggested for optimal GPU performance).
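To make the check concrete, here is a minimal sketch of how a type change across one step can be detected. It uses only Base Julia and runs on the CPU. `euler_step` is a hypothetical stand-in for a single solver step and `preserves_type` is a hypothetical helper; neither is part of SciML or of this repository. The sketch shows one generic way a Float32 state can be promoted to Float64 (a Float64 time step), which may or may not be what happens inside the solver here.

```julia
# Toy explicit Euler step: a hypothetical stand-in for one solver step.
euler_step(f, u, dt) = u .+ dt .* f(u)

# Hypothetical helper: true only if both the container type and the
# element type of the state survive the step unchanged.
preserves_type(u0, u1) = typeof(u0) == typeof(u1)

rhs(u) = -u                       # simple decay, du/dt = -u
u0 = Float32[1.0, 2.0, 3.0]       # on a GPU this would be a GPU array of Float32

u_bad  = euler_step(rhs, u0, 0.1)     # Float64 time step promotes the state
u_good = euler_step(rhs, u0, 0.1f0)   # Float32 time step keeps Float32

println(eltype(u_bad))                # Float64
println(eltype(u_good))               # Float32
println(preserves_type(u0, u_bad))    # false
println(preserves_type(u0, u_good))   # true
```

The same `preserves_type` comparison could be applied to the input and output of the real solver step in examples/test_GPU.jl, since `typeof` distinguishes both CPU from GPU arrays and Float32 from Float64.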

DEEPDIP-project / CoupledNODE.jl, issue #29