SciML / DiffEqFlux.jl

Pre-built implicit layer architectures with O(1) backprop, GPUs, and stiff+non-stiff DE solvers, demonstrating scientific machine learning (SciML) and physics-informed machine learning methods
https://docs.sciml.ai/DiffEqFlux/stable
MIT License

sciml_train with Universal ODE fails on GPU #400

Closed · sandorberegi closed this issue 4 years ago

sandorberegi commented 4 years ago

Hi, I'm trying to train the universal ODE model from the predator-prey example, but sciml_train fails (apparently inside Zygote.gradient):

using DiffEqFlux, Flux, Optim, OrdinaryDiffEq, CUDA, DiffEqSensitivity, Plots

u0 = [1.1; 1.1] |>gpu
tspan = (0.0f0,25.0f0)

ann = FastChain(FastDense(2,16,tanh), FastDense(16,16,tanh), FastDense(16,1)) |>gpu
p1 = initial_params(ann) |>gpu
p2 = Float32[0.5,-0.5]
p3 = [p1;p2]
θ = Float32[u0;p3]

function dudt_(du,u,p,t)
    x, y = u
    du[1] = ann(u,p[1:length(p1)])[1]
    du[2] = p[end-1]*y + p[end]*x
end
prob = ODEProblem(dudt_,u0,tspan,p3)

function predict_adjoint(θ)
  gpu(Array(concrete_solve(prob,Tsit5(),θ[1:2],θ[3:end],saveat=0.0:1:25.0,sensealg=ForwardSensitivity(autojacvec=false, autodiff=true))))
end
loss_adjoint(θ) = sum(abs2,predict_adjoint(θ)[2,:].-1)
l = loss_adjoint(θ)

cb = function (θ,l)
  println(l)
  #display(plot(solve(remake(prob,p=Flux.data(p3),u0=Flux.data(u0)),Tsit5(),saveat=0.1),ylim=(0,6)))
  return false
end

loss1 = loss_adjoint(θ)
res = DiffEqFlux.sciml_train(loss_adjoint, θ, ADAM(), cb = cb, maxiters=10)

I get the following error:

ArgumentError: cannot take the CPU address of a CuArrays.CuArray{Float32,1,CuArrays.CuArray{Float32,1,Nothing}}
in top-level scope at Ml3.jl:38
in  at DiffEqFlux\03BAI\src\train.jl:77
in sciml_train##kw at DiffEqFlux\03BAI\src\train.jl:77 
in #sciml_train#22 at DiffEqFlux\03BAI\src\train.jl:42
in maybe_with_logger at DiffEqBase\V7P18\src\utils.jl:259
in  at DiffEqFlux\03BAI\src\train.jl:43
in macro expansion at ProgressLogging\BBN0b\src\ProgressLogging.jl:328 
in macro expansion at DiffEqFlux\03BAI\src\train.jl:98 
in gradient at Zygote\YeCEW\src\compiler\interface.jl:54
in pullback at Zygote\YeCEW\src\compiler\interface.jl:174
in _pullback at Zygote\YeCEW\src\compiler\interface2.jl
in #24 at DiffEqFlux\03BAI\src\train.jl:99 
in _pullback at ZygoteRules\6nssF\src\adjoint.jl:47 
in adjoint at Zygote\YeCEW\src\lib\lib.jl:179 
in _pullback at Zygote\YeCEW\src\compiler\interface2.jl
in loss_adjoint at Ml3.jl:25 
in _pullback at Zygote\YeCEW\src\compiler\interface2.jl
in predict_adjoint at Ml3.jl:23 
in _pullback at ZygoteRules\6nssF\src\adjoint.jl:53
in adjoint##kw at base\none 
in #adjoint#478 at DiffEqBase\V7P18\src\solve.jl:298 
in _concrete_solve_adjoint##kw at DiffEqSensitivity\0vkh1\src\local_sensitivity\concrete_solve.jl:98 
in #_concrete_solve_adjoint#96 at DiffEqSensitivity\0vkh1\src\local_sensitivity\concrete_solve.jl:99
in  at DiffEqBase\V7P18\src\solve.jl:100
in #solve#460 at DiffEqBase\V7P18\src\solve.jl:102 
in solve_up##kw at DiffEqBase\V7P18\src\solve.jl:107 
in #solve_up#461 at DiffEqBase\V7P18\src\solve.jl:114 
in solve_call##kw at DiffEqBase\V7P18\src\solve.jl:65 
in #solve_call#457 at DiffEqBase\V7P18\src\solve.jl:92
in  at OrdinaryDiffEq\JrtsK\src\solve.jl:4
in #__solve#357 at OrdinaryDiffEq\JrtsK\src\solve.jl:4 
in __init##kw at OrdinaryDiffEq\JrtsK\src\solve.jl:67 
in __init##kw at OrdinaryDiffEq\JrtsK\src\solve.jl:67 
in __init##kw at OrdinaryDiffEq\JrtsK\src\solve.jl:67 
in __init##kw at OrdinaryDiffEq\JrtsK\src\solve.jl:67 
in  at OrdinaryDiffEq\JrtsK\src\solve.jl:67
in #__init#358 at OrdinaryDiffEq\JrtsK\src\solve.jl:402
in initialize! at OrdinaryDiffEq\JrtsK\src\perform_step\low_order_rk_perform_step.jl:623
in  at DiffEqSensitivity\0vkh1\src\local_sensitivity\forward_sensitivity.jl:76
in mul! at stdlib\v1.4\LinearAlgebra\src\matmul.jl:208 
in mul! at stdlib\v1.4\LinearAlgebra\src\matmul.jl:66 
in gemv! at stdlib\v1.4\LinearAlgebra\src\matmul.jl:470
in gemv! at stdlib\v1.4\LinearAlgebra\src\blas.jl:587
in unsafe_convert at CuArrays\YFdj7\src\array.jl:226

I would be very grateful for any ideas on how to fix it.

lorrp1 commented 4 years ago

I get the same error in another example with the GPU.

ChrisRackauckas commented 4 years ago

I mean, if you want this to work you would do:

using DiffEqFlux, Flux, Optim, OrdinaryDiffEq, CUDA, DiffEqSensitivity, Plots

u0 = [1.1; 1.1] |>gpu
tspan = (0.0f0,25.0f0)

ann = FastChain(FastDense(2,16,tanh), FastDense(16,16,tanh), FastDense(16,1)) |>gpu
p1 = initial_params(ann) |>gpu
p2 = Float32[0.5,-0.5]
p3 = [p1;p2]
θ = Float32[u0;p3]

function dudt_(du,u,p,t)
    x, y = CUDA.@allowscalar(u[1]),CUDA.@allowscalar(u[2])
    du1 = CUDA.@allowscalar(ann(u,p[1:length(p1)])[1])
    du2 = CUDA.@allowscalar(p[end-1])*y + CUDA.@allowscalar(p[end])*x
    du .= gpu([du1,du2])
end
prob = ODEProblem(dudt_,u0,tspan,p3)

function predict(θ)
  gpu(Array(solve(prob,Tsit5(),u0=θ[1:2],p=θ[3:end],saveat=0.0:1:25.0,sensealg=ForwardDiffSensitivity())))
end
loss(θ) = sum(abs2,predict(θ)[2,:].-1)
l = loss(θ)

cb = function (θ,l)
  println(l)
  #display(plot(solve(remake(prob,p=Flux.data(p3),u0=Flux.data(u0)),Tsit5(),saveat=0.1),ylim=(0,6)))
  return false
end

loss1 = loss(θ)
res = DiffEqFlux.sciml_train(loss, θ, ADAM(), cb = cb, maxiters=10)

I mean this in the nicest way possible: what you're trying to do is misguided. The code above makes it run, but I wouldn't recommend it: it's god-awfully slow. This code demonstrates quite a few misunderstandings, so let me walk through them a bit.

First of all, GPUs are SPMD machines which essentially compute the same operation on all cores simultaneously. This means you need operations that do "5,000 of the same thing" in order to do effective GPU computing. The classic example is matrix multiplication, since it corresponds to doing a dot product between every row and column, i.e., N operations for each of the N^2 output entries. At around a 100x100 matrix this just starts to break even: a standard GPU with ~5,000 cores now does 2 dot products per core, which just barely keeps the cores busy long enough to overcome the kernel startup cost.
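To make that break-even point concrete, here's a minimal timing sketch (my illustration, assuming CUDA.jl and a working GPU; the exact numbers depend heavily on hardware):

using CUDA, LinearAlgebra

A = rand(Float32, 100, 100); B = rand(Float32, 100, 100)
Ag, Bg = cu(A), cu(B)

A * B; CUDA.@sync Ag * Bg      # warm up (compile) both paths first
@time A * B                    # CPU matmul
CUDA.@time CUDA.@sync Ag * Bg  # GPU matmul, including kernel launch overhead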

However, all cores on the GPU act in lockstep. If you make 1 core do a +, then all 5,000 cores will do a + operation, and if your function call is simply a single + operation, this means 1 core does real work and 4,999 do fake work. You can see very quickly how this amounts to a ton of fake work on any SPMD architecture, meaning that

function dudt_(du,u,p,t)
    x, y = u
    du[1] = ann(u,p[1:length(p1)])[1]
    du[2] = p[end-1]*y + p[end]*x
end

is an algorithm that launches a lot of kernels but does very little GPU work per kernel, and is thus very inefficient. You can make it work by explicitly allowing the single scalar operations to occur:

function dudt_(du,u,p,t)
    x, y = CUDA.@allowscalar(u[1]),CUDA.@allowscalar(u[2])
    du1 = CUDA.@allowscalar(ann(u,p[1:length(p1)])[1])
    du2 = CUDA.@allowscalar(p[end-1])*y + CUDA.@allowscalar(p[end])*x
    du .= gpu([du1,du2])
end

But again, that's not efficient. Also, the neural network as specified is at most doing 16x16 matrix-vector multiplications, which are nowhere near large enough to overcome the kernel launch cost, so nothing in that function is a good fit for GPUs.
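The scalar-indexing trap in particular can be made explicit. A minimal sketch (assuming CUDA.jl) of why x, y = u is a problem on a GPU array:

using CUDA
CUDA.allowscalar(false)        # turn accidental scalar reads into loud errors

u = cu(Float32[1.1, 1.1])
# x, y = u                     # would error: destructuring reads u[1], u[2] one at a time
x = CUDA.@allowscalar u[1]     # explicit opt-in: one synchronizing transfer per read
y = CUDA.@allowscalar u[2]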

Now let's assume you have a very large neural network in this kind of code: what should you do? Again, the scalar operations would be deadly on the GPU, so the most efficient way to handle it would be to keep the neural network on the GPU and the differential equation on the CPU, and offload only the NN evaluations. That would look like this:

using DiffEqFlux, Flux, Optim, OrdinaryDiffEq, CUDA, DiffEqSensitivity, Plots

u0 = [1.1; 1.1] |> gpu
tspan = (0.0f0,25.0f0)

ann = FastChain(FastDense(2,16,tanh), FastDense(16,16,tanh), FastDense(16,1)) |>gpu
p1 = initial_params(ann) |>gpu
p2 = Float32[0.5,-0.5]
p3 = [p1;p2]
θ = Float32[u0;p3]

function dudt_(u,p,t)
    x, y = u
    [cpu(ann(gpu(u),p[1:length(p1)]))[1],p[end-1]*y + p[end]*x]
end
prob = ODEProblem{false}(dudt_,u0,tspan,p3)

function predict_adjoint(θ)
  gpu(Array(solve(prob,Tsit5(),u0=cpu(θ[1:2]),p=θ[3:end],saveat=0.0:1:25.0,sensealg=ForwardDiffSensitivity())))
end
loss_adjoint(θ) = sum(abs2,predict_adjoint(θ)[2,:].-1)
l = loss_adjoint(θ)

cb = function (θ,l)
  println(l)
  #display(plot(solve(remake(prob,p=Flux.data(p3),u0=Flux.data(u0)),Tsit5(),saveat=0.1),ylim=(0,6)))
  return false
end

loss1 = loss_adjoint(θ)
res = DiffEqFlux.sciml_train(loss_adjoint, θ, ADAM(), cb = cb, maxiters=10)

Again, you'd need a much larger neural network for this to actually be faster, but this pattern would actually be useful in a lot of these UDE (universal differential equation) training contexts.

But it's still slow for one last reason: you chose sensealg=ForwardSensitivity(autojacvec=false, autodiff=true). First of all, that version doesn't have very high SIMD potential, so sensealg=ForwardDiffSensitivity() is what you really want for forward mode here. But secondly, you don't want forward mode here at all. https://arxiv.org/abs/1812.01892 documents quite well that the cutoff is around 100 parameters. Basically, forward mode is O(np) with a small constant and reverse mode is O(n+p) with a bigger constant, so below ~100 parameters you want forward mode and above it you want reverse mode. So... you really want to use reverse mode here, and in pretty much any case with a neural network, since even this tiny neural network has >300 parameters. The code above is therefore partially slow because of the GPU evaluation of small matmuls, but a big factor is also forward mode.
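For reference, the parameter count is easy to check directly with the same constructors as above (a quick sketch; the count follows from the layer sizes: 2·16+16 + 16·16+16 + 16·1+1 = 337):

using DiffEqFlux

ann = FastChain(FastDense(2,16,tanh), FastDense(16,16,tanh), FastDense(16,1))
length(initial_params(ann))    # 337 — already well past the ~100-parameter cutoff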

So in total, all of these factors combined are why the documented example is CPU + reverse mode. It's not just one example way to do the calculation: it's also the most efficient. However, what this brings up is that, as the neural network becomes asymptotically large, the most efficient way to do this computation would be a mixed CPU+GPU strategy with reverse mode, which would look like:

using DiffEqFlux, Flux, Optim, OrdinaryDiffEq, CUDA, DiffEqSensitivity, Plots

u0 = [1.1; 1.1] |> gpu
tspan = (0.0f0,25.0f0)

ann = FastChain(FastDense(2,16,tanh), FastDense(16,16,tanh), FastDense(16,1)) |>gpu
p1 = initial_params(ann) |>gpu
p2 = Float32[0.5,-0.5]
p3 = [p1;p2]
θ = Float32[u0;p3]

function dudt_(u,p,t)
    x, y = u
    [cpu(ann(gpu(u),p[1:length(p1)]))[1],p[end-1]*y + p[end]*x]
end
prob = ODEProblem{false}(dudt_,u0,tspan,p3)

function predict_adjoint(θ)
  gpu(Array(solve(prob,Tsit5(),u0=cpu(θ[1:2]),p=θ[3:end],saveat=0.0:1:25.0,sensealg=InterpolatingAdjoint())))
end
loss_adjoint(θ) = sum(abs2,predict_adjoint(θ)[2,:].-1)
l = loss_adjoint(θ)

cb = function (θ,l)
  println(l)
  #display(plot(solve(remake(prob,p=Flux.data(p3),u0=Flux.data(u0)),Tsit5(),saveat=0.1),ylim=(0,6)))
  return false
end

loss1 = loss_adjoint(θ)
res = DiffEqFlux.sciml_train(loss_adjoint, θ, ADAM(), cb = cb, maxiters=10)

I say "would look like" because this investigation revealed that this mixture doesn't work right now, and it would be a good one to get working. I opened a clean https://github.com/SciML/DiffEqFlux.jl/issues/401 to track this.

In the end, this issue is really about "what is a GPU good for?", and the answer is that it's good for matrix multiplications, which makes it an acceleration tool for machine learning and PDE calculations, but it doesn't map well to scalarized nonlinear codes like chemical reaction networks. That said, there are two things you can do. One is the offloading approach I mentioned, if the NN is large enough. The second is to GPU-parallelize across different ODE solves, which is documented here: https://github.com/SciML/DiffEqGPU.jl#parameter-parallelism-with-gpu-ensemble-methods and sketched below. That is good if you have lots of small ODEs, though you'll overload your GPU quickly if the ODEs are too large or have too many parameters per system.
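For that second option, here is a minimal sketch adapted from the DiffEqGPU README (the Lorenz system is just a stand-in; note this parallelizes many solves of one small ODE, not one big solve):

using DiffEqGPU, OrdinaryDiffEq

function lorenz(du,u,p,t)
    du[1] = p[1]*(u[2]-u[1])
    du[2] = u[1]*(p[2]-u[3]) - u[2]
    du[3] = u[1]*u[2] - p[3]*u[3]
end

u0 = Float32[1.0;0.0;0.0]
tspan = (0.0f0,100.0f0)
p = Float32[10.0,28.0,8/3]
prob = ODEProblem(lorenz,u0,tspan,p)

# Re-solve the same small ODE under 10,000 random parameter sets, batched on the GPU
prob_func = (prob,i,repeat) -> remake(prob, p = rand(Float32,3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
sol = solve(monteprob, Tsit5(), EnsembleGPUArray(), trajectories = 10_000, saveat = 1.0f0)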

The other real solution is to use accelerator hardware that doesn't require SPMD programs. Graphcore's IPU is a MIMD device, and @msaroufim this is precisely an issue that demonstrates how a purely SPMD model breaks down in scientific machine learning and why I want to target the IPU.

I'm closing this since there's not much else actionable here, but https://github.com/SciML/DiffEqFlux.jl/issues/401 is something that we should make sure to do. Hopefully that is a fairly complete answer that describes both why the docs are the way they are, and all of the ways to do things differently along with why they aren't necessarily great ideas.

sandorberegi commented 4 years ago

Thanks for the very detailed answer! My intention is actually closest to the second case you mention above: speeding up the calculations with larger NNs, with more complex use cases in mind (I'm currently working on physics-based / machine-learnable hybrid modelling of aeroelastic flutter). For this purpose, it would be really useful to have the other (actually useful :) ) adjoints included, since ForwardDiffSensitivity() on the GPU doesn't seem to provide any speed gain compared to the other adjoints running in CPU-only code.

ChrisRackauckas commented 4 years ago

Yup, for that we can follow the MWE at https://github.com/SciML/DiffEqFlux.jl/issues/401 . I'll probably take a look this weekend. I'm actually surprised it didn't just work, though I'd never tried doing that before.