CliMA / ClimateMachine.jl

Climate Machine: an Earth System Model that automatically learns from data
https://clima.github.io/ClimateMachine.jl/latest/

DifferentialEquations.jl timestepping? #421

Open ChrisRackauckas opened 4 years ago

ChrisRackauckas commented 4 years ago

I dug through the code and DifferentialEquations.jl could add quite a bit to the efficiency of the time stepping code. For reference, what it would really add is SSP methods:

http://docs.juliadiffeq.org/latest/solvers/ode_solve.html#Explicit-Strong-Stability-Preserving-Runge-Kutta-Methods-for-Hyperbolic-PDEs-(Conservation-Laws)-1

Some low storage RKs:

http://docs.juliadiffeq.org/latest/solvers/ode_solve.html#Low-Storage-Methods-1

(the number of registers is verified in tests to be the theoretical minimum), and a bunch more IMEX methods:

http://docs.juliadiffeq.org/latest/solvers/split_ode_solve.html

The interface can effectively stay the same. The only difficulty is the following:

https://github.com/climate-machine/CLIMA/blob/master/src/ODESolvers/AdditiveRungeKuttaMethod.jl#L354-L356

If you encapsulate all of this information into an array so that the element-wise update of the array can be expressed as a broadcast, then this code is generic. I don't know if you're working on a heterogeneous array type, but that's effectively "all" we'd need for this to work. Since DiffEq only touches the state through broadcast expressions (verified by CuArray support), once such an array type is set up all of the operations would stay distributed (though you may need to plug in your own linear solver, which is quite simple since you'd just use the linsolve interface).
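For instance, a stage update written as a single broadcast over the state is already generic over the storage (a toy sketch, not ClimateMachine code; the arrays and coefficients below are made up):

```julia
# u, k1, k2 stand in for state arrays; they could equally be Arrays, CuArrays,
# or an MPI-backed state array, as long as the type supports broadcasting.
u, k1, k2 = rand(10), rand(10), rand(10)
a, b, dt  = 0.5, 0.5, 0.01

# The whole element-wise update is one broadcast, so it runs wherever the
# array type's broadcast runs.
@. u = u + dt * (a * k1 + b * k2)
```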

If anyone else is interested in this, let's discuss more. I'd be willing to collaborate, since I think a very interesting benchmarking paper could come out of this.

jkozdon commented 4 years ago

Hello @ChrisRackauckas, thanks for taking a look at our code. We'd love to have DifferentialEquations.jl support in CLIMA. As you pointed out, if this were done we would get a number of benefits (such as adaptive time stepping).

If you (or someone else) could show us how to use the package with CLIMA that would be awesome.

I'd be particularly interested to see the GPU performance (number of kernel calls, memory bandwidth achieved, GPU register usage, etc) for our handrolled LSRK methods versus using DifferentialEquations.jl.

When we initially started building what has become CLIMA, we did take a look at DifferentialEquations.jl but quickly ran into problems with GPU and MPI support. Since we (@lcw, @fxgiraldo, and I) had written a number of these PDE solvers in the past and knew how to quickly generate efficient code for the low storage methods, we fell back on that. We did try to design the interface to our ODE solvers in a similar manner to DifferentialEquations.jl, with the thought that someone could add DifferentialEquations.jl as a backend in the future.

Most of our array functionality is in what we call the MPIStateArray, which does support broadcasting. I don't know whether that covers everything you'd need to do this, but it's the place to start looking. (The MPIStateArray was originally just a wrapper to hide MPI calls, not a "real" Julia array. It has morphed and evolved to be more of a true Array, but it's certainly not complete, and there is all sorts of weirdness to it and things that are not used.)
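For reference, here is a minimal sketch of the broadcast hooks a wrapper array type needs, following Base's broadcasting-customization interface; MyStateArray is a stand-in, not the actual MPIStateArray, and the real implementation also has to handle MPI communication and GPU storage:

```julia
struct MyStateArray{T, A <: AbstractVector{T}} <: AbstractVector{T}
    data::A
end

# Basic AbstractArray interface.
Base.size(Q::MyStateArray)                 = size(Q.data)
Base.getindex(Q::MyStateArray, i::Int)     = Q.data[i]
Base.setindex!(Q::MyStateArray, v, i::Int) = (Q.data[i] = v)

# Broadcasting hooks: declare a style and say how to allocate results.
Base.BroadcastStyle(::Type{<:MyStateArray}) = Broadcast.ArrayStyle{MyStateArray}()

function Base.similar(bc::Broadcast.Broadcasted{Broadcast.ArrayStyle{MyStateArray}},
                      ::Type{T}) where {T}
    MyStateArray(similar(Vector{T}, axes(bc)))
end

# With these defined, solver-style broadcasts work on the wrapper:
Q, dQ = MyStateArray(rand(5)), MyStateArray(zeros(5))
@. Q = Q + 0.1 * dQ
```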

lcw commented 4 years ago

Thanks for looking through the code! I think it would be great to be able to use DifferentialEquations.jl. I believe @mwarusz added the necessary bits for our array type to do broadcasting so it should be ready to go 🤞 .

As it stands now, the 14-stage low storage RK scheme gives us our best time to solution.

I have always wanted to see if we could get a factor-of-2 improvement in time to solution using a PI-controller-based adaptive time stepping scheme.
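For context, one common textbook form of such a PI controller is sketched below; the exponents, safety factor, and clamping limits vary between methods and libraries, and this is not DifferentialEquations.jl's exact controller:

```julia
# err and err_prev are normalized error estimates (a step is accepted when
# err <= 1); k is the order of the error estimator.
function pi_controller_dt(dt, err, err_prev, k; safety = 0.9,
                          beta1 = 0.7 / k, beta2 = 0.4 / k)
    factor = safety * err^(-beta1) * err_prev^beta2
    return dt * clamp(factor, 0.2, 5.0)   # limit how fast dt may change
end
```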

However, I don't see where else the efficiency (in terms of time to solution) would come from by using DifferentialEquations.jl. Could you expand on this more? For example, this is the first time I have heard the term efficient used together with SSP methods. SSP methods are great at what they do, but they typically limit the time step size severely if you want their SSP property.

Currently the limit on IMEX is not from the ODE solver but from the linear solve. We need better preconditioners to be able to increase the time step size. I don't think DifferentialEquations.jl can help us there.

fxgiraldo commented 4 years ago

Hi @ChrisRackauckas, sorry to inundate you with replies, but it must mean we are interested in exploring the possibility of using DifferentialEquations.jl within CLIMA. Adding on to what @jkozdon and @lcw have already written, I would be interested in seeing how some of the methods already implemented in DifferentialEquations.jl could help us (e.g., those optimized specifically for compressible Navier-Stokes or DG discretization - I assume these are from David Ketcheson since we went back and forth on some of these methods a while back). Like Lucas mentioned, the SSP property is great but too time-step-size restrictive, since we want to minimize our time-to-solution as much as possible. I would also be interested in testing out the extrapolation methods already implemented (since I have a keen interest in them) and would also be interested in hearing anyone's experience with the exponential time integrators already implemented.

ChrisRackauckas commented 4 years ago

I'd be particularly interested to see the GPU performance (number of kernel calls, memory bandwidth achieved, GPU register usage, etc) for our handrolled LSRK methods versus using DifferentialEquations.jl.

Your kernels should be large enough that the solver overhead should not even be noticeable. We have OrdinaryDiffEq.jl as our full-featured library, and then we have SimpleDiffEq.jl as our hand-unrolled testing code on the same interface (with less feature support; that library was created for testing compilers and overhead). From our tests, OrdinaryDiffEq.jl does have overhead on the order of about 5-10 ns IIRC, which is noticeable on scalar ODEs (due to checking a few extra scalar branches for things like "any NaNs?"), but it falls off quickly and isn't significant by the time you get to 10 ODEs. tl;dr: for PDEs this shouldn't have an overhead.

So one thing we could do is just make sure that SimpleDiffEq has the low storage RK method you want, and keep that for direct comparisons. But you'll find that GPU kernel calls outweigh the simple checks. In fact, DiffEq's internal implementation of things like RK methods is essentially an unrolled version, for example:

https://github.com/JuliaDiffEq/OrdinaryDiffEq.jl/blob/master/src/perform_step/low_storage_rk_perform_step.jl#L369-L402

This is why we have separate algorithm types for the different methods instead of just building one that works from a tableau matrix.

You can see from that code example too that our only interaction with your data is through calling your f or through a broadcast, so if you manage that then it's parallel (which is how we support distributed and GPU arrays).

I have always wanted to see if we could get a factor-of-2 improvement in time to solution using a PI-controller-based adaptive time stepping scheme.

Indeed, that's what I'm curious about. We're also trying to use differentiable programming to tune PID-controlled schemes, so seeing whether you can train on one set of ODEs and get a faster climate model is one thing I really want to try (a form of transfer learning).

For example, this is the first time I have heard the term efficient used together with SSP methods. SSP methods are great at what they do, but they typically limit the time step size severely if you want their SSP property.

If you don't need SSP, then you shouldn't use an SSP method. I wasn't sure whether you were restricting to monotone solutions already. If you aren't, I'd just use the LSRKs:

http://docs.juliadiffeq.org/latest/solvers/ode_solve.html#Low-Storage-Methods-1
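Usage is along these lines (a toy sketch with a made-up right-hand side; CarpenterKennedy2N54 is the Carpenter & Kennedy (5,4) low-storage scheme that also comes up later in this thread):

```julia
using OrdinaryDiffEq

rhs!(du, u, p, t) = (du .= -u)   # made-up in-place RHS

prob = ODEProblem(rhs!, rand(1_000), (0.0, 1.0))
sol  = solve(prob, CarpenterKennedy2N54(), dt = 1e-2)   # fixed-dt low-storage RK
```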

Currently the limit on IMEX is not from the ODE solver but from the linear solve. We need better preconditioners to be able to increase the time step size. I don't think DifferentialEquations.jl can help us there.

Actually, there's a trick we use from Sundials which helps out a ton in the Newton-Krylov portion of this. You want to do a certain scaling and such before applying the preconditioner (a form of pre-preconditioning). You can somewhat parse how we're doing it here:

https://github.com/JuliaDiffEq/DiffEqBase.jl/blob/master/src/linear_nonlinear.jl#L146-L154

Also, the purpose of things like KenCarp4 is to reduce the total number of factorizations required. If you get a 4x reduction in the number of factorizations, then you still get a speedup. If you use a higher order method, you can then just take larger timesteps. If you need the values at intermediate points, you can use an embedded interpolation scheme that reuses already-computed values to produce the intermediate saves on the fly, thus still reducing the total number of factorizations. And it's these kinds of schemes where DiffEq's methods could help.
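As a sketch of what that looks like in practice (the split RHS and sizes here are made up): f1 is handled implicitly, f2 explicitly, and `saveat` output comes from the method's interpolant rather than from forcing extra steps at the output times:

```julia
using OrdinaryDiffEq

f1(du, u, p, t) = (du .= -50.0 .* u)      # stiff part, treated implicitly
f2(du, u, p, t) = (du .= 0.1 .* sin.(u))  # nonstiff part, treated explicitly

prob = SplitODEProblem(f1, f2, rand(100), (0.0, 1.0))
sol  = solve(prob, KenCarp4(), saveat = 0.1)
```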

e.g., those optimized specifically for compressible Navier-Stokes or DG discretization - I assume these are from David Ketcheson since we went back and forth on some of these methods a while back

Yup they are.

I would also be interested in testing out the extrapolation methods already implemented (since I have a keen interest in them)

I think the parallelized implicit extrapolation methods are probably not for this domain. For sufficiently small ODEs (<100) you can benefit by factorizing multiple W at once. For what you're doing, you've probably already optimized parallel factorization (parallel GMRES etc.), and if that's the case I don't think using an implicit extrapolation would be better than a well-tuned high order IMEX scheme.

would also be interested in hearing anyone's experience with the exponential time integrators already implemented.

I want to test them some more. Our higher order IMEX methods are quite well-tuned, so those have tended to fare better so far. I'll get some new results on pseudospectral PDEs up on DiffEqBenchmarks.jl, probably by tomorrow, and we'll see if that still tends to be the case. I think the KIOPS implementation (Krylov phimv for EPIRK) might need some changes to work with GPUs/distributed though:

https://github.com/JuliaDiffEq/ExponentialUtilities.jl/blob/master/src/kiops.jl

That's a planned GSoC project for next summer. Another planned GSoC project for next summer is approximate matrix factorization (generalized ADI: ADI is just approximate matrix factorization applied to the Trapezoid ODE solver), a.k.a. 1-dimensional IMEX. So we might need to add a few more things before we can try everything you want, but I think we're on a path that we could both benefit from!

ChrisRackauckas commented 4 years ago

Also, our current IMEX assumes nonlinearity on the implicit side, but we should have the linear handling done in the next month, which is why we want to start trying big PDEs.

ChrisRackauckas commented 4 years ago

I was talking with @sandreza and we realized that doing this will make it so that adjoint sensitivity analysis would be possible on CLIMA, so we found a research student (UROP) and will be making this the goal of the UROP.

tapios commented 4 years ago

@ChrisRackauckas Thanks for your thoughts. What almost every compressible climate model uses for timestepping combines:

  1. 1D implicit solves for sound and gravity waves in the vertical
  2. Explicit substepping of sound and gravity waves in the horizontal
  3. Explicit RK steps for other 3D dynamics
  4. Longer timesteps for slow processes (e.g., radiation)

Additionally, there is usually also a need for an intermediate, faster substep (e.g., for microphysics), with timescales between the sound waves and the dynamical scales. Usually, the solvers for 1) and 2) are lower order than the rest (we do not need high accuracy for sound and gravity waves; stability and time-to-solution are the primary concerns).
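A very rough sketch of that overall structure is below; every function is a no-op placeholder (not a ClimateMachine or DifferentialEquations.jl API), and only the nesting of the loops is the point:

```julia
update_radiation!(Q)                  = Q        # slow physics, longer timestep
explicit_slow_tendency(Q, stage)      = zero(Q)  # slow 3D dynamics tendency
horizontal_fast_substep!(Q, slow, dt) = Q        # explicit acoustic/gravity substep
vertical_implicit_solve!(Q, dt)       = Q        # 1D implicit column solves

function multirate_step!(Q, dt; rk_stages = 1:5, nsubsteps = 6, do_radiation = false)
    do_radiation && update_radiation!(Q)
    for stage in rk_stages                 # explicit RK over the slow dynamics
        slow = explicit_slow_tendency(Q, stage)
        for _ in 1:nsubsteps               # substep the fast modes within the stage
            horizontal_fast_substep!(Q, slow, dt / nsubsteps)
            vertical_implicit_solve!(Q, dt / nsubsteps)
        end
    end
    return Q
end
```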

So we need a multirate strategy that combines different explicit steps with vertical implicit solves. Is there anything in DifferentialEquations.jl that may help with this?

ChrisRackauckas commented 4 years ago

So we need a multirate strategy that combines different explicit steps with vertical implicit solves. Is there anything in DifferentialEquations.jl that may help with this?

Not quite yet, but that's one of our goals. "Standard ODE" stuff is pretty close to complete, so spending lots of time on specialized methods for PDEs is probably the right way to go. I am planning to apply for a CSSI grant whose aim will be to production-ize and scale methods for stochastic equations (SDEs and jump processes) and specialized time stepping methods for PDEs. So while we don't have everything a climate model needs yet, we're willing to build it. So let's do this in parallel: build ties into Clima, figure out exactly what routines you need, start benchmarking, and gear next year's DiffEq GSoC towards methods that Clima needs.

Here is how we plan to get there. What we will be expanding is our SplitODEFunction interface, currently documented here:

http://docs.juliadiffeq.org/latest/types/split_ode_types.html http://docs.juliadiffeq.org/latest/solvers/split_ode_solve.html

Essentially, instead of taking in a single f from the user, we have a way to take f = f1 + f2. That gets us IMEX and exponential integrators, but I think we need to take that further by allowing tuples of functions; once we have that level of information we can specialize a lot more.

1D implicit solves for sound and gravity waves in the vertical

There are two things we want to get done here. First, we want total specialization of the nonlinear solvers so that, when the implicit part is linear, they just solve the linear system once. We are currently going back through the nonlinear solvers to make this happen.
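Concretely, a toy sketch of that first point (all values made up): for a linear implicit part f(u) = A*u, the implicit stage equation z = c + dt*γ*A*z reduces to a single linear solve, so no Newton iteration is needed:

```julia
using LinearAlgebra

n  = 4
A  = -rand(n, n)   # linear implicit operator
c  = rand(n)       # known (explicit) contribution to the stage
dt = 0.1
γ  = 0.25          # implicit stage coefficient

z = (I - dt * γ * A) \ c   # one linear solve replaces the Newton loop
```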

Secondly, these splitting methods (such as ADI) fall under approximate matrix factorization:

https://github.com/JuliaDiffEq/OrdinaryDiffEq.jl/issues/407

The plan here is that if we have something like SplitImplicitFunction(A1,A2,A3), then our Jacobian-handling technique will know to treat them separately using the splitting approximations.

Explicit substepping of sound and gravity waves in the horizontal
Explicit RK steps for other 3D dynamics
Longer timesteps for slow processes (e.g., radiation)

We currently don't have this, but it's on our mind. I want to implement all of this:

http://www-sop.inria.fr/tropics/MAIDESC/LIVRAISONS.d/T5-D2bis.pdf

One big direction is extrapolated multirate methods:

https://link.springer.com/article/10.1007/s10915-012-9662-z

But I think that's not quite the direction to go. Extrapolation is usually used when you are error-limited. If the stepping is stability-limited, then there is an interesting set of techniques for stabilizing Runge-Kutta methods. One of them mixes an implicit part with a stabilized explicit part:

https://epubs.siam.org/doi/abs/10.1137/S1064827503429168

We just finished an implementation of IRKC in OrdinaryDiffEq.jl, but that only handles f = f_implicit + f_explicit and does a form of stabilization on the explicit part so it can still handle some semi-stiff terms. I think an interesting way forward would be to allow f = f_i1 + f_i2 + f_i3 + f_e1 + f_e2 + f_e3, doing approximate matrix factorization on the implicit parts and separate explicit stabilization that attempts to stabilize each of the dynamics individually. That should minimize the number of f calls more than extrapolated techniques, and it is an interesting research direction I'd like to take.

Pinging @kanav99 @yingboma just so you're aware.

tapios commented 4 years ago

Not quite yet, but that's one of our goals. "Standard ODE" stuff is pretty close to complete, so spending lots of time on specialized methods for PDEs is probably the right way to go. I am planning to apply for a CSSI grant whose aim will be to production-ize and scale methods for stochastic equations (SDEs and jump processes) and specialized time stepping methods for PDEs. So while we don't have everything a climate model needs yet, we're willing to build it. So let's do this in parallel: build ties into Clima, figure out exactly what routines you need, start benchmarking, and gear next year's DiffEq GSoC towards methods that Clima needs.

Good plan!

We currently don't have this, but it's on our mind. I want to implement all of this:

http://www-sop.inria.fr/tropics/MAIDESC/LIVRAISONS.d/T5-D2bis.pdf

Have a look at this multirate infinitesimal step method too: https://journals.ametsoc.org/doi/pdf/10.1175/MWR-D-13-00068.1

It looks promising for what we need (and the paper also has a good discussion of the standard method in most compressible atmosphere models).

mwarusz commented 4 years ago

Hi @ChrisRackauckas. I took a stab at using DifferentialEquations.jl in CLIMA. I am happy to report that without too much effort I was able to run our code using DifferentialEquations.jl time integrators. So far I have only tried explicit solvers with constant dt, but everything seems to work, including running the code with multiple MPI processes and on the GPU. You can see how I implemented solve! here.

Next I did some benchmarking of our hand-rolled Carpenter & Kennedy (5,4) LSRK versus the one in DifferentialEquations.jl. Early results indicate that our code is about 15% faster, both on the CPU and on the GPU. Using nvprof on the GPU, I found that, in addition to the rhs evaluation, the DifferentialEquations.jl code calls 3 kernels per stage, whereas CLIMA calls only one additional kernel. You can see it here, in the traces of one call to solve! for both code variants. The sequence ptxcall_knl_nodal_update_aux, ptxcall_volumerhs, and ptxcall_facerhs is one evaluation of the rhs; the other kernels come from the ODE solver.

Am I missing some configuration option that would speed up the DifferentialEquations.jl-based solver? Does DifferentialEquations.jl have a way of fusing broadcasts that occur sequentially into a single loop/kernel call? This may become more important when using fancier time integrators than LSRKs; see for example our ARK stage_update! kernel.

Anyway, these results are very encouraging!

ChrisRackauckas commented 4 years ago

Hi @ChrisRackauckas. I took a stab at using DifferentialEquations.jl in CLIMA. I am happy to report that without too much effort I was able to run our code using DifferentialEquations.jl time integrators. So far I have only tried explicit solvers with constant dt, but everything seems to work, including running the code with multiple MPI processes and on the GPU. You can see how I implemented solve! here.

Awesome, that's great to see! What was the code you used to run this? We can get a benchmark set up and just start playing with it. Let us know if there are a few different interesting cases we should be looking at.

Am I missing some configuration option that would speed up the DifferentialEquations.jl-based solver? Does DifferentialEquations.jl have a way of fusing broadcasts that occur sequentially into a single loop/kernel call? This may become more important when using fancier time integrators than LSRKs; see for example our ARK stage_update! kernel.

Nope, you didn't miss anything. That was just us being careless because we didn't optimize for this domain yet. You can see it right here:

https://github.com/JuliaDiffEq/OrdinaryDiffEq.jl/blob/master/src/perform_step/low_storage_rk_perform_step.jl#L63-L71

@kanav99 came up with a really cool solution (Williamson wrapper) so that the 2N low-storage methods could reach their theoretical memory limit without requiring an alternative ODE form from the user, but we didn't minimize the number of broadcasts yet. You can see the 3 broadcasts and thus 3 kernels in that loop. That shouldn't be difficult to fix.
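To illustrate the difference (this is not the actual OrdinaryDiffEq code): each top-level dotted assignment is its own broadcast, so on the GPU each one launches a separate kernel, whereas the fused form launches one:

```julia
n = 1_000
u, k, tmp  = rand(n), rand(n), zeros(n)
a1, a2, dt = 0.3, 0.7, 0.01

# Three dotted assignments: three broadcasts, three kernel launches.
@. tmp = a1 * k
@. tmp = tmp + a2 * u
@. u   = u + dt * tmp

# The same update as a single dotted expression: one broadcast, one kernel.
@. u = u + dt * (a1 * k + a2 * u)
```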

ChrisRackauckas commented 4 years ago

Hey @mwarusz, with https://github.com/JuliaDiffEq/OrdinaryDiffEq.jl/pull/933 the Low Storage 2N methods should now be 1 kernel + one f call per stage. Can you share your testing code?

mwarusz commented 4 years ago

Hi @ChrisRackauckas, I pushed out a version of the code I used for benchmarking on the CLIMA mw/differentialequations branch. You can run this example from the top-level CLIMA directory.

I had a quick look at JuliaDiffEq/OrdinaryDiffEq.jl#933. Right now I don't think it will work with CLIMA, since it seems to be based on the WilliamsonWrapper type, which we can't use for now. That's because our rhs functions depend not only on setindex! but also on some internal members of MPIStateArrays. Hopefully this will be an easy fix, but I am going to a workshop next week, so I won't be able to work on this for a while.

ChrisRackauckas commented 4 years ago

The limitation to setindex! is just for accessing du. If that needs to be different, we can expand the API for the sorts of things you need. It's just a nice way of handling the fact that you need to treat the 2N methods specially in order to get minimal memory use. The other way to do this is to define a different AbstractODEProblem, which would be fine as well. I'll need to dig in and see if we can extend the WilliamsonWrapper approach in a way that works for what you need.

ChrisRackauckas commented 4 years ago

Could we get some help reviving this and understanding what a good test case would be? I'd like to start using Clima as a testing ground for ODE methods. I'm trying to understand the interface here by looking at https://github.com/CliMA/ClimateMachine.jl/blob/master/src/Numerics/ODESolvers/StrongStabilityPreservingRungeKuttaMethod.jl. So to hook up with DiffEq, we'd need to:

tapios commented 4 years ago

@thomasgibson Could you please help answer @ChrisRackauckas's questions above?

ChrisRackauckas commented 4 years ago

I'll try throwing together a WIP PR and see where it lands, and we can work from there. If we get this set up, we'd like to start using Clima as the testing ground for new IMEX and multirate methods (@kanav99).

simonbyrne commented 4 years ago

I am not sure what the https://github.com/CliMA/ClimateMachine.jl/blob/master/src/Numerics/ODESolvers/StrongStabilityPreservingRungeKuttaMethod.jl#L85-L87 parts are.

It is for the LSRK methods. The exact functionality is explained here: https://github.com/CliMA/ClimateMachine.jl/blob/195bdaa323086c67a7aa4d1b5d99612f077ff3fe/src/Numerics/ODESolvers/StrongStabilityPreservingRungeKuttaMethod.jl#L75-L78, but I think it is similar to what WilliamsonWrapper in OrdinaryDiffEq does (though I don't think it quite matches).

what's increment = false?

rhs!(dQdt, Q, param, t; increment=true) calculates dQdt .+= f(Q,param,t), instead of dQdt .= f(Q,param,t).
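A schematic of that keyword (compute_tendency is a trivial stand-in for the actual DG tendency kernels, not the real code):

```julia
compute_tendency(Q, param, t) = -Q   # placeholder tendency

function rhs!(dQdt, Q, param, t; increment = false)
    tendency = compute_tendency(Q, param, t)
    if increment
        dQdt .+= tendency   # add onto whatever is already stored in dQdt
    else
        dQdt .= tendency    # overwrite dQdt
    end
    return dQdt
end
```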

jkozdon commented 4 years ago

It is for the LSRK methods.

It is actually for the multirate methods, adding the slow tendency/rhs as forcing in the fast tendency/rhs.