Zymrael / awesome-neural-ode

A collection of resources regarding the interplay between differential equations, deep learning, dynamical systems, control and numerical methods.
MIT License
1.26k stars 144 forks

Adding a reference #3

Closed patrick-kidger closed 4 years ago

patrick-kidger commented 4 years ago

Hello - thanks for this resource! I find it really helpful.

I've been meaning to ask if it would be acceptable to add a reference to our recent paper https://arxiv.org/abs/2005.08926. I'm happy to open a pull request adding this if you like.

Zymrael commented 4 years ago

Thank you for the kind words. Added! I was just about to add Neural CDEs and some other Neural ODE papers from the recent flood of interesting arXiv preprints.

I would also like to extend an invite to you and your collaborators to try out the torchdyn library, and perhaps contribute if you find it useful. Neural CDEs are already in the planned feature roadmap, so it'd be helpful to receive your input on where you'd feel they best fit in the library.

patrick-kidger commented 4 years ago

Thank you!

Thanks also for the invite. We've also been thinking about the best way to make NCDEs work as a library, as there's a bunch of edge cases (of course!) that need handling correctly. This is something we're still figuring out, but once we're there I'd already planned to get in touch re: torchdyn.

patrick-kidger commented 3 years ago

@Zymrael Coming back around to this now with the release of torchcde. You might find it interesting to have a look at what we're doing there with respect to torchdyn.

The key components are:

The actual implementation is all very straightforward. In large part torchcde is about trying to create an API that encourages getting the data in the right format for a neural CDE (including time as a channel; how to handle irregular data) as it's easy to get that silently wrong.
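For concreteness, here's roughly what "time as a channel" means in plain PyTorch (a toy sketch with made-up shapes, not torchcde's actual API):

import torch

# Toy batch of series: shape (batch, length, channels), observed at irregular times.
batch, length, channels = 32, 50, 3
x = torch.randn(batch, length, channels)
t = torch.sort(torch.rand(batch, length), dim=1).values   # irregular observation times

# Append observation time as an extra channel so the interpolated control path
# carries the time parametrisation; downstream models then see channels + 1 inputs.
x_with_time = torch.cat([t.unsqueeze(-1), x], dim=-1)      # (batch, length, channels + 1)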

Zymrael commented 3 years ago

@patrick-kidger Thanks for sharing; the tutorials are well made (though I tend to prefer notebooks to scripts) and I enjoy your commenting style. On our end, we are planning an early October release with a lot of focus on Neural SDEs, sequence / latent variants and some utilities for control. I have a baby Neural CDE version somewhere in my WIP folders and was just recently thinking about whether I should push it to a usable state for torchdyn. Any particular reason for an entirely separate package, e.g. are you planning to expand CDEs in future research with more interpolation styles (or perhaps using some other learning technique to obtain $\dot X(t)$)? I would be somewhat hesitant to add yet another dependency to torchdyn, especially given that we could use interpolation schemes for other purposes and so might end up having to implement them anyway.

It'd be nice to get your feedback on the direction you think would be best for us as a community. I see two options: (1) a "shattered" but highly interconnected collection of specialized libraries, SciML-style, where we'd have a group of active maintainers working on different packages (such as yourself with your cool work on torchsde and now torchcde). This is where we'd perhaps eventually see torchcontrol (which we're working on), torchsysid, etc. The other direction (2) would be to avoid splitting up into too many separate dependencies, thus streamlining long-term maintenance and the addition of new models and methods. I suppose this would be a plus for new users, who could jump in without having to navigate dozens of libraries.

Even if we go for (1), perhaps torchdyn could serve as a sort of "glue" entry-point package with tutorials and higher-level abstractions to draw new people in? My only worry with (1) is that so far we really do not have a very active open-source community for maintenance tasks; most code contributions for neural differential equations are one-time throwaway conference .zip implementations, using different conventions and reimplementing ODEBlock and ODEFunc every time. If we could clone 10 Patricks, perhaps we'd be set :D

We'd enjoy chatting about torchcde and the issues above on Zoom at some point, let me know.

patrick-kidger commented 3 years ago

This post is a bit of a behemoth.

torchcde

Thanks, I'm glad you like the tutorials.

In terms of having a separate package - for myself, this is mostly intended as a convenience. For example I'd like to have a cdeint function, just so that I don't have to manually plumb things together before calling odeint, and so that it checks and throws an error when I mess up tensor shapes or something. (Happens all too frequently.)
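To make that concrete, here's a rough sketch of the kind of plumbing such a wrapper hides (not torchcde's actual implementation; _VectorField, cdeint_sketch and the interpolant X with a .derivative(t) method are illustrative assumptions):

import torch
from torchdiffeq import odeint

class _VectorField(torch.nn.Module):
    # Rewrites the CDE dz = f(z) dX(t) as the ODE dz/dt = f(z) dX/dt that odeint expects.
    def __init__(self, func, X):
        super().__init__()
        self.func, self.X = func, X

    def forward(self, t, z):
        # func(z): (batch, hidden, input_channels); X.derivative(t): (batch, input_channels)
        return torch.einsum('bij,bj->bi', self.func(z), self.X.derivative(t))

def cdeint_sketch(X, func, z0, t):
    # The kind of shape check a real cdeint would do before handing off to odeint.
    assert z0.dim() == 2, "z0 should have shape (batch, hidden)"
    return odeint(_VectorField(func, X), z0, t)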

For others, I'm hoping it serves as an easy way to use NCDEs without having to worry too much about the details of what a Stieltjes integral actually is, and as a place to clearly document "please remember to add time as a channel" and "irregular data is easy". Regardless, I'll be the first to admit that if you do know these things, then there's nothing terribly complicated actually happening here.

Looking to the future, there's quite a lot of CDE theory and I wouldn't be surprised if some of it ends up making sense to implement in torchcde at some point. (You can already see this with the logsignature stuff - c.f. the bottom of the README - paper on that to appear shortly.) There's definitely a lot still to be done with NCDEs.

ecosystem

Haha, I wish there were 10 Patricks - my papers might already be written!

I think my preference would be to put everything in one spot. The reason is that it would be nice to have all the diffeq-related things together, akin to torchvision, torchaudio, torchtext. This also means we'd have a single place to implement some of the things that, as you say, appear in conference implementations, which usually just involve a cloned-and-hacked version of torchdiffeq that is then forgotten about.

I've not spoken to Ricky about this (he may well feel differently!), but if I had a magic wand I think I'd put everything into, and hugely expand, torchdiffeq, including SDEs, CDEs etc. It's got the market share, and it's got dibs on IMO the best name, relative to the other torch* family of libraries. I've worked with Ricky on torchdiffeq before (the current version is in large part my own rewrite-from-the-ground-up) so he may be amenable. Then maybe we all write a software paper saying "this exists".

Practically speaking I'm not expecting anything like that to happen quickly. If nothing else, torchsde is in a state of high flux atm, and not suitable for merging with anything. It also raises concerns about making sure that [ode|sde|cde]int all use the same interface, and whether or not we should try to re-use code between them.

I think in this model, I'd look to follow the torch/torchvision pattern. Have torchdiffeq.odeint, and have torchdiffeq.models.neural_ode18, and avoid any intermediate-scale stuff like NeuralODE just like there's no canned ResNet other than resnet18.

On interpolation schemes in particular, I have pondered splitting that out into a torchinterp. I already maintain a torchcubicspline repo that I hacked together from an early version of torchcde; superseding that with torchinterp is something I may do after the ICLR deadline once I get a bit more time.
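To be concrete, the interface I have in mind is basically just an evaluate/derivative pair; a throwaway piecewise-linear sketch in plain PyTorch (LinearInterpolant is hypothetical, not an existing package):

import torch

class LinearInterpolant:
    # Hypothetical torchinterp-style object: a piecewise-linear path over data of
    # shape (batch, length, channels) observed at strictly increasing times t (length,).
    def __init__(self, t, data):
        self.t, self.data = t, data
        self.dim = data.size(-1)

    def _interval(self, s):
        # Index i with t[i - 1] <= s <= t[i]; s is a scalar solver time.
        return int((self.t < s).sum().clamp(1, len(self.t) - 1))

    def evaluate(self, s):
        i = self._interval(s)
        w = (s - self.t[i - 1]) / (self.t[i] - self.t[i - 1])
        return torch.lerp(self.data[:, i - 1], self.data[:, i], w)

    def derivative(self, s):
        # Constant on each interval for a piecewise-linear path.
        i = self._interval(s)
        return (self.data[:, i] - self.data[:, i - 1]) / (self.t[i] - self.t[i - 1])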

Side note: I don't think a comparison to the shattered approach of SciML is actually fair, because even there they have DifferentialEquations.jl, which puts basically all the differential equation stuff in one place. The things that get spun out into separate libraries seem to be non-diff-eq things, like quadrature or banded matrices or what have you.

implications

What all of that actually boils down to, at least in the short term:

Zymrael commented 3 years ago

A behemoth of a post, but an interesting one :D. I'll provide my own beastly wall-of-text below:

ecosystem: The unifying torchvision-inspired approach is something we can get behind, and as you mention it seems to be a successful model used by many major torch* libraries, each revolving around a specific data modality. I agree that the most pressing issue for us as a community is to find a way to merge the solver packages, with as much code reuse as possible.

As you can imagine, however, I do not agree regarding intermediate NeuralODE-level APIs. I'd argue that here the model to follow perhaps shouldn't be torchvision, as our application domains and requirements are very different. I could definitely see a canned model section for density estimation or time series classification, but as we start approaching scientific / control applications the user needs access to an intuitive intermediate API to specify partial models, controllers, higher-order dynamics, and more. A large portion of our user base works in these domains, and from the feedback we've received these specific features are what make torchdyn particularly useful to them. If it were up to us, we'd keep the intermediate torchdyn API (under .models), optimizing and using it internally to design canned .models.neural_ode18-type implementations for classical ML tasks.

Something interesting about Julia and SciML: DifferentialEquations.jl does indeed glue together a lot of methods, but they maintain a separate DiffEqFlux (torchdyn-like) library and more, such as DiffEqOperators.jl, DiffEqBase.jl and StochasticDiffEq.jl. DifferentialEquations.jl itself is largely a collection of reexports:

  @reexport using DiffEqBase
  @reexport using DiffEqNoiseProcess
  @reexport using RecursiveArrayTools
  @reexport using SteadyStateDiffEq
  @reexport using StochasticDiffEq
  @reexport using OrdinaryDiffEq
  @reexport using BoundaryValueDiffEq
  using Sundials
  @reexport using DelayDiffEq
  @reexport using DiffEqCallbacks
  @reexport using DiffEqJump
  @reexport using DiffEqFinancial
  @reexport using MultiScaleArrays
  @reexport using DiffEqPhysics
  @reexport using DimensionalPlotRecipes
  @reexport using ParameterizedFunctions

Three big components of their ecosystem are the specific solver packages, DiffEqSensitivity.jl for sensitivity methods, and DiffEqFlux.jl for neural differential equations. I think it's fair to call this ecosystem shattered, but it's not necessarily a bad way to operate (and it's working great for them). This split inspired the general structure of torchdyn: torchdyn.models, torchdyn.sensitivity and torchdyn.solve (paused, as we've decided to keep relying on torchdiffeq and torchsde, though we've been toying with the idea of using Julia as an alternative backend).

implications: As you mention, it might be interesting to know whether the UofT guys have non-Python plans regarding the diffeq ecosystem; last time we talked to Xuechen he mentioned JAX as a better solution for specific technical challenges (e.g. computing jets) but still expressed a preference for PyTorch. I'd be very interested in asking Ricky about his thoughts on future directions, and I suspect we'll learn more soon after we find out what he's been working on :)

torchcde: Here is a small example CDE I hacked together quickly the other day:

import torch
import torch.nn as nn
# DepthCat and NeuralDE come from torchdyn; Interpolant stands in for any
# interpolation object exposing .dim and .derivative(t) (e.g. a cubic spline over `data`).

class ControlledFunc(nn.Module):
    def __init__(self, f, X):
        super().__init__()
        self.f, self.X = f, X

    def forward(self, x):
        # DepthCat(1) appends the integration time t as the last channel of the state.
        x, t = x[:, :-1], x[:, -1:]
        # f(x) parametrizes a (state_dim, X.dim) matrix per batch element...
        x = self.f(x).reshape(x.shape[0], x.shape[1], self.X.dim)
        # ...which is contracted with dX/dt to yield an ODE vector field.
        return torch.einsum('bij,bj->bi', x, self.X.derivative(t))

f = nn.Sequential(
        nn.Linear(2, 64),
        nn.Tanh(),
        nn.Linear(64, 4))

X = Interpolant(data)  # placeholder interpolant built over the observed path `data`

func = nn.Sequential(DepthCat(1), ControlledFunc(f, X))
cde = NeuralDE(func, solver='dopri5')
z = torch.randn(100, 2)
out = cde(z)

It turns out that for our specific needs a torchinterp suite would be very valuable, allowing us to bridge the cdeint/odeint/sdeint gap at model definition time. I understand the reasoning and incentives behind providing a dedicated cdeint method and a separate package; personally I think it is much clearer (for end-users as well) to show that CDEs are really still ODEs or SDEs with a specific vector field form that can be determined at the nn.Module level. I could see that changing when dedicated CDE numerical methods are added, though. Practically speaking, our current plan is to use torchcde as a torchinterp placeholder and still rely on odeint calls, which preserves compatibility with the NeuralODE ecosystem we've built up over the last few months (e.g. depth-varying parameters, energy models, normalizing flows, GDEs, latent models...) and the upcoming control features.

Good luck with your ICLR submissions, looking forward to seeing what you've been up to in the past few months. We'll get back in touch to organize a Zoom meeting after the ICLR storm passes!

patrick-kidger commented 3 years ago

Haha - alright, here's a much shorter response.

ecosystem: Yeah, I guessed you might not agree regarding the NeuralODE-level APIs. That's not a point I feel strongly about though.

Ah, you clearly know a bit more about SciML than I do. I've not used it, but everything about it looks pretty good. If it wasn't for the fact that the ML community has standardised on Python then I'd probably make the switch to that myself.

implications: I'm guessing you mean "non-PyTorch". ;) Yeah, JAX has much better autodiff than PyTorch. This is something we've been running into in torchsde: we'd like to compute batch-vjps and PyTorch just doesn't support them.

On the flip side - and I've not experimented with this at all - unlike JAX, I think PyTorch supports task-based parallelism via torch.jit.fork, and I've pondered using that to have each batch element use a dedicated solver (rather than the current approach of having everything make integration steps at the pace of the slowest). Do you know if DifferentialEquations.jl does anything similar?
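For reference, the shape of what I have in mind (an untested sketch; solve_one is a hypothetical single-trajectory solver, e.g. a closure around odeint for one initial condition):

import torch

def solve_batch(solve_one, z0_batch, t):
    # Fork one asynchronous task per batch element so each trajectory can take
    # integration steps at its own pace, then gather and stack the results.
    futures = [torch.jit.fork(solve_one, z0, t) for z0 in z0_batch]
    return torch.stack([torch.jit.wait(fut) for fut in futures])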

torchcde: Interesting; do you use the convention that time is always the last channel of the state? That touches on a bit of a subtlety to this whole procedure (one of the things cdeint covers up for you): there are two different notions of time. One is the "solver time" used in the integration steps (and gets fed into X.derivative), the other is the "data time" that should be part of the data.

I won't try and get into that now though. What you're proposing makes sense to me.

Anyway, let's pick all this up whenever we have that meeting.

(PS: since you mention it - if you don't already know, torchdiffeq now supports grid_points and eps options that make doing piecewise-constant depth-varying parameters efficient as part of a single solve.)
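(A hedged sketch of how I'd expect that to look; I'm assuming grid_points/eps are forwarded through odeint's options dict for the adaptive solvers, so check the torchdiffeq docs for the exact spelling:)

import torch
from torchdiffeq import odeint

class PiecewiseFunc(torch.nn.Module):
    # Piecewise-constant depth-varying parameters: one set of weights per depth segment.
    def __init__(self):
        super().__init__()
        self.fs = torch.nn.ModuleList([torch.nn.Linear(2, 2) for _ in range(2)])

    def forward(self, t, z):
        idx = 0 if t < 0.5 else 1          # which depth segment we're in
        return torch.tanh(self.fs[idx](z))

func, z0, t = PiecewiseFunc(), torch.randn(16, 2), torch.tensor([0., 1.])
# Assumption: grid_points/eps make the adaptive solver step exactly onto the
# discontinuity at t = 0.5 (nudged by eps) rather than straddling it.
z = odeint(func, z0, t, method='dopri5',
           options={'grid_points': torch.tensor([0.5]), 'eps': 1e-5})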

johnp-4dvanalytics commented 3 years ago

@patrick-kidger @ChrisRackauckas

> On the flip side - and I've not experimented with this at all - unlike JAX, I think PyTorch supports task-based parallelism via torch.jit.fork, and I've pondered using that to have each batch element use a dedicated solver (rather than the current approach of having everything make integration steps at the pace of the slowest). Do you know if DifferentialEquations.jl does anything similar?

I think https://github.com/SciML/DiffEqGPU.jl has a dedicated solver for each thread, although I'm not sure. I believe that they are still working on getting the autodiff working for it though: https://github.com/SciML/DiffEqGPU.jl/pull/72

Btw I have been trying to do an implementation of the torchcde method in DiffEqFlux: https://github.com/SciML/DiffEqFlux.jl/issues/408

I was able to get a bit of a speedup (~5.5x) over the PyTorch version by using DiffEqFlux, but I think the main way to get any other major speedups would be having the dedicated solvers for each thread.

ChrisRackauckas commented 3 years ago

https://diffeqflux.sciml.ai/dev/examples/optimization_sde/ is a tutorial that demonstrates task-based parallelism on SDEs. You can run that on a cluster too just by adding EnsembleDistributed: it's quite fun. It's just using https://diffeq.sciml.ai/stable/features/ensemble/ . All of the derivative options work out except EnsembleGPU + adjoints which, as mentioned by @johnp-4dvanalytics, is SciML/DiffEqGPU.jl#72: it needs a special adjoint because the "tasks" on a GPU aren't cleanly separated (i.e. we're composing the differential equations together for the user, and the adjoint currently fails in the "split back apart" code...). But EnsembleGPU allows for process-parallelism as well, so that'll be what supports multi-GPU on a cluster for 50,000 trajectories at once at different points (or whatever you want, of course).

Note that with the task-based parallelism you can assign a GPU per thread, and it should "just work". https://juliagpu.gitlab.io/CUDA.jl/usage/multigpu/ describes how to do that, so you could use EnsembleThreads + have each thread have a separate GPU to locally do GPU-per-ODE training (instead of GPU ensembling), which I think is more appropriate for the CDE. I haven't tried it, but it should work and would be a cool tutorial demo.

Also while I'm here, there's been a lot of movement in the interop areas. GPU support now works on R through ModelingToolkit being used as a JIT: https://www.stochasticlifestyle.com/gpu-accelerated-ode-solving-in-r-with-julia-the-language-of-libraries/, and that means adjoints should work. Someone should test adjoints through MTK-compiled R code. That almost means neural ODEs in R can be trained with Julia directly, without writing any Julia code... but MTK currently scalarizes operations (because of its link to Modelica symbolic compiler systems). That's an issue we're going to overcome by November (it's linked to a closed project that will require it). So... neural ODEs in R are almost done. And the R version will automatically install Julia in the background too: https://github.com/Non-Contradiction/JuliaCall/pull/135 . So R should have very good SciML support hopefully by November.

The reason we did R first is that ModelingToolkit on Python hit a few issues, which the devs are going to help me solve. Then we should be able to do similar demos from Python as well, with similar limitations, but supporting something as straightforwardly defined as a neural ODE should be fine (once non-scalarizing support is completed).