SciML / DiffEqFlux.jl

Pre-built implicit layer architectures with O(1) backprop, GPUs, and stiff+non-stiff DE solvers, demonstrating scientific machine learning (SciML) and physics-informed machine learning methods
https://docs.sciml.ai/DiffEqFlux/stable
MIT License

High dimensional Mean Field Games in JuliaDiffEq ecosystem #113

Open finmod opened 4 years ago

finmod commented 4 years ago

@ChrisRackauckas This MFGnet.jl package is significant: it applies Lagrangian methods to MFG and thereby lessens the curse of dimensionality in this class of problems (problems from dimension 2 up to 100 solved in seconds).

The package has just been published at https://github.com/EmoryMLIP/MFGnet.jl.git, together with the paper at https://arxiv.org/abs/1912.01825. It could benefit from cross-breeding, since the NN learning in both cases (DiffEqFlux and MFGnet) is based on Flux. The MFG formulation, equations (25) of the paper, could be written as it appears on paper using ModelingToolkit and DiffEqOperators, and a MeanFieldGame API could be made available in DiffEqFlux.

Other potential efficiency gains: integrators (RK4 versus Tsit5 and others), optimizers (BFGS versus LBFGS and others), MC integration, etc. @swufung
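
A minimal sketch of the integrator swap, assuming the standard DiffEqFlux `NeuralODE` interface; the network, dimensions, and step size here are placeholders rather than anything from MFGnet:

```julia
# Minimal sketch: in DiffEqFlux the time integrator is just an argument,
# so comparing RK4 against Tsit5 is a one-line change.
using DiffEqFlux, OrdinaryDiffEq, Flux

nn    = Chain(Dense(2, 16, tanh), Dense(16, 2))   # toy 2-D dynamics
tspan = (0.0f0, 1.0f0)

node_tsit5 = NeuralODE(nn, tspan, Tsit5(), saveat = 0.1f0)
node_rk4   = NeuralODE(nn, tspan, RK4(), adaptive = false, dt = 0.05f0, saveat = 0.1f0)

u0 = Float32[1.0, 0.0]
node_tsit5(u0)   # adaptive fifth-order Tsit5 tableau
node_rk4(u0)     # same network, fixed-step classical RK4
```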

ChrisRackauckas commented 4 years ago

Yeah, it would be interesting to set it up with a higher order method and possibly use a better adjoint than backprop here. However, I think most of what they're doing is stepping simultaneously from different points to train the continuous NN approximation, which means they'd just have to hook into the stepping tableaus directly.

swufung commented 4 years ago

@finmod I agree. We could benefit from having a more flexible setup, including different time-steppers and optimization algorithms; this would allow us to find the right setup for specific MFGs. There are also many things our prototype code does not exploit, such as the parallelism inherent in these Lagrangian schemes and GPUs. @lruthotto

lruthotto commented 4 years ago

As @swufung said, increasing the flexibility of the package is a top priority of ours. There are a number of ways to benefit more from the Julia ecosystem. Personally, I think the optimization is the most important one. Another would be improving efficiency, especially by using the GPU packages out there. Also, our first prototype is far from Julia programming best practices.
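
For the GPU point, a minimal sketch of how that typically composes, assuming Flux's `gpu` helper and a CUDA-capable setup; the network and dimensions are placeholders:

```julia
# Sketch: Flux's `gpu` moves the network parameters to the GPU, and the
# DiffEq solvers then operate on GPU arrays as long as the state does too.
using Flux, DiffEqFlux, OrdinaryDiffEq
# plus CUDA.jl (or CuArrays.jl with older Flux versions) for an actual GPU

nn_gpu = Chain(Dense(2, 32, tanh), Dense(32, 2)) |> gpu
node   = NeuralODE(nn_gpu, (0.0f0, 1.0f0), Tsit5(), saveat = 0.1f0)

u0 = gpu(Float32[1.0, 0.0])   # the initial state must live on the GPU as well
node(u0)
```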

I'm not against using the time-steppers from DiffEqFlux; however, I wouldn't expect the adjoint method to give us much of a boost here. We could have a longer discussion about this at some point if you like. For now, I would recommend this paper: https://arxiv.org/abs/1902.10298. Also, for a direct comparison of optimize-discretize (which is used in Neural ODEs) and discretize-optimize (based on backprop), our student @donken has created a quick demo using an example in DiffEqFlux; see https://imgur.com/nWxwVoe.

ChrisRackauckas commented 4 years ago

> I'm not against using the time-steppers from DiffEqFlux; however, I wouldn't expect the adjoint method to give us much of a boost here. We could have a longer discussion about this at some point if you like. For now, I would recommend this paper: https://arxiv.org/abs/1902.10298. Also, for a direct comparison of optimize-discretize (which is used in Neural ODEs) and discretize-optimize (based on backprop), our student @donken has created a quick demo using an example in DiffEqFlux; see https://imgur.com/nWxwVoe.

On problems which are less stable than that, the adjoint's gradients will be much better and the memory overhead is reduced. https://arxiv.org/abs/2001.04385 mentions that the newer adjoint forms were actually required to make some of the examples computable. For a scaling comparison, https://arxiv.org/abs/1812.01892 shows that the adjoint scales much better to large problems than backprop. At this point it's fairly clear that the newer adjoints almost always give a performance advantage.
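
To make the switch concrete, a rough sketch of how the sensitivity choice is exposed through the `sensealg` keyword (names from DiffEqSensitivity; the toy dynamics and loss are placeholders, and the exact calling pattern can differ between versions):

```julia
using OrdinaryDiffEq, DiffEqSensitivity, Flux

f(u, p, t) = p .* u                                   # toy scalar dynamics
prob = ODEProblem(f, [1.0f0], (0.0f0, 1.0f0), [0.5f0])

function loss(p)
    sol = solve(prob, Tsit5(), p = p, saveat = 0.1f0,
                sensealg = InterpolatingAdjoint())    # continuous adjoint
    # sensealg = ReverseDiffAdjoint() would instead backprop through every
    # solver step (discretize-then-optimize)
    sum(abs2, Array(sol))
end

Flux.gradient(loss, [0.5f0])   # gradient through the solve via the adjoint
```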

Was this example made with DiffEqFlux v1.0 using BFGS? The bigger issue with the earlier versions is that ADAM is a fairly weak optimizer here and has somewhat erratic performance.
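
For reference, a sketch of the ADAM-then-BFGS refinement pattern, assuming a scalar `loss(p)` like the one sketched above and DiffEqFlux's `sciml_train`; keyword names may vary by version:

```julia
using DiffEqFlux, Flux, Optim

# ADAM to get into the right basin, then BFGS to refine with accurate gradients
res_adam = DiffEqFlux.sciml_train(loss, [0.5f0], ADAM(0.05), maxiters = 300)
res_bfgs = DiffEqFlux.sciml_train(loss, res_adam.minimizer, BFGS(initial_stepnorm = 0.01))
```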

But yes, the best thing to do would be multiple shooting, which we have been using the library for but don't yet have a dedicated training function for.
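
In the meantime a multiple-shooting loss can be hand-rolled; a sketch, where the dynamics, data, and windows are hypothetical placeholders and each window's initial state is an extra optimization variable:

```julia
using OrdinaryDiffEq

f(u, p, t) = p .* u                          # placeholder dynamics
ts   = range(0.0f0, 1.0f0, length = 11)      # observation times
data = rand(Float32, 1, length(ts))          # fake observations of one state

windows = [1:4, 4:8, 8:11]                   # index ranges sharing their endpoints

function ms_loss(p, u0s; λ = 1.0f0)
    sols = map(eachindex(windows)) do i
        idx  = windows[i]
        prob = ODEProblem(f, u0s[:, i], (ts[idx[1]], ts[idx[end]]), p)
        solve(prob, Tsit5(), saveat = ts[idx])
    end
    l = 0.0f0
    for (i, idx) in enumerate(windows)
        l += sum(abs2, Array(sols[i]) .- data[:, idx])               # fit each window
        i > 1 && (l += λ * sum(abs2, sols[i-1][end] .- u0s[:, i]))   # glue the seams
    end
    return l
end

u0s = data[:, [1, 4, 8]]        # initialize each window's state from the data
ms_loss([0.5f0], u0s)           # p and u0s would be optimized jointly
```

Both `p` and the per-window states `u0s` are trained together, with the continuity penalty pulling the windows back into a single consistent trajectory.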

lruthotto commented 4 years ago

Thanks for these references. I will read them with interest later. Again, I wouldn't be against trying adjoints, multiple shooting, and the other tools you've built in this package. Thanks to the work done here and by other people in the Julia community, this shouldn't be too much work.

As far as I remember, the optimization is done with ADAM here. Sure, switching to BFGS is a better idea as long as the gradients are accurate.