SciML / DifferentialEquations.jl

Multi-language suite for high-performance solvers of differential equations and scientific machine learning (SciML) components. Ordinary differential equations (ODEs), stochastic differential equations (SDEs), delay differential equations (DDEs), differential-algebraic equations (DAEs), and more in Julia.
https://docs.sciml.ai/DiffEqDocs/stable/

GPU Parallel Monte Carlo #308

Closed: agerlach closed this issue 4 years ago

agerlach commented 6 years ago

I have a need to run some large-scale Monte Carlo simulations with random initial conditions and parameters for small ODE systems (<10 states). Is it possible to do parallel Monte Carlo simulation using the GPU?

I see that you support GPUs, but in all the examples the parallelization seems to happen within the equations of motion (EOMs), so you need very large systems, as opposed to each GPU thread solving the EOMs independently on different data.

I have implemented large-scale Monte Carlo simulations with VexCL and Boost ODEINT, but event detection there is essentially nonexistent. I will admit that I am a Julia novice, but my interest in learning more is driven primarily by this awesome library.

Thanks.

ChrisRackauckas commented 6 years ago

The parallel Monte Carlo simulation tooling is built around running solve commands in parallel. As of right now you cannot compile the ODE solver to run inside a GPU kernel, so this will not work. That said, we are trying to utilize Cassette.jl along with CUDANative.jl to do exactly this. CUDANative.jl lets you compile Julia code directly to CUDA, but not all code qualifies. Cassette.jl would allow us to dynamically modify the AST of the ODE solver to turn it into GPU-compilable code, build the .ptx via CUDANative.jl, and then call the ODE solver on each GPU core. This is still a ways away, though.

Until we have that, yes, our GPU parallelism is at the vector level. If I'm not mistaken, that's the same as the VexCL/ODEINT strategy, which GPU-parallelizes through GPU-based vector types? If not, I'd like to hear some details about what they're doing. Given that, you'd parallelize a Monte Carlo simulation by making one giant matrix whose columns correspond to runs with different parameters.
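To make that concrete, here is a minimal CPU sketch of the giant-matrix layout (the toy oscillator, sizes, and names are mine, just for illustration; swapping the arrays for a GPU array type moves the broadcasts onto the device):

```julia
using OrdinaryDiffEq

n  = 1_000                     # number of Monte Carlo trajectories
u0 = rand(2, n)                # column i = random initial condition of run i
p  = rand(n)                   # one random parameter draw per run

# Toy linear oscillator u1' = u2, u2' = -p*u1, broadcast across all columns,
# so a single solve advances the entire ensemble with a shared time step.
@views function f!(du, u, p, t)
    du[1, :] .= u[2, :]
    du[2, :] .= -1 .* p .* u[1, :]
end

prob = ODEProblem(f!, u0, (0.0, 1.0), p)
sol  = solve(prob, Tsit5())
```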

Of course, this runs into issues if the event handling depends on the solution values, since running a bunch of concurrent simulations in one solve means lots of events firing and very small time steps for everyone. In that case you'd want the first solution.

But yes, parallelizing lots of runs with small ODEs is on our mind and we need a good solution. I may end up coding specialized versions of the ODE solver that can compile to the GPU more easily specifically for this case, if I cannot get the more general compilation to work soon after v0.7 is released.

agerlach commented 6 years ago

Bummer. Thanks for the info, though. Other than this one missing "killer feature", I would say that DifferentialEquations.jl is the tool I have been looking for for years. Hopefully some of my experience using VexCL + ODEINT will be of help.

> Until we have that, yes, our GPU parallelism is at the vector level. If I'm not mistaken, that's the same as the VexCL/ODEINT strategy, which GPU-parallelizes through GPU-based vector types?

Yes and no. VexCL certainly does parallelism at the vector level, but the vectors are defined differently. DifferentialEquations.jl parallelizes across the state vector; VexCL parallelizes across an ensemble vector. For example, say you want to run n Monte Carlo simulations of a system with m states. In VexCL you first create m arrays of length n (one per state, or a vector of vectors) and initialize them with random initial conditions. Those m arrays get passed as arguments to the generated kernel, and then n GPU threads run, each operating only on its own index into the length-n arrays.

For my application the dynamics are quite slow, so I've been able to get away with a fixed-time-step RK4 integrator... for now. It does sound like adaptive time stepping should work in VexCL, though: you would just need to keep track of an array of time steps, one per thread.

Each kernel call simply does a single step update for the whole ensemble, wrapped in a host-side for loop until the final time is reached. Event detection is handled fairly crudely by running a separate kernel after each step to check whether the desired zero crossing occurred; if so, linear interpolation is performed between the current and prior states. I haven't implemented time-step refinement because, so far, this has been sufficient for my application.
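In Julia-flavored pseudocode, the per-thread logic is roughly the following (a CPU analogue I'm sketching here, not the actual VexCL kernels; `g` is the event function):

```julia
# One fixed-step RK4 update for a single trajectory's state u.
function rk4_step(f, u, p, t, dt)
    k1 = f(u, p, t)
    k2 = f(u + dt / 2 * k1, p, t + dt / 2)
    k3 = f(u + dt / 2 * k2, p, t + dt / 2)
    k4 = f(u + dt * k3, p, t + dt)
    u + dt / 6 * (k1 + 2k2 + 2k3 + k4)
end

# Step once, then check for a zero crossing of the event function g;
# on a crossing, locate it by linear interpolation between the two states.
function step_with_event(f, g, u, p, t, dt)
    unew = rk4_step(f, u, p, t, dt)
    gu, gnew = g(u), g(unew)
    if gu * gnew < 0                  # sign change => crossing in [t, t + dt]
        θ = gu / (gu - gnew)
        return t + θ * dt, u + θ * (unew - u), true
    end
    return t + dt, unew, false
end
```

On the GPU, each thread runs this body for its own index into the ensemble arrays while the host loops until the final time.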

This was all done using VexCL's symbolic types. For more details on how I generated the kernels for my problem, see VexCL issue #202. I also discussed event detection in VexCL issue #203.

ChrisRackauckas commented 6 years ago

I see. Yes, you can do the VexCL version in DiffEq by using a GPUArray of static arrays and broadcasting your derivative kernel over that GPUArray{SArray}. But that will never scale to event handling, since the time steps are global, i.e. all integrations advance with the same dt, so having a different event on each one would slow everything down enormously... I wouldn't even try this approach with events.
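A rough sketch of that idea (written with present-day CUDA.jl names as an assumption; at the time of this thread the stack was CuArrays.jl/CUDANative.jl, and the toy derivative is just for illustration):

```julia
using CUDA, StaticArrays

# Per-trajectory derivative; each u is the full (static) state of one run.
f(u, p) = @SVector [u[2], -p * u[1]]

n  = 10_000
us = CuArray([rand(SVector{2, Float32}) for _ in 1:n])  # ensemble of states
ps = CuArray(rand(Float32, n))                          # one parameter per run

dt = 0.01f0
# One fused broadcast = one GPU kernel taking a step for every trajectory at
# once. Note the single global dt: exactly the limitation with events above.
us = us .+ dt .* f.(us, ps)
```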

But I have a solution in mind. It will need Julia v1.0, and since that will be out in less than two months, let's wait until then. Ping me after Julia v1.0 comes out and I'll get an adaptive integrator that auto-compiles to the GPU with event handling.

agerlach commented 6 years ago

@ChrisRackauckas Awesome! Thanks for all your great work.

titusfranz commented 6 years ago

@ChrisRackauckas Thank you for your amazing package; I'm already using it heavily for my simulations. Julia 1.0 is now out. Is there any news about the adaptive integrator that auto-compiles to the GPU with event handling?

Thanks in advance.

ChrisRackauckas commented 6 years ago

What I have planned here for the short term won't do event handling. Doing event handling would require getting the whole of OrdinaryDiffEq.jl to compile to the GPU, which needs Cassette.jl to be released (so we can overdub the array allocations).

ChrisRackauckas commented 5 years ago

Found some precedent: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5710898

agerlach commented 5 years ago

@ChrisRackauckas Good find. I actually stumbled on that paper a few years ago when I started using VexCL+ODEINT but completely forgot about it. FYI, the author has source code here: LibBi

My apologies for my radio silence on this. I somehow missed your previous posts. What do you have in mind for your short-term plan?

ChrisRackauckas commented 5 years ago

My short-term plan is a direct implementation of Tsit5 in SimpleDiffEq.jl using only static arrays and only allowing saveat, so it can interpolate onto a fixed-size output array. That should be CUDANative-compilable. From there we can do Vern9 and a low-storage method. That should be a good setup while we play with getting full integrators to compile. These will not have the integrator interface, because the difficult part is precisely the integrator mutable struct's interaction with CUDANative, but they should be able to handle ODEs without events. Then a simple Rosenbrock23 and RKN methods handle the stiff side, and that's a decent GPU offering.

@YingboMa could probably help with getting some of it done. We have applications and funders who are interested so that will drive it pretty soon.

ChrisRackauckas commented 5 years ago

The SimpleTsit5 and SimpleATsit5 implementations should be GPU-compatible, based on discussions with the CUDANative.jl devs. The PR that added the adaptive one can be found here:

https://github.com/JuliaDiffEq/SimpleDiffEq.jl/pull/10

Essentially, that solve call can just be compiled to a CUDANative kernel, since all of the mutables have a lifespan confined to the solve call.
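For reference, a hedged usage sketch of that static-array, saveat-only adaptive solve (solver and keyword names as of SimpleDiffEq.jl around that PR; treat them as version-dependent):

```julia
using OrdinaryDiffEq, SimpleDiffEq, StaticArrays  # OrdinaryDiffEq for ODEProblem/solve

lorenz(u, p, t) = @SVector [p[1] * (u[2] - u[1]),
                            u[1] * (p[2] - u[3]) - u[2],
                            u[1] * u[2] - p[3] * u[3]]

u0 = @SVector [1.0, 0.0, 0.0]
p  = @SVector [10.0, 28.0, 8 / 3]
prob = ODEProblem(lorenz, u0, (0.0, 10.0), p)

# A fixed saveat grid bounds the output size up front, so the adaptive
# integrator never allocates inside the solve, which is the property that
# lets the entire call compile as a single GPU kernel.
sol = solve(prob, SimpleATsit5(), dt = 0.1, saveat = 0.0:0.1:10.0)
```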

ChrisRackauckas commented 4 years ago

This is now being handled at https://github.com/JuliaDiffEq/DiffEqGPU.jl
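For anyone landing here, a usage sketch in the style of that package's early README (the exact EnsembleGPUArray constructor signature has changed across versions, so treat this as illustrative):

```julia
using OrdinaryDiffEq, DiffEqGPU

function lorenz!(du, u, p, t)
    du[1] = p[1] * (u[2] - u[1])
    du[2] = u[1] * (p[2] - u[3]) - u[2]
    du[3] = u[1] * u[2] - p[3] * u[3]
end

u0 = Float32[1.0, 0.0, 0.0]
p  = Float32[10.0, 28.0, 8 / 3]
prob = ODEProblem(lorenz!, u0, (0.0f0, 10.0f0), p)

# Re-randomize the parameters for each trajectory; EnsembleGPUArray batches
# all trajectories into one large GPU array behind the scenes.
prob_func = (prob, i, repeat) -> remake(prob, p = rand(Float32, 3) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
sol = solve(monteprob, Tsit5(), EnsembleGPUArray(), trajectories = 10_000,
            saveat = 1.0f0)
```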