Open cems2 opened 4 years ago
See https://docs.juliadiffeq.org/latest/features/ensemble/ for sufficiently small differential equations.
> Furthermore, at some scale, solving all the ODEs at once will exceed the cache memory size on the CPU and become less efficient.
Not necessarily on the cache part. In fact, if you do large batches you'll get very good cache efficiency because of how it'll be hitting BLAS. It's probably the most cache efficient route as long as it fits in memory.
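(To illustrate the BLAS point with a minimal sketch, using made-up sizes and an affine placeholder for the network RHS: stacking the minibatch column-wise turns many matrix-vector products into one gemm call, which BLAS blocks for cache reuse.)

```julia
# Minimal sketch (placeholder sizes): if the RHS per sample is roughly W*u,
# stacking a 640-element minibatch as columns turns 640 gemv calls into one gemm.
W = randn(32, 32)
U = randn(32, 640)   # one column per minibatch member
dU = W * U           # a single BLAS gemm over the whole batch, blocked for cache reuse
```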
Basic inspiration: neuralODE is NOT explicitly mini-batch aware. You can kludge it right now, but the kludge won't scale, precisely because it is not mini-batch aware. So: introduce mini-batch smartness into neuralODE.
Problem: Observationally, I have come to believe the ODE solvers are, as a rule, single-threaded. (If I am wrong about that, then this suggestion makes no sense!) I realize there are some weakly dual-core ODE solvers as special, but not especially useful, cases.
Opportunity for Minibatches:
When processing minibatches, one is running many ODE solves on completely independent problems.
Naively, each of these could be spun up as an independent solve via multiprocessing (aka embarrassingly parallel).
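A minimal sketch of that naive route using the ensemble interface linked above (EnsembleThreads for a multi-core CPU; EnsembleDistributed would use separate processes). The RHS, sizes, and initial conditions are just placeholders:

```julia
using DifferentialEquations

W = randn(8, 8)
u0s = [randn(8) for _ in 1:640]        # 640 independent minibatch members
f(u, p, t) = p * u                     # placeholder RHS standing in for the network

base = ODEProblem(f, u0s[1], (0.0, 1.0), W)
prob_func(prob, i, repeat) = remake(prob, u0 = u0s[i])   # one trajectory per member
ensemble = EnsembleProblem(base, prob_func = prob_func)

# Each trajectory is an independent solve with its own adaptive step size.
sol = solve(ensemble, Tsit5(), EnsembleThreads(), trajectories = length(u0s))
```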
However: while that solution would work for a multi-core CPU, it would not work for a GPU, nor for massively parallel SIMD execution on a CPU.
The reason vector-parallel acceleration (GPU or SIMD) fails here is that it strictly requires every thread to be doing the exact same math op at the same time (same instruction, many data).
Thus it turns out, paradoxically, that with the current ODE solvers the only efficient way to get large-scale parallelization is to treat the minibatch as one giant many-variable ODE in which the sets of variables happen to be uncoupled. The downside is that, besides generally being a blunt knife of a kludge, in an adaptive-step-size solver the ODE in the minibatch with the worst stiffness sets the finest time step size for ALL of the ODEs. Furthermore, at some scale, solving all the ODEs at once will exceed the cache memory size on the CPU and become less efficient.
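For concreteness, a minimal sketch of that kludge (placeholder RHS and sizes): the whole minibatch becomes the columns of one state matrix, so a single adaptive solve, and hence a single step size, covers every member.

```julia
using DifferentialEquations

W = randn(8, 8)
U0 = randn(8, 640)                     # whole minibatch stacked column-wise
f(u, p, t) = p * u                     # same placeholder RHS applied to every column

prob = ODEProblem(f, U0, (0.0, 1.0), W)
# Error control is taken over the whole matrix, so the stiffest column drives dt for all 640.
sol = solve(prob, Tsit5())
```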
How to do better with very little pain: In practice we put up with this because, at small scales, the results are not so awful despite the obvious inefficiency. But to really scale this up to supercomputers or big GPUs, we need to do better.
So the proposed solution here is to find a compromise. As a concrete example: if one has a minibatch of size 640, you could split it into 20 sets of 32 threads each. Each thread batch of 32 then fills a warp on a GPU, and you can have 20 streams running. Within each stream we still have the issue of the stiffest ODE setting the time step for the other 31 members of the warp, but the other 19 streams are not affected by the limiting case in that one stream. Thus the problem will scale better as we make it larger and larger. In CPU SIMD, where the vector width is typically 4, the optimal thread-batch size will depend more on memory caches than on the SIMD vector size. (And this was just an illustrative example: for GPUs the optimal batch size will vary from 32 depending on the algorithm's complexity; e.g., a fixed-time-step integrator would more likely be limited by the local memory size.)
(Further gains accrue: once something like this is available, clever people will organize their minibatches so the cases are sorted by stiffness, and the stiffest ones will all land in the same warp, maximizing efficiency.)
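A rough sketch of what that compromise could look like from user code today (this is not an existing DiffEqFlux/NeuralODE feature, and `stiffness_estimate` is a hypothetical per-sample heuristic): sort by stiffness, chunk into groups of 32, batch each chunk into one matrix-valued problem, and run the chunks through the ensemble interface.

```julia
using DifferentialEquations

W = randn(8, 8)
u0s = [randn(8) for _ in 1:640]
f(u, p, t) = p * u                                   # placeholder RHS

# Hypothetical heuristic: group members of similar stiffness into the same chunk.
order = sortperm(stiffness_estimate.(u0s))
chunks = [reduce(hcat, u0s[order[i:i+31]]) for i in 1:32:640]   # 20 chunks of 32 columns

base = ODEProblem(f, chunks[1], (0.0, 1.0), W)
prob_func(prob, i, repeat) = remake(prob, u0 = chunks[i])
ensemble = EnsembleProblem(base, prob_func = prob_func)

# Each chunk shares one adaptive step size internally, but a stiff member only
# slows down its own chunk; the other 19 run independently (threads here,
# GPU streams in the proposed version).
sol = solve(ensemble, Tsit5(), EnsembleThreads(), trajectories = length(chunks))
```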
User perspective: From the outside, the whole process would look no different than the current one. It's just an internal tweak to divide large minibatches up into chunks that fit nicely into warps and streams.
And this is possible because one is taking advantage of the fact that each member of the mini-batch is an independent ODE set. Currently we have no way to exploit that fact.
Workaround: I note that one can in fact do something like this manually. That is, divide your large-scale minibatches up into right-sized sub-batches, then run them in different processes. However, the stream management on a GPU or a CPU is not really going to be pretty or automatic. This is especially true if one is trying to use the existing neuralODE functions through the API provided by NeuralODE; this has to be done within those DiffEqFlux functions, not as user wrappers around them. Thus it makes a lot more sense to make neuralODE minibatch-aware of the intrinsic independence and separability of the separate ODEs.