ICB-DCM / pyPESTO

python Parameter EStimation TOolbox
https://pypesto.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Parallelize Parallel Tempering #352

Open yannikschaelte opened 4 years ago

yannikschaelte commented 4 years ago

We have a for loop over temperatures that could become a "parfor" (multiprocessing / multithreading).

Check that parallelizing does not yield a performance loss compared to serial execution, which can happen for cheap simulators.

yannikschaelte commented 4 years ago

A trivial implementation via multiprocessing pools is done in a few lines; however, I saw that the pickling in each iteration yields considerable overhead (probably due to the objective or the trace). Thus, speed-ups this way are likely only achievable for really expensive models, or when using batch simulation.
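For illustration, roughly what such a trivial pool-based variant looks like; this is a minimal sketch, where `single_chain_step` and the per-chain state layout are placeholders, not pyPESTO code:

```python
import multiprocessing as mp


def single_chain_step(args):
    """Placeholder: advance one tempered chain by a single sweep."""
    chain_state, beta = args
    # ... propose / accept at inverse temperature beta ...
    return chain_state


def sample_parallel_naive(chain_states, betas, n_iterations):
    """Naive parfor over temperatures: one pool task per chain per iteration.

    Every map() call pickles the full per-chain state (objective, trace, ...),
    which is where the per-iteration overhead mentioned above comes from.
    """
    with mp.Pool(processes=len(betas)) as pool:
        for _ in range(n_iterations):
            chain_states = pool.map(
                single_chain_step, zip(chain_states, betas)
            )
            # swapping between neighboring temperatures happens here,
            # in the main process
    return chain_states
```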

An alternative is to manually define the parallel worker processes, pickle the required objects only once at the beginning and end, and after each iteration share across processes only what needs to be shared (i.e. the last chain entries), with communication possibly by writing to and reading from a shared pipe or queue. This would however require considerably more effort to implement, if someone wants to have a look.
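A rough sketch of that persistent-worker idea, under the assumption that each chain's state is a dict with a `last_entry` field; `advance_chain` is a placeholder for the actual single-chain update:

```python
import multiprocessing as mp


def advance_chain(state, last_entry, beta):
    """Placeholder for one sweep of a single tempered chain."""
    state["last_entry"] = last_entry  # real code would propose/accept here
    return state


def chain_worker(idx, state, task_queue, result_queue):
    """Persistent worker: keeps the full chain state (objective, trace, ...)
    local and only exchanges the last chain entry with the main process."""
    while True:
        msg = task_queue.get()
        if msg is None:  # poison pill -> shut down
            break
        last_entry, beta = msg  # possibly swapped entry and updated beta
        state = advance_chain(state, last_entry, beta)
        result_queue.put((idx, state["last_entry"]))


def sample_persistent(chain_states, betas, n_iterations):
    """Main process: sends (entry, beta) out, collects last entries back,
    performs swaps and beta updates between iterations."""
    task_queues = [mp.Queue() for _ in betas]
    result_queue = mp.Queue()
    workers = [
        mp.Process(target=chain_worker, args=(i, s, q, result_queue))
        for i, (s, q) in enumerate(zip(chain_states, task_queues))
    ]
    for w in workers:
        w.start()

    last_entries = [s["last_entry"] for s in chain_states]
    for _ in range(n_iterations):
        for q, entry, beta in zip(task_queues, last_entries, betas):
            q.put((entry, beta))
        results = dict(result_queue.get() for _ in betas)
        last_entries = [results[i] for i in range(len(betas))]
        # swap entries between neighboring temperatures and adapt betas here

    for q in task_queues:
        q.put(None)
    for w in workers:
        w.join()
    return last_entries
```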

curtywang commented 3 years ago

Hi pyPESTO team -- I've been slowly giving this a try over on my fork of the repo. It's still in the testing stages, but hopefully I can clean it up and open a PR. I have been running a fairly expensive multimodal problem, which definitely requires parallelization. Currently it works with multiprocessing/multiprocess, but I'd like to get MPI working as well.

Pickling everything in each iteration definitely yielded exponentially growing overhead, so I've changed it to share only the last chain entries and pass them around.

paulstapor commented 3 years ago

Hi, just to give you a quick response: that sounds very good, thanks a lot! @yannikschaelte or others who are more into the sampling code will have a look at some point, but we're obviously very happy about contributions (in particular about useful ones ;) ).

yannikschaelte commented 3 years ago

Hi @curtywang, thanks! Can you already report any performance metrics? Ideally one would hope for a runtime reduction close to 1 / #cores, certainly with some overhead due to postprocessing.

If I see correctly, you create the parallel processes at the beginning of the sample() routine, then publish a sample_1 task to all workers in each iteration, and collect results via a queue. The main process then performs the swapping and the beta update, and distributes the updated variables to the workers (I could not find the details of these steps yet, but the diff against master has also diverged a bit).

This is pretty much the implementation I had in mind to keep pickling / interprocess communication overhead low. Before merging to master, I would like to make the parallelization engine more abstract, i.e. have minimal interference with the main sample function, such that one can easily replace the multiprocessing-based sampler by MPI, redis, or another distribution engine, as you already pointed out, or fall back to sequential execution. Probably one could define an abstract base class wrapping around the core tasks, similar to our optimization engines. But it looks good already!
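For instance, a hypothetical interface along those lines; class and method names here are illustrative only, not the actual pyPESTO engine classes:

```python
from abc import ABC, abstractmethod


class SamplingEngine(ABC):
    """Hypothetical backend interface: the sampler only talks to this, so
    multiprocessing, MPI, redis, or sequential execution can be swapped in."""

    @abstractmethod
    def start(self, chain_states):
        """Spawn workers (if any) and hand each its initial chain state."""

    @abstractmethod
    def step(self, last_entries, betas):
        """Advance all chains by one iteration, given the (possibly swapped)
        last entries and current betas; return the new last entries."""

    @abstractmethod
    def finalize(self):
        """Shut workers down and collect the full chain states."""


class SequentialEngine(SamplingEngine):
    """Fallback without any parallelization, same interface."""

    def __init__(self, step_fn):
        # step_fn(state, last_entry, beta) -> new state, supplied by the sampler
        self.step_fn = step_fn

    def start(self, chain_states):
        # assumes each state is a dict with a "last_entry" field
        self.states = list(chain_states)

    def step(self, last_entries, betas):
        self.states = [
            self.step_fn(s, e, b)
            for s, e, b in zip(self.states, last_entries, betas)
        ]
        return [s["last_entry"] for s in self.states]

    def finalize(self):
        return self.states
```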

curtywang commented 3 years ago

I don't have any strong metrics yet, other than that my problem actually runs in a reasonable amount of time; however, I should be able to benchmark soon after adding a switch argument to the sampler creation or the sample function. Yes, the main process is the one that does the swapping, and the new samples are then sent back. The identification of the workers is a bit rudimentary: each worker gets a task and puts it back if it is not its own (there is nothing like a rank with Python's Manager classes, sadly), since there was considerable overhead (roughly 5 s per iteration) when I used Manager.list() for the children to grab data from.
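The get-and-put-back pattern looks roughly like this; a sketch with placeholder payloads, not the actual code in my fork:

```python
def tagged_worker(worker_id, task_queue, result_queue):
    """Poor man's 'rank': tasks carry a target worker id; a worker that pulls
    a task meant for someone else puts it straight back on the queue."""
    while True:
        target_id, payload = task_queue.get()
        if target_id is None:
            task_queue.put((None, None))  # re-broadcast the shutdown signal
            break
        if target_id != worker_id:
            task_queue.put((target_id, payload))  # not ours, re-queue it
            continue
        # real code would advance this worker's chain using `payload` here
        result_queue.put((worker_id, payload))
```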

Indeed, probably making the "serial" class the base class and just overriding for parallelized versions (with a name like PoolParallelTempering) would make the most sense. Ideally, I'm thinking this would follow pool-based methods like what emcee has. I'll take a closer look at the optimization engine, since it's probably better to match the interface that's already there.
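A minimal sketch of that inheritance idea; the method names are made up rather than the real sampler API, and the pool only needs a map() method, as with emcee's pool argument:

```python
def _step_one(args):
    """Module-level helper so it can be pickled for pool.map()."""
    sampler, beta = args
    return sampler.step(beta)  # hypothetical single-chain sampler API


class SerialParallelTempering:
    """Hypothetical serial base class; holds one sampler per temperature."""

    def __init__(self, samplers):
        self.samplers = samplers

    def _advance_all(self, betas):
        # the only method the parallel subclass needs to override
        return [s.step(b) for s, b in zip(self.samplers, betas)]

    def sample(self, n_iterations, betas):
        for _ in range(n_iterations):
            entries = self._advance_all(betas)
            # perform swaps between neighboring temperatures
            # and adapt betas here, using `entries`


class PoolParallelTempering(SerialParallelTempering):
    """Same algorithm, but chain updates go through any pool exposing map()
    (multiprocessing, multiprocess, or an MPI pool)."""

    def __init__(self, samplers, pool):
        super().__init__(samplers)
        self.pool = pool

    def _advance_all(self, betas):
        return list(self.pool.map(_step_one, zip(self.samplers, betas)))
```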

yannikschaelte commented 3 years ago

Sounds good! Using a simple pool-like interface would be nice; one will likely just need to keep track of which worker is working on what, in order to minimize communication overhead and only share the "current" particles. (I fear emcee may not implement the most efficient version there, but I did not check in detail.)

yannikschaelte commented 3 years ago

@curtywang What is the status here? Just out of curiosity :)

curtywang commented 3 years ago

Hi @yannikschaelte, I got ParallelTemperingSampler and AdaptiveParallelTemperingSampler separated into a Pool-using interface and tested it on a few problems, which seem to work, but I want to see if I can get it to work with the engine, similar to pypesto.optimize.minimize(). I've been a little caught up and haven't been able to make progress on interfacing with the engine, since we're wrapping up the semester over here.

I was able to minimize communication overhead, but with the default non-MPI pools, each worker needs to have its own module-level globals to keep the queue pipes open. I couldn't figure out how to make it persistent and keep it from disconnecting without this (see lines 20-61 and 243-246). It also currently uses multiprocess's Manager (which uses dill to pickle), which will probably need a bit of reworking to work with mpi4py.
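The pattern I mean is roughly the standard pool-initializer trick; this is a generic sketch, not the exact code in the fork:

```python
import multiprocessing as mp

# module-level globals: each pool worker gets its own copy via the
# initializer, so the queue proxies stay connected for the worker's lifetime
_task_queue = None
_result_queue = None


def _init_worker(task_queue, result_queue):
    """Pool initializer: stash the queue proxies in module-level globals."""
    global _task_queue, _result_queue
    _task_queue = task_queue
    _result_queue = result_queue


def _worker_step(worker_id):
    """Runs inside a pool worker; uses the globals set by _init_worker."""
    entry = _task_queue.get()
    # ... advance this worker's chain with `entry` here ...
    _result_queue.put((worker_id, entry))
    return worker_id


def make_pool(n_workers):
    """Create a pool whose workers all hold open connections to the queues."""
    manager = mp.Manager()
    task_queue, result_queue = manager.Queue(), manager.Queue()
    pool = mp.Pool(
        processes=n_workers,
        initializer=_init_worker,
        initargs=(task_queue, result_queue),
    )
    return pool, task_queue, result_queue
```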