dfm / emcee

The Python ensemble sampling toolkit for affine-invariant MCMC
https://emcee.readthedocs.io
MIT License

(solved) issue with sampler stalling with multiprocessing #502

Open James11222 opened 9 months ago

James11222 commented 9 months ago

General information:

Problem description:

This is more of an announcement for others who might run into the same issue. I have already found a solution, but I thought it should be posted somewhere, and perhaps added to the docs, in case others hit it when using multiprocessing with emcee. I'm a bit of a novice with parallel processing, so please forgive me if this is obvious.

Multiprocessing has worked fine for most of my needs in emcee, but I recently ran into an issue where the sampler would stall indefinitely upon instantiation when my model used a complex external package (pyccl). The stall did not happen on my Mac but did happen on the Linux cluster. After some digging, I found that the only way around it was to change the context in which the multiprocessing Pool creates its processes. My Mac was using the spawn context, whereas Linux was defaulting to fork. The documentation uses the fork context as well, but switching to spawn fixed the stall once my model function became more complex. I have also read online that fork is being phased out as the default context in future Python versions.
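
To see which start method a given machine uses by default (the Mac vs. Linux difference described above), a quick check along these lines may help. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of emcee.

```python
import multiprocessing
import sys

# Ask the standard library which start method this interpreter defaults to.
# Note: calling get_start_method() fixes the default for the rest of the
# process, so run this check in a throwaway session if that matters.
default_method = multiprocessing.get_start_method()

# The start methods this platform supports (e.g. "fork" is absent on Windows).
available = multiprocessing.get_all_start_methods()

print(f"platform: {sys.platform}")
print(f"default start method: {default_method}")
print(f"available start methods: {available}")
```

If the default on the cluster turns out to be fork while the laptop reports spawn, that matches the situation in this issue.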

If anybody experiences this indefinite stalling when running their sampler with multiprocessing (interrupting the code after it stalls produces the following traceback)

    300         try:    # restore state no matter what (e.g., KeyboardInterrupt)
    301             if timeout is None:
--> 302                 waiter.acquire()
    303                 gotit = True
    304             else:

I'd recommend trying to change the Pool to use the spawn context manually:

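import multiprocessing

import emcee

# Create the pool from an explicit "spawn" context instead of the platform default.
with multiprocessing.get_context("spawn").Pool() as pool:
    sampler = emcee.EnsembleSampler(
        nwalkers,
        ndim,
        log_probability,
        args=(...),
        pool=pool,
        backend=backend,
    )
    # Run the sampler here, while the pool is still open.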

This fixed the issue for me after many hours of trying everything else. I didn't think it warranted a pull request, since I didn't need to modify any source code, but I hope it is useful to someone else.
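
For anyone adapting this to a standalone script, here is a hedged sketch of the same idea using only the standard library. The helper name `make_pool` is hypothetical (it is not an emcee or multiprocessing function), and `math.sqrt` merely stands in for a module-level `log_probability`. One thing to keep in mind with spawn is that each worker re-imports the main module, so pool creation should sit behind an `if __name__ == "__main__":` guard and the function sent to the workers should be importable by name.

```python
import math
import multiprocessing


def make_pool(start_method="spawn", processes=None):
    """Return a Pool built from an explicit start-method context.

    Hypothetical helper for illustration. It leaves the global default
    start method alone and only affects the pool it returns.
    """
    available = multiprocessing.get_all_start_methods()
    if start_method not in available:
        raise ValueError(
            f"start method {start_method!r} not available; choose from {available}"
        )
    ctx = multiprocessing.get_context(start_method)
    return ctx.Pool(processes=processes)


if __name__ == "__main__":
    # With "spawn", workers re-import this module, so pool creation
    # must not run at import time.
    with make_pool("spawn", processes=2) as pool:
        # math.sqrt is a stand-in for a module-level log_probability function.
        print(pool.map(math.sqrt, [1.0, 4.0, 9.0]))
```

In an emcee script, the returned pool would be passed as `pool=pool` to `emcee.EnsembleSampler`, with the sampler run inside the `with` block.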

More info I found to help me get to this conclusion can be found here: https://pythonspeed.com/articles/python-multiprocessing/