dfm / emcee

The Python ensemble sampling toolkit for affine-invariant MCMC
https://emcee.readthedocs.io
MIT License

Saving Chain with multiprocessing #403

Closed mayamkay closed 3 years ago

mayamkay commented 3 years ago

General information:

Problem description: I'm trying to find a way to use multiprocessing but also save steps of the chain such that if the run crashes or doesn't converge I can continue running it from where it left off. I've been trying to do this using backends and pool but it seems the two aren't compatible. Is there an alternative approach?

Expected behavior:

Save emcee chain using backends with pool

Actual behavior:

The file can't be accessed by the different threads, so it fails.

What have you tried so far?:

Minimal example:

from multiprocessing import Pool
import time

with Pool() as pool:
    sampler = emcee.EnsembleSampler(Nens, ndims, logposterior, args=argslist, pool=pool, backend=backend)
    start = time.time()

    sampler.run_mcmc(inisamples, Nsamples + Nburnin)

    end = time.time()
    multi_time = end - start
    print("Multiprocessing took {0:.1f} seconds".format(multi_time))
dfm commented 3 years ago

The backend and pool are definitely compatible - that's a very common use case and the backend only runs on the main thread. Please share an executable snippet of code (the one you shared is missing a lot of definitions) that reproduces the issue you're seeing.

mayamkay commented 3 years ago

Thanks for helping with this!

Here's a snippet that gives the same error I've been getting.

import emcee
import numpy as np
import time
from multiprocessing import Pool

np.random.seed(42)

# The definition of the log probability function
# We'll also use the "blobs" feature to track the "log prior" for each step
def log_prob(theta):
    log_prior = -0.5 * np.sum((theta - 1.0) ** 2 / 100.0)
    log_prob = -0.5 * np.sum(theta ** 2) + log_prior
    return log_prob, log_prior

# Initialize the walkers
coords = np.random.randn(32, 5)
nwalkers, ndim = coords.shape

# Set up the backend
# Don't forget to clear it in case the file already exists
filename = "tutorial.h5"
backend = emcee.backends.HDFBackend(filename)
backend.reset(nwalkers, ndim)

# Set up the sampler
maxN = 100  # maximum number of steps; any value works for this example

with Pool() as pool:
    sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, backend=backend, pool=pool)
    start = time.time()

    sampler.run_mcmc(coords, maxN)

    end = time.time()
    multi_time = end - start
    print("Multiprocessing took {0:.1f} seconds".format(multi_time))

The above works if I take out the backends part but with that included I get the following error:


Traceback (most recent call last):
  File "/global/homes/m/mayamkay/.conda/envs/myenv2/lib/python3.7/site-packages/h5py/_hl/files.py", line 211, in make_fid
    fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 100, in h5py.h5f.open
FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = 'tutorial.h5', errno = 2, error message = 'No such file or directory', flags = 1, o_flags = 2)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "sample.py", line 23, in <module>
    backend.reset(nwalkers, ndim)
  File "/global/homes/m/mayamkay/.conda/envs/myenv2/lib/python3.7/site-packages/emcee/backends/hdf.py", line 113, in reset
    with self.open("a") as f:
  File "/global/homes/m/mayamkay/.conda/envs/myenv2/lib/python3.7/site-packages/emcee/backends/hdf.py", line 97, in open
    f = h5py.File(self.filename, mode)
  File "/global/homes/m/mayamkay/.conda/envs/myenv2/lib/python3.7/site-packages/h5py/_hl/files.py", line 447, in __init__
    swmr=swmr)
  File "/global/homes/m/mayamkay/.conda/envs/myenv2/lib/python3.7/site-packages/h5py/_hl/files.py", line 213, in make_fid
    fid = h5f.create(name, h5f.ACC_EXCL, fapl=fapl, fcpl=fcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 120, in h5py.h5f.create
OSError: [Errno 524] Unable to create file (unable to lock file, errno = 524, error message = 'Unknown error 524')
srun: error: nid13035: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=45509977.0
dfm commented 3 years ago

Thanks - that runs just fine on my Mac and on Google colab. It looks like you're launching with slurm so I bet you're running the same script multiple times in parallel by mistake and trying to write to the same file in parallel. I'm not sure that I can be super helpful with your specific environment setup, but I'd look into exactly how you're launching the script and how you're managing resources. Hope this helps!
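(For readers: one common way this happens is an sbatch script that requests several tasks, so srun launches multiple copies of the same Python script and each copy tries to create the same HDF5 file. A hypothetical single-task layout, assuming the script is named sample.py, might look like this; the core counts are placeholders for your allocation.)

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1          # one Python process; the Pool forks workers inside it
#SBATCH --cpus-per-task=32  # give that single task the cores the Pool will use

# srun -n 1 guarantees exactly one copy of the script runs, so only one
# process ever opens the HDF5 backend file for writing.
srun -n 1 python sample.py
```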

mayamkay commented 3 years ago

Thanks so much for the help, I really appreciate it. I'll try to fix the way I'm launching the script then!

mayamkay commented 3 years ago

For the sake of anyone else looking into this: it turns out that writing HDF5 files in Cori's $HOME tends to trigger this issue. I switched to $SCRATCH and then it ran fine. Thanks again, dfm, for helping me troubleshoot this!
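(For readers who can't move off the affected filesystem: HDF5 itself, from version 1.10.1 on, lets you disable its POSIX file locking with an environment variable. This is a general HDF5 workaround for "unable to lock file" errors on network filesystems, not something verified on Cori specifically.)

```shell
# HDF5's file locking is unsupported on some network filesystems and
# surfaces as "unable to lock file, errno = 524"; disabling it before
# launching the script can avoid the failure (requires HDF5 >= 1.10.1).
export HDF5_USE_FILE_LOCKING=FALSE
```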