Closed · mayamkay closed this issue 3 years ago
The backend and pool are definitely compatible - that's a very common use case and the backend only runs on the main thread. Please share an executable snippet of code (the one you shared is missing a lot of definitions) that reproduces the issue you're seeing.
Thanks for helping with this!
Here's a snippet that gives the same error I've been getting.
import time

import emcee
import numpy as np
from multiprocessing import Pool
np.random.seed(42)
# The definition of the log probability function
# We'll also use the "blobs" feature to track the "log prior" for each step
def log_prob(theta):
    log_prior = -0.5 * np.sum((theta - 1.0) ** 2 / 100.0)
    log_prob = -0.5 * np.sum(theta ** 2) + log_prior
    return log_prob, log_prior
# Initialize the walkers
coords = np.random.randn(32, 5)
nwalkers, ndim = coords.shape
# Set up the backend
# Don't forget to clear it in case the file already exists
filename = "tutorial.h5"
backend = emcee.backends.HDFBackend(filename)
backend.reset(nwalkers, ndim)
# Number of steps to run (placeholder value for this example)
maxN = 100

# Set up the sampler
with Pool() as pool:
    sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, backend=backend, pool=pool)
    start = time.time()
    sampler.run_mcmc(coords, maxN)
    end = time.time()
    multi_time = end - start
    print("Multiprocessing took {0:.1f} seconds".format(multi_time))
The above works if I take out the backend, but with it included I get the following error:
Traceback (most recent call last):
File "/global/homes/m/mayamkay/.conda/envs/myenv2/lib/python3.7/site-packages/h5py/_hl/files.py", line 211, in make_fid
fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 100, in h5py.h5f.open
FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = 'tutorial.h5', errno = 2, error message = 'No such file or directory', flags = 1, o_flags = 2)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "sample.py", line 23, in <module>
backend.reset(nwalkers, ndim)
File "/global/homes/m/mayamkay/.conda/envs/myenv2/lib/python3.7/site-packages/emcee/backends/hdf.py", line 113, in reset
with self.open("a") as f:
File "/global/homes/m/mayamkay/.conda/envs/myenv2/lib/python3.7/site-packages/emcee/backends/hdf.py", line 97, in open
f = h5py.File(self.filename, mode)
File "/global/homes/m/mayamkay/.conda/envs/myenv2/lib/python3.7/site-packages/h5py/_hl/files.py", line 447, in __init__
swmr=swmr)
File "/global/homes/m/mayamkay/.conda/envs/myenv2/lib/python3.7/site-packages/h5py/_hl/files.py", line 213, in make_fid
fid = h5f.create(name, h5f.ACC_EXCL, fapl=fapl, fcpl=fcpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 120, in h5py.h5f.create
OSError: [Errno 524] Unable to create file (unable to lock file, errno = 524, error message = 'Unknown error 524')
srun: error: nid13035: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=45509977.0
Thanks - that runs just fine on my Mac and on Google Colab. It looks like you're launching with Slurm, so I bet you're running the same script multiple times in parallel by mistake and trying to write to the same file from every copy. I'm not sure that I can be super helpful with your specific environment setup, but I'd look into exactly how you're launching the script and how you're managing resources. Hope this helps!
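As a quick sanity check, here's a rough sketch of a guard you could drop at the top of the script to confirm how many Slurm tasks are running it. It relies only on the SLURM_NTASKS and SLURM_PROCID environment variables that Slurm sets for each task; it's just an illustration, not anything emcee needs:

import os

# Slurm sets these for every task it launches. If ntasks > 1, several
# copies of this script are all trying to open the same HDF5 file.
ntasks = int(os.environ.get("SLURM_NTASKS", "1"))
procid = int(os.environ.get("SLURM_PROCID", "0"))

if ntasks > 1:
    raise RuntimeError(
        "Launched as {0} Slurm tasks (this is task {1}); run a single task "
        "and let multiprocessing.Pool provide the parallelism.".format(ntasks, procid)
    )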
Thanks so much for the help, I really appreciate it. I'll try to fix the way I'm launching the script then!
For the sake of anyone else looking into this: it turns out that writing HDF5 files to Cori's $HOME tends to trigger this error. I switched to $SCRATCH and then it ran fine. Thanks again, dfm, for helping me troubleshoot this!
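Concretely, the only change was pointing the backend at a file on $SCRATCH, something along these lines (the SCRATCH environment variable is set on Cori; the filename is arbitrary, and nwalkers/ndim are as in the snippet above):

import os

import emcee

# Put the HDF5 backend file on $SCRATCH, where file locking works,
# instead of under the home directory.
filename = os.path.join(os.environ["SCRATCH"], "tutorial.h5")
backend = emcee.backends.HDFBackend(filename)
backend.reset(nwalkers, ndim)  # nwalkers and ndim as defined above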
General information:
Problem description: I'm trying to find a way to use multiprocessing while also saving the steps of the chain, so that if the run crashes or doesn't converge I can continue it from where it left off (see the sketch after the minimal example below). I've been trying to do this using backends and pool, but it seems the two aren't compatible. Is there an alternative approach?
Expected behavior:
Save emcee chain using backends with pool
Actual behavior:
The HDF5 backend fails to create/lock the file, so the run crashes (see the traceback above).
What have you tried so far?:
Minimal example:
from multiprocessing import Pool

with Pool() as pool:
    sampler = emcee.EnsembleSampler(Nens, ndims, logposterior, args=argslist, pool=pool, backend=backend)
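What I'd like to end up with is roughly the following resume pattern (a sketch pieced together from the emcee backend docs; the filename, walker shape, and step count are just placeholders):

import emcee
import numpy as np

def log_prob(theta):
    log_prior = -0.5 * np.sum((theta - 1.0) ** 2 / 100.0)
    return -0.5 * np.sum(theta ** 2) + log_prior, log_prior

# Reopen the backend from the earlier (possibly interrupted) run;
# do NOT call reset(), so the stored chain is kept.
filename = "tutorial.h5"
backend = emcee.backends.HDFBackend(filename)
print("Steps already stored: {0}".format(backend.iteration))

nwalkers, ndim = 32, 5
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, backend=backend)

# Continue from the last sample saved in the backend.
sampler.run_mcmc(backend.get_last_sample(), 100)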