JohannesBuchner / UltraNest

Fit and compare complex models reliably and rapidly. Advanced nested sampling.
https://johannesbuchner.github.io/UltraNest/

Issues with using h5py when running MPI #61

Closed. astrojaket closed this 2 years ago.

astrojaket commented 2 years ago

Description

I have been trying to use UltraNest with the HDF5 (h5py) point store. It works fine on a single core, but when I switch to MPI I get the error shown below. I do not get this issue when I switch the storage to "csv". Do you know how to solve this?

Traceback (most recent call last):
  File "call_dynesty.py", line 132, in <module>
    sampler = ultranest.ReactiveNestedSampler(param_names, loglike, prior_transform,log_dir="myanalysis", resume=True)
  File "/home/jtaylor/.conda/envs/exoplanet/lib/python3.7/site-packages/ultranest/integrator.py", line 1076, in __init__
    self.pointstore = HDF5PointStore(storage_filename, storage_num_cols, mode='a' if resume else 'w')
  File "/home/jtaylor/.conda/envs/exoplanet/lib/python3.7/site-packages/ultranest/store.py", line 187, in __init__
    self.fileobj = h5py.File(filepath, **h5_file_args)
  File "/home/jtaylor/.conda/envs/exoplanet/lib/python3.7/site-packages/h5py/_hl/files.py", line 447, in __init__
    swmr=swmr)
  File "/home/jtaylor/.conda/envs/exoplanet/lib/python3.7/site-packages/h5py/_hl/files.py", line 211, in make_fid
    fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 100, in h5py.h5f.open
BlockingIOError: [Errno 11] Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')
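
For context, a minimal sketch of the setup that triggers this; the parameter names, `loglike`, and `prior_transform` below are hypothetical stand-ins for the actual model in call_dynesty.py:

```python
import numpy as np
import ultranest

# hypothetical stand-ins for the real model in call_dynesty.py
param_names = ["a", "b"]

def prior_transform(cube):
    # map the unit cube to a toy parameter range [0, 10)
    return cube * 10.0

def loglike(params):
    # toy Gaussian log-likelihood
    return -0.5 * np.sum((params - 5.0) ** 2)

# UltraNest keeps an HDF5 point store under log_dir; this works on a
# single core but hits the lock error above when launched under MPI
sampler = ultranest.ReactiveNestedSampler(
    param_names, loglike, prior_transform,
    log_dir="myanalysis", resume=True,
)
result = sampler.run()
```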

erinaldi commented 2 years ago

This is a problem on many file systems on managed clusters. It is an h5py/HDF5 file-locking issue: some cluster file systems do not allow file locking at all.

JohannesBuchner commented 2 years ago

This is part of the HDF5 library. It happens when multiple processes try to access the same file.

Either you have another process running UltraNest on the same log_dir, or you do not have mpi4py installed.
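
To illustrate the failure mode: without a working mpi4py, every launched process believes it is the only (root) process and opens the same HDF5 file for writing, so HDF5's file locking rejects all but the first. A rough sketch of the rank-0-only pattern (not UltraNest's exact internals; the file name is a placeholder):

```python
import h5py

try:
    from mpi4py import MPI
    rank = MPI.COMM_WORLD.Get_rank()
except ImportError:
    # without mpi4py every process falls back to rank 0 ...
    rank = 0

if rank == 0:
    # ... and then all of them try to open the same file for writing,
    # which is what produces the "unable to lock file" error
    fileobj = h5py.File("points.hdf5", "a")
else:
    fileobj = None  # non-root ranks should not touch the file
```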

You can test your MPI setup with:

mpiexec -np 4 python3 -c 'from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size())'

which should print something like (the ranks may appear in any order):

0 4
3 4
1 4
2 4


JohannesBuchner commented 2 years ago

Once you are sure that is all good, see https://github.com/h5py/h5py/issues/1101.
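
If the cluster file system simply does not support locking, a workaround that often comes up for this error on network file systems is to disable HDF5's file locking via the HDF5_USE_FILE_LOCKING environment variable (HDF5 >= 1.10.1). It must be in the environment before the HDF5 file is opened, e.g. exported in the job script, or set at the top of the Python script:

```python
import os

# must be set before HDF5 opens the file; equivalently,
# `export HDF5_USE_FILE_LOCKING=FALSE` in the job submission script
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

import ultranest  # imported afterwards so the setting is already in place
```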

astrojaket commented 2 years ago

It looks like it is a problem with MPI on the cluster. When I try your command I get the following:

(exoplanet) [jtaylor@idefix ~]$ mpiexec -np 4 python -c 'from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size())'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: /usr/local/intel2015//impi/5.0.1.035/intel64/lib/libmpifort.so.12: undefined symbol: MPI_UNWEIGHTED
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[10993,1],1]
  Exit code:    1
--------------------------------------------------------------------------

As a side question: does mpiexec behave the same as mpirun? Using mpirun works with the "csv" storage, but I'm concerned it is just running X independent jobs.

JohannesBuchner commented 2 years ago

I suspect so, but I don't know; check with your cluster admin team. It also depends on the MPI implementation.

You can try switching to a different MPI implementation; the undefined-symbol error usually means mpi4py was compiled against a different MPI library than the one loaded at runtime. Maybe that fixes things.
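
Regarding the side question: one way to check whether mpirun is really connecting the processes (rather than launching X independent copies) is to print the world size from inside your script. With `mpirun -np 4`, a working setup reports a size of 4, while independent jobs each report 1. A quick sketch:

```python
try:
    from mpi4py import MPI
    rank = MPI.COMM_WORLD.Get_rank()
    size = MPI.COMM_WORLD.Get_size()
except ImportError:
    rank, size = 0, 1  # no MPI available: single, independent process

print("MPI rank %d of %d" % (rank, size))
# size == 4 under `mpirun -np 4` means the processes share a communicator;
# size == 1 on every process means they are running as independent jobs
```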

JohannesBuchner commented 2 years ago

I will close this now, but feel free to reopen if you believe there is an UltraNest bug.