This is a problem for many file systems on managed clusters. It is an h5py/HDF5 issue related to file locking: the HDF5 library tries to lock files it opens, some cluster file systems do not allow that, and the error shows up when multiple processes try to access the same file.
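If the cluster file system simply does not allow locking, one common workaround (an HDF5 environment variable, nothing UltraNest-specific) is to switch file locking off before h5py opens any file, along these lines:

import os
# Must be set before h5py/HDF5 opens any file (supported by HDF5 >= 1.10).
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"
import ultranest  # import h5py-using packages only after setting the variable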
Either you have another process running UltraNest on the same log_dir, or you do not have mpi4py installed.
You can test your MPI setup with:
mpiexec -np 4 python3 -c 'from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size())'
which should give something similar to:
0 4
3 4
1 4
2 4
:
Once you are sure that is all good, see https://github.com/h5py/h5py/issues/1101.
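As a smoke test once mpi4py works, a minimal parallel UltraNest run could look like the sketch below; the toy Gaussian likelihood, parameter names and log_dir are placeholders, and storage_backend is the keyword the hdf5/csv choice maps to, if I recall the API correctly. UltraNest picks up MPI automatically through mpi4py, so nothing extra is needed in the script.

import numpy as np
import ultranest

def loglike(params):
    # toy Gaussian likelihood centred on 0.5
    return -0.5 * np.sum(((params - 0.5) / 0.1) ** 2)

def transform(cube):
    # flat priors on the unit cube
    return cube

sampler = ultranest.ReactiveNestedSampler(
    ["a", "b"], loglike, transform,
    log_dir="mpi_smoke_test", resume="overwrite",
    storage_backend="hdf5")  # "csv" or "tsv" avoid the HDF5 point store
sampler.run(min_num_live_points=100)
sampler.print_results()

Launching this with mpiexec -np 4 python3 on the saved script should reproduce the locking error if it is still present, and finish within seconds otherwise.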
It looks like it is a problem with MPI on the cluster. When I try your command, I get the following:
(exoplanet) [jtaylor@idefix ~]$ mpiexec -np 4 python -c 'from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size())'
Traceback (most recent call last):
File "<string>", line 1, in <module>
ImportError: /usr/local/intel2015//impi/5.0.1.035/intel64/lib/libmpifort.so.12: undefined symbol: MPI_UNWEIGHTED
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[10993,1],1]
Exit code: 1
--------------------------------------------------------------------------
As a side question: does mpiexec run the same as mpirun? Using mpirun works with the csv storage style, but I am concerned it is just running X independent jobs.
I suspect it does, but I don't know; check with your cluster admin team. It also depends on the MPI implementation.
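A quick way to check is to rerun the mpi4py one-liner from above with mpirun instead of mpiexec: if every process reports a communicator size of 1, mpirun is starting X independent copies rather than one coupled MPI job. The same check as a small script (mpi_check.py is just a placeholder name):

# mpi_check.py -- run with: mpirun -np 4 python3 mpi_check.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
print(f"rank {rank} of {size}")

# In a coupled MPI job, size equals the number of launched processes.
if size == 1:
    print("WARNING: no MPI peers visible; the processes are running independently.")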
You can try switching the MPI implementation; maybe that fixes things.
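For example, if the environment is managed with conda (the prompt above suggests it is), one option is to install the conda-forge MPICH build alongside mpi4py rather than linking against the system Intel MPI; mpi4py can also be rebuilt against a specific compiler wrapper (the mpicc path is a placeholder):

conda install -c conda-forge mpich mpi4py
# or rebuild mpi4py against a chosen MPI:
env MPICC=/path/to/mpicc python3 -m pip install --force-reinstall --no-binary mpi4py mpi4py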
I will close this now, but feel free to reopen if you believe there is an UltraNest bug.
Description
I have been trying to use UltraNest with the h5py (HDF5) storage mode. On a single core it works fine, but when I switch to MPI I get the error I have presented below. I do not get these issues when I switch the storage to "csv". Do you know how to solve this?