Open mtagliazucchi opened 2 weeks ago
It would be better if you do not do MPI within your likelihood (openmp is fine), because ultranest already runs hundreds of likelihood evaluations in parallel, so MPI will just be confused by another parallelisation.
The error is strange. Can you paste the log file from the log_dir?
Hi, thank you for the reply! Unfortunately, I cannot use openmp since in the real scenario I need to parallelize the likelihood over hundreds of cores, meaning over multiple nodes of a cluster.
To be exact, the script doesn't stop: it prints the message I wrote in the first post once per MPI process and then hangs there, so I have to stop it manually.
The .log file content is:
13:49:47 [ultranest] [DEBUG] ReactiveNestedSampler: dims=1+0, resume=False, log_dir=ultranest/run1, backend=hdf5, vectorized=True, nbootstraps=30, ndraw=128..65536
13:49:47 [ultranest] [INFO] Sampling 400 live points from prior ...
It seems you are resuming and this line is causing the issue:
https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/integrator.py#L615
Maybe it should be 0. Maybe you can try and put some prints (with self.mpi_size and self.mpi_rank) into the function.
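For instance, you could check what MPI layout ultranest detects on each process with something like this (a rough standalone sketch, inspecting the same attributes from outside the function; the toy likelihood and prior here are just placeholders):

# run with e.g.: mpirun -np 4 python check_mpi.py
import ultranest

sampler = ultranest.ReactiveNestedSampler(
    ["x"],
    lambda theta: -0.5 * (theta[:, 0] ** 2),   # toy vectorized likelihood
    lambda cube: cube,                         # identity prior transform
    vectorized=True)
print("ultranest sees mpi_rank=%d, mpi_size=%d, use_mpi=%s"
      % (sampler.mpi_rank, sampler.mpi_size, sampler.use_mpi), flush=True)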
I don't quite see the benefit of using MPI within your likelihood when you can have ultranest parallelise the likelihood, which will save communication overhead. Are there memory constraints that make you prefer MPI parallelisation within the likelihood and serial evaluation by ultranest?
I'm not sure whether ultranest is capable of parallelizing the likelihood the way I do in the example code.
In the above example, which still reflects a real-case scenario, the computational bottleneck is the size of the data being considered. I therefore split the data array into chunks, compute the likelihood on each chunk, and then combine the results (a sketch of this structure is given below). Can ultranest's parallelization do this?
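For concreteness, the structure being asked about is roughly the following (a minimal sketch with placeholder names and a stand-in Gaussian term; the real per-chunk computation is far more expensive):

import numpy as np

def loglike_chunk(theta, data_chunk):
    # theta: (n_points, 1) array of proposed parameters; data_chunk: one slice of the data
    # stand-in Gaussian term for the expensive per-chunk computation
    return -0.5 * np.sum((data_chunk[None, :] - theta) ** 2, axis=1)

def loglike(theta, data, n_chunks=4):
    # split the data, evaluate each chunk, and sum the partial log-likelihoods
    chunks = np.array_split(data, n_chunks)
    return sum(loglike_chunk(theta, c) for c in chunks)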
I have an update on my problem. If I remove the lines in the original code
# try to disable ultranest MPI parallelization
sampler.use_mpi=False
sampler.mpi_size=1
sampler.mpi_rank=0
the assertion error disappears, but then the code gets stuck here:
14:44:47 [ultranest] [DEBUG] ReactiveNestedSampler: dims=1+0, resume=False, log_dir=ultranest/run2, backend=hdf5, vectorized=True, nbootstraps=30, ndraw=128..65536
14:44:47 [ultranest] [INFO] Sampling 400 live points from prior ...
14:44:48 [ultranest] [DEBUG] Found plateau of 4/400 initial points at L=-97.371. Avoid this by a continuously increasing loglikelihood towards good regions.
14:44:48 [ultranest] [INFO] Widening roots to 403 live points (have 400 already) ...
14:44:48 [ultranest] [INFO] Sampling 3 live points from prior ...
It seems you are resuming and this line is causing the issue: https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/integrator.py#L615 Maybe it should be 0. Maybe you can try and put some prints (with self.mpi_size and self.mpi_rank) into the function.
I tried setting the value on that line to zero, and also on line 1516 (with the options sampler.use_mpi=False, sampler.mpi_size=1, sampler.mpi_rank=0 as in the first case). The code still doesn't work and now fails here:
File "/home/mt/softwares/chimera_dev/CHIMERA_JAX/examples/test/test_mpi_like_generic.py", line 91, in <module>
result = sampler.run()
^^^^^^^^^^^^^
File "/home/mt/softwares/miniconda3/envs/chimeragw-stable/lib/python3.11/site-packages/ultranest/integrator.py", line 2560, in run_iter
for _result in self.run_iter(
File "/home/mt/softwares/miniconda3/envs/chimeragw-stable/lib/python3.11/site-packages/ultranest/integrator.py", line 2560, in run_iter
self._widen_roots_beyond_initial_plateau(
File "/home/mt/softwares/miniconda3/envs/chimeragw-stable/lib/python3.11/site-packages/numpy/core/fromnumeric.py", line 2953, in min
Lmin = np.min(Ls)
^^^^^^^^^^
ValueError: zero-size array to reduction operation minimum which has no identity
return _wrapreduction(a, np.minimum, 'min', axis, None, out,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mt/softwares/miniconda3/envs/chimeragw-stable/lib/python3.11/site-packages/numpy/core/fromnumeric.py", line 88, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: zero-size array to reduction operation minimum which has no identity
Hi!
I'm trying to use UltraNest with a very expensive likelihood whose evaluation at a single point of the parameter space needs to be parallelized using MPI. The likelihood is automatically vectorized. However, if I'm not mistaken, UltraNest itself uses MPI to parallelize some internal computation / live-point proposals, and this conflict causes a bug in my program. Here's a dummy code that reproduces a heavy likelihood parallelized using MPI:
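A minimal sketch of such a likelihood (placeholder names, synthetic data, and a simple Gaussian model standing in for the expensive computation):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# each rank holds only its own chunk of the (here synthetic) data set
full_data = np.random.default_rng(42).normal(1.0, 1.0, 100_000)
my_chunk = np.array_split(full_data, size)[rank]

def loglike(theta):
    # vectorized: theta has shape (n_points, 1)
    # keep all ranks on the root's points, evaluate the local chunk, then sum across ranks
    theta = comm.bcast(theta, root=0)
    partial = -0.5 * np.sum((my_chunk[None, :] - theta) ** 2, axis=1)
    return comm.allreduce(partial, op=MPI.SUM)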
I was able to use this likelihood with emcee in the following way (sketched below). The idea of the script is that each MPI process runs its own emcee sampler, so that every rank calls the likelihood function; this is necessary because otherwise the non-root ranks would never compute the likelihood on their chunk of data. However, only the root process stores the results and prints the progress. Also, inside the likelihood function, the parameters used by each rank are forced to be those of the root, for consistency between the samplers. This example works and gives a substantial speed-up.
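Roughly, the emcee driver described above looks like this (again a sketch with placeholder data and settings, reusing the same chunked MPI likelihood):

import numpy as np
import emcee
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

full_data = np.random.default_rng(42).normal(1.0, 1.0, 100_000)
my_chunk = np.array_split(full_data, size)[rank]

def log_prob(theta):
    # force every rank to evaluate the root's parameters, so the samplers stay in sync
    theta = comm.bcast(theta, root=0)
    partial = -0.5 * np.sum((my_chunk - theta[0]) ** 2)
    return comm.allreduce(partial, op=MPI.SUM)

ndim, nwalkers, nsteps = 1, 8, 200
p0 = comm.bcast(np.random.default_rng(0).normal(1.0, 0.1, (nwalkers, ndim)), root=0)

# every rank runs its own sampler (so every rank keeps calling log_prob),
# but only the root shows progress and reports the results
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, nsteps, progress=(rank == 0))
if rank == 0:
    print(sampler.get_chain(flat=True).mean(axis=0))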
I can't produce anything similar with UltraNest. I tried with this code here (sketched below):
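The failing UltraNest setup is roughly along these lines (a sketch with placeholder names; the use_mpi / mpi_size / mpi_rank lines are the ones quoted above in this thread):

import numpy as np
from mpi4py import MPI
import ultranest

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

full_data = np.random.default_rng(42).normal(1.0, 1.0, 100_000)
my_chunk = np.array_split(full_data, size)[rank]

def loglike(theta):
    # vectorized: theta has shape (n_points, 1)
    theta = comm.bcast(theta, root=0)
    partial = -0.5 * np.sum((my_chunk[None, :] - theta) ** 2, axis=1)
    return comm.allreduce(partial, op=MPI.SUM)

def prior_transform(cube):
    return 4.0 * cube - 2.0   # flat prior on [-2, 2]

sampler = ultranest.ReactiveNestedSampler(
    ["mu"], loglike, prior_transform,
    vectorized=True, log_dir="ultranest/run1")
# try to disable ultranest MPI parallelization
sampler.use_mpi = False
sampler.mpi_size = 1
sampler.mpi_rank = 0
result = sampler.run()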
The program runs when started with a single MPI process (mpirun -np 1 python test.py), but fails for np > 1 with an error. Does anyone know a workaround for this problem? Thanks!