JohannesBuchner / UltraNest

Fit and compare complex models reliably and rapidly. Advanced nested sampling.
https://johannesbuchner.github.io/UltraNest/

num_live_points_missing error when running ultranest in parallel via mpiexec #125

Closed by abhimat 4 months ago

abhimat commented 4 months ago

Description

When running ultranest in parallel via mpiexec on my own sampling code, I encounter the following error:

assert num_live_points_missing >= 0

What I Did

This error appears to only happen when I run my sampling code in parallel via mpiexec.

Where does the num_live_points_missing error originate, and is there any special way I am handling my likelihood or priors that might cause this to only pop up when trying parallelization via mpiexec?

The full error and traceback is copied below:

Traceback (most recent call last):
  File "./run_un_fit.py", line 305, in <module>
    result = sampler.run(
  File "/Users/abhimatlocal/software/miniforge3/envs/phoebe_py38/lib/python3.8/site-packages/ultranest/integrator.py", line 2373, in run
    for result in self.run_iter(
  File "/Users/abhimatlocal/software/miniforge3/envs/phoebe_py38/lib/python3.8/site-packages/ultranest/integrator.py", line 2472, in run_iter
    self._widen_roots_beyond_initial_plateau(
  File "/Users/abhimatlocal/software/miniforge3/envs/phoebe_py38/lib/python3.8/site-packages/ultranest/integrator.py", line 1419, in _widen_roots_beyond_initial_plateau
    self._widen_roots(nroots_needed)
  File "/Users/abhimatlocal/software/miniforge3/envs/phoebe_py38/lib/python3.8/site-packages/ultranest/integrator.py", line 1508, in _widen_roots
    assert num_live_points_missing >= 0
AssertionError

Thank you!

JohannesBuchner commented 4 months ago

Please show your run() arguments, and use this link https://johannesbuchner.github.io/UltraNest/debugging.html#Finding-model-bugs to print out the likelihoods of a few thousand samples.
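For reference, a minimal sketch of that check (the prior_transform and loglike below are toy stand-ins; in practice you would plug in your own transform and likelihood callbacks):

import numpy as np

ndim = 10

def prior_transform(cube):
    # toy stand-in: uniform priors on [-10, 10] for every parameter
    return cube * 20 - 10

def loglike(params):
    # toy stand-in: a simple Gaussian log-likelihood
    return -0.5 * np.sum(params**2)

# draw a few thousand points from the unit cube, push them through the
# prior transform, and evaluate the log-likelihood at each point
u = np.random.uniform(size=(4000, ndim))
L = np.array([loglike(prior_transform(ui)) for ui in u])

# look for NaNs, infinities, and large plateaus of exactly repeated values
print('any NaN:', np.isnan(L).any())
print('any +inf:', np.isposinf(L).any())
print('unique values:', np.unique(L).size, 'of', L.size)
print('min / max:', L.min(), L.max())

Since the assertion in your traceback is raised from the initial-plateau handling, a large number of exactly identical log-likelihood values would be worth noting.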

Do you have many more cores than live points?

abhimat commented 4 months ago

Here are my setup and run calls:

sampler = ultranest.ReactiveNestedSampler(
    param_names,
    loglike=un_evaluate,
    transform=param_priors.prior_transform_ultranest,
    log_dir='./un_out/',
    resume='resume',
    storage_backend='csv',
)

result = sampler.run(
    show_status=True,
    min_num_live_points=400,
)

I encounter the same error when I remove the min_num_live_points keyword from run(), and when I run with anywhere from 2 to 10 cores under mpiexec. In this particular run, I have 10 parameters.

abhimat commented 4 months ago

Thanks also for suggesting the likelihood tests. Trying them, the likelihoods appear reasonable. Here is a small subset from the entire run; I do not see any pattern of repeats or infinities:

[-664.8165481871674, -146.37418131046311, -11530.169660318177, -5346.746322528828, -253.4891123786993, -75.01568272157368, -2896.156242559588, -779.6444614131246, -26285.918677166206, -5288.06128830499, -78.06933285460897, -2341.7947872999653, -63.476539910330004, -5632.973315533837, -8215.620784556662, -12698.888294210228, -6604.00248930183, -1969.7067769495002, -2396.804783660609, -50.76145370937656, -34.972670203312624, -33.91679500550069, -1626.704536398551, -24874.69297577541, -11171.81194072132, -5575.902974800825, -346.28907797493287, -62.136840045871836, -1649.2920302625416, -4586.553885322061, -5676.509662166807, -1571.6418583865625, -3130.086958082822, -351.5394947071074, -13303.92677013245, -125.370340655367, -1394.592065480218, -312.08966596774826, -10.90495856246445, -795.2525840888214, -297.52317860875905, -1141.0829664148757, -9862.26648466667, -1e+300, -3406.7166054693043, -6170.172174121924, -1419.523120484545, -55.1889347680532, -1e+300, -3060.487459156065, -830.4860883353762, -1255.3527291448627, -196.94525358713966, -8242.68908329844, -3604.8302006889985, -2662.1616046621107, -2292.512103082137, -29410.201246074415, -273.1948201844567, -6138.449249781624, -10399.99816954593, -20.093680479665714, -4541.299718136764, -2220.3594106822075, -3139.505712897866, -91.55281874564254, -4780.225745817035, -3399.6266360832296, -1e+300, -790.3455195954318, -6915.568117680447, -25930.06312803873, -2090.365388377356, -6831.953090239464, -1405.796804100345, -6114.780343826242, -6824.592196129843, -1540.599259622464, -2718.759843796995, -4977.375003362302, -263.8516747819208, -2686.04725116381, -4035.0795579114633, -481.7597751827929, -11.035475124832088, -1043.5243773864559, -262.0773977960295, -57.15860130670064, -941.8958138054883, -4330.534356946371, -39819.49281791721, -79.07669640721905, -1247.2711149018344, -504.09082938460995, -1926.8178661443985, -6441.760230898692, -1316.5249265610996, -496.47028460157884, -5143.7909683410035, -1114.9759878914924

abhimat commented 4 months ago

From my debugging, the error appears to be related to running with MPI: the first process obtains num_live_points_missing correctly, but the other processes obtain an empty list after the following line (line 1503 in ReactiveNestedSampler._widen_roots() in ultranest/integrator.py):

num_live_points_missing = self.comm.bcast(num_live_points_missing, root=0)

JohannesBuchner commented 4 months ago

That's ... not how MPI should behave. You could try to reproduce this with a small test, for example the first example under https://mpi4py.readthedocs.io/en/stable/tutorial.html#collective-communication
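For concreteness, a broadcast test along the lines of that tutorial example could look like this (save it as, say, test_bcast.py and launch it with mpiexec -n 4 python3 test_bcast.py; the file name is just an example):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {'key1': [7, 2.72, 2 + 3j], 'key2': ('abc', 'xyz')}
else:
    data = None

# every rank should end up with the same dictionary after the broadcast
data = comm.bcast(data, root=0)
print('rank', rank, 'received:', data)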

Maybe try activating a different MPI implementation on your server and see if that helps.

abhimat commented 4 months ago

Yes, it's pretty strange! The mpi4py tutorial example works fine on my server with 2, 4, and 10 processes (i.e., the data variable appears in all process ranks, not just the first one). Since that example works, and the gauss.py example in the ultranest docs also works with MPI, there may be something odd in how my likelihood evaluation interacts with MPI and ultranest…
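For comparison, a minimal MPI-clean ultranest run along the lines of that gaussian example (the parameter names, priors, and data here are made up purely for illustration) can be launched the same way:

import numpy as np
import ultranest

# made-up toy problem for illustration only
param_names = ['mu', 'sigma']
data = np.array([0.5, 1.0, 1.5, 2.0])

def prior_transform(cube):
    params = cube.copy()
    params[0] = cube[0] * 20 - 10        # mu: uniform on [-10, 10]
    params[1] = cube[1] * 9.9 + 0.1      # sigma: uniform on [0.1, 10]
    return params

def loglike(params):
    mu, sigma = params
    return -0.5 * np.sum(((data - mu) / sigma) ** 2
                         + np.log(2 * np.pi * sigma ** 2))

# run serially or in parallel, e.g. mpiexec -n 4 python3 this_script.py
sampler = ultranest.ReactiveNestedSampler(param_names, loglike, prior_transform)
result = sampler.run(min_num_live_points=400)
sampler.print_results()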

I'll continue investigating this more, thanks for the feedback so far!

abhimat commented 4 months ago

After investigating a bit more, the problem appears to arise when the seed variable is set in the _setup_distributed_seeds() function, which ultimately triggers the later errors I see…

In the rank-0 process, seed is set appropriately, but in all other processes seed is set to a dictionary, {'worker_command': 'release'}. That causes an error which is caught by the try statement around the MPI import in ReactiveNestedSampler(), so every process falls through to the except branch and reports MPI rank 0, leading to the later issues.
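In pseudocode, the pattern being described looks roughly like this (a simplified paraphrase, not the exact UltraNest source):

import numpy as np

try:
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    mpi_size = comm.Get_size()
    mpi_rank = comm.Get_rank()

    # the root process picks a seed and broadcasts it to all workers
    seed = 42 if mpi_rank == 0 else None
    seed = comm.bcast(seed, root=0)

    # if a worker received {'worker_command': 'release'} instead of an
    # integer, this line raises and the process drops into the fallback
    np.random.seed(seed + mpi_rank)
except Exception:
    # serial fallback: every process that lands here believes it is rank 0
    comm = None
    mpi_size = 1
    mpi_rank = 0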

This only happens when I use ultranest and MPI with my own sampler setup, not with the gauss.py example code in the docs, so I'm still investigating what in my usage causes this seed variable to misbehave under MPI…!

abhimat commented 4 months ago

I have determined the issue. I was using the PHOEBE package as part of my computation, and it sets up its own MPI options during import. Even if I turn off PHOEBE's option to use MPI, it still broadcasts some variables through MPI whenever it detects that it is part of an MPI run, and that was interfering with ultranest! Removing that broadcast in PHOEBE fixes the issue. It seems to be a bug in their package, and I will file a bug report with them.
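To illustrate the kind of collision being described (a hypothetical, simplified sketch; the control message below is taken from my earlier comment, everything else is made up):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # "library A" (e.g. a modelling package imported first) broadcasts a
    # control message to its workers during setup
    comm.bcast({'worker_command': 'release'}, root=0)
    # "library B" (e.g. the sampler) then broadcasts its seed
    comm.bcast(42, root=0)
else:
    # this worker never takes part in library A's broadcast (its MPI mode
    # was turned off), so the first bcast it posts belongs to library B,
    # and it pairs with library A's message instead of the seed
    seed = comm.bcast(None, root=0)
    print('rank', rank, 'got seed =', seed)   # {'worker_command': 'release'}
    # receive the remaining broadcast so this toy example exits cleanly
    comm.bcast(None, root=0)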

Thank you for your quick and helpful responses! I will close this issue since it is definitely not an issue with ultranest!