Closed abhimat closed 4 months ago
Show your run() arguments and use this link https://johannesbuchner.github.io/UltraNest/debugging.html#Finding-model-bugs to print out the likelihoods of a few thousand samples.
Do you have many more cores than live points?
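The likelihood check suggested above can be sketched as follows — a minimal, self-contained version with hypothetical stand-ins for the user's actual `loglike` and `prior_transform` (replace these with your own functions):

```python
import numpy as np

# Hypothetical stand-ins for illustration only; substitute your own
# prior transform and log-likelihood here.
def prior_transform(cube):
    return 10.0 * cube - 5.0          # map the unit cube to [-5, 5] per axis

def loglike(params):
    return -0.5 * np.sum(params**2)   # simple Gaussian for illustration

rng = np.random.default_rng(1)
ndim = 10
samples = rng.uniform(size=(1000, ndim))   # random points in the unit cube
logls = [loglike(prior_transform(u)) for u in samples]

# Sanity checks: no NaNs/infinities, and some spread in the values.
assert np.all(np.isfinite(logls))
print(min(logls), max(logls))
```

If this passes but the sampler still fails, the problem is more likely in the parallel setup than in the model.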
Here are my setup and run calls:
```python
sampler = ultranest.ReactiveNestedSampler(
    param_names,
    loglike=un_evaluate,
    transform=param_priors.prior_transform_ultranest,
    log_dir='./un_out/',
    resume='resume',
    storage_backend='csv',
)

result = sampler.run(
    show_status=True,
    min_num_live_points=400,
)
```
I encounter the same error when I take out the `min_num_live_points` keyword to `run()`. I also encounter the same issue when I run with as few as 2 cores via mpiexec and with as many as 10 cores. In this particular run, I have 10 parameters.
Thanks for also suggesting the likelihood tests. Trying that, the likelihoods do appear reasonable. Here's a small subset from the entire run; I do not see any patterns of repeats or infinities:
[-664.8165481871674, -146.37418131046311, -11530.169660318177, -5346.746322528828, -253.4891123786993, -75.01568272157368, -2896.156242559588, -779.6444614131246, -26285.918677166206, -5288.06128830499, -78.06933285460897, -2341.7947872999653, -63.476539910330004, -5632.973315533837, -8215.620784556662, -12698.888294210228, -6604.00248930183, -1969.7067769495002, -2396.804783660609, -50.76145370937656, -34.972670203312624, -33.91679500550069, -1626.704536398551, -24874.69297577541, -11171.81194072132, -5575.902974800825, -346.28907797493287, -62.136840045871836, -1649.2920302625416, -4586.553885322061, -5676.509662166807, -1571.6418583865625, -3130.086958082822, -351.5394947071074, -13303.92677013245, -125.370340655367, -1394.592065480218, -312.08966596774826, -10.90495856246445, -795.2525840888214, -297.52317860875905, -1141.0829664148757, -9862.26648466667, -1e+300, -3406.7166054693043, -6170.172174121924, -1419.523120484545, -55.1889347680532, -1e+300, -3060.487459156065, -830.4860883353762, -1255.3527291448627, -196.94525358713966, -8242.68908329844, -3604.8302006889985, -2662.1616046621107, -2292.512103082137, -29410.201246074415, -273.1948201844567, -6138.449249781624, -10399.99816954593, -20.093680479665714, -4541.299718136764, -2220.3594106822075, -3139.505712897866, -91.55281874564254, -4780.225745817035, -3399.6266360832296, -1e+300, -790.3455195954318, -6915.568117680447, -25930.06312803873, -2090.365388377356, -6831.953090239464, -1405.796804100345, -6114.780343826242, -6824.592196129843, -1540.599259622464, -2718.759843796995, -4977.375003362302, -263.8516747819208, -2686.04725116381, -4035.0795579114633, -481.7597751827929, -11.035475124832088, -1043.5243773864559, -262.0773977960295, -57.15860130670064, -941.8958138054883, -4330.534356946371, -39819.49281791721, -79.07669640721905, -1247.2711149018344, -504.09082938460995, -1926.8178661443985, -6441.760230898692, -1316.5249265610996, -496.47028460157884, -5143.7909683410035, -1114.9759878914924
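As an aside, a few entries in the dump are exactly -1e+300, a value some likelihood implementations use as a finite stand-in for -inf. A quick scan for such sentinel or non-finite values (using a handful of the reported numbers for illustration) might look like:

```python
import math

# A few of the reported values, including the suspicious -1e+300 entries
# (often used as a finite stand-in for -inf in likelihood code).
logls = [-664.8165481871674, -146.37418131046311, -1e+300,
         -3406.7166054693043, -1e+300, -55.1889347680532]

# Flag anything non-finite or implausibly low.
suspect = [x for x in logls if not math.isfinite(x) or x < -1e+290]
print(len(suspect), "suspiciously low or non-finite values")
```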
From my debugging, the error appears to be related to running with MPI: the first process obtains `num_live_points_missing` correctly, but the other processes obtain an empty list after the following line (line 1503) of `ReactiveNestedSampler._widen_roots()` in `ultranest/integrator.py`:

```python
num_live_points_missing = self.comm.bcast(num_live_points_missing, root=0)
```
That's ... not how MPI should behave. You could try to reproduce this with a small test, for example the first in https://mpi4py.readthedocs.io/en/stable/tutorial.html#collective-communication
Maybe try activating a different MPI implementation on your server and see if that helps.
Yes, it's pretty strange! The mpi4py tutorial example works fine on my server with 2, 4, and 10 processes (i.e., the `data` variable appears in all process ranks, not just the first one). Since the mpi4py example works okay, and the gauss.py example in the ultranest docs works okay with MPI, there may be something odd in how my likelihood evaluation is calculated and a possible interaction between MPI and ultranest…
I'll continue investigating this more, thanks for the feedback so far!
After investigating a bit more, the issue appears to arise when setting the `seed` variable in the `_setup_distributed_seeds()` function, which ultimately triggers the later issues I see.

In the rank 0 process, `seed` is set appropriately, but in all other processes `seed` is set to a dictionary, `{'worker_command': 'release'}`, which causes an error that is caught by the `try` statement around the MPI import in `ReactiveNestedSampler()`. That ultimately leads all processes to report MPI rank 0 (from the `except` branch after that mpi4py import statement), causing the later issues.
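The rank fall-back described above follows a common guarded-import pattern; a rough sketch (not ultranest's exact code) shows why any exception in the MPI block leaves every process believing it is rank 0 of 1:

```python
# Rough sketch of a guarded MPI import (illustrative, not ultranest's
# exact code): if anything in the try block raises, the process silently
# assumes it is running alone, so every worker reports rank 0.
try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()
except Exception:
    comm = None
    rank = 0
    size = 1

print("rank", rank, "of", size)
```

This pattern degrades gracefully when mpi4py is absent, but it also hides genuine MPI failures, which is why a corrupted broadcast can silently demote every worker to rank 0.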
This only happens when I use ultranest and MPI with my sampler setup, but not when I run the gauss.py example from the docs with ultranest and MPI, so I'm still investigating what about my usage causes this `seed` variable to misbehave with MPI…!
I have determined the issue. I was using the PHOEBE package as part of my computation, and it has its own MPI options that get set up during import. Even if I turn off PHOEBE's option to use MPI, if it detects that it is part of an MPI run it still broadcasts some variables through MPI, which was interfering with ultranest! Removing that broadcast in PHOEBE fixes the issue. It clearly seems like a bug in their package, and I will file a bug report with them.
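This failure mode is consistent with MPI messages being matched by call order, not by content: if a third-party import injects an extra broadcast on some ranks, a later receive picks up the wrong payload. A pure-Python simulation of that ordering problem (no MPI required; the seed value 42 is purely illustrative):

```python
from collections import deque

# Simulate the message stream a worker rank sees from rank 0.
# A third-party import (standing in for PHOEBE here) pushes an extra
# message before the seed broadcast goes out.
stream = deque()
stream.append({'worker_command': 'release'})  # stray broadcast at import time
stream.append(42)                             # the seed actually intended

# The worker posts only one receive (for the seed), so messages are
# matched purely by order and it gets the stray payload instead.
seed = stream.popleft()
print("worker received:", seed)
```

This mirrors the symptom reported above: the workers see `{'worker_command': 'release'}` where an integer seed was expected.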
Thank you for your quick and helpful responses! I will close this issue since it is definitely not an issue with ultranest!
Description
When running ultranest in parallel via mpiexec on my own sampling code, I encounter the following error:
What I Did
This error appears to only happen when I run my sampling code in parallel via the mpiexec command (I tried this code, as shown here in the docs); when I do, the num_live_points_missing error shows up.
Where does the num_live_points_missing error originate, and is there any special way I am handling my likelihood or priors that might cause this to only pop up when trying parallelization via mpiexec?
Here is the full error and traceback copied below:
Thank you!