JohannesBuchner / UltraNest

Fit and compare complex models reliably and rapidly. Advanced nested sampling.
https://johannesbuchner.github.io/UltraNest/

`RuntimeWarning: invalid value encountered in divide` seems to crash UltraNest #130

Closed by timothygebhard 6 months ago

timothygebhard commented 6 months ago

Description

Hi @JohannesBuchner!

Sorry to bother you again, but I am encountering some unexpected behaviors with UltraNest that I can't seem to figure out by myself. Any help would be greatly appreciated — thanks a lot in advance!

What I Did

When I launch a nested sampling run with UltraNest using the standard ReactiveNestedSampler interface and MLFriends as the region class (switching to RobustEllipsoidSampler makes no difference at all), it almost immediately crashes with:

[ultranest] Sampling 64 live points from prior ...
[ultranest] Widening roots to 71 live points (have 64 already) ...
[ultranest] Sampling 7 live points from prior ...
Traceback (most recent call last):
  File "/lustre/home/tgebhard/projects/fm4ar/scripts/nested_sampling/run_nested_sampling.py", line 261, in <module>
    runtime = sampler.run(
  File "/lustre/home/tgebhard/projects/fm4ar/fm4ar/nested_sampling/samplers.py", line 710, in run
    self.sampler.run(
  File "/home/tgebhard/.virtualenvs/fm4ar/lib/python3.10/site-packages/ultranest/integrator.py", line 2373, in run
    for result in self.run_iter(
  File "/home/tgebhard/.virtualenvs/fm4ar/lib/python3.10/site-packages/ultranest/integrator.py", line 2472, in run_iter
    self._widen_roots_beyond_initial_plateau(
  File "/home/tgebhard/.virtualenvs/fm4ar/lib/python3.10/site-packages/ultranest/integrator.py", line 1419, in _widen_roots_beyond_initial_plateau
    self._widen_roots(nroots_needed)
  File "/home/tgebhard/.virtualenvs/fm4ar/lib/python3.10/site-packages/ultranest/integrator.py", line 1559, in _widen_roots
    self.build_tregion = not is_affine_transform(active_u, active_v)
  File "/home/tgebhard/.virtualenvs/fm4ar/lib/python3.10/site-packages/ultranest/utils.py", line 336, in is_affine_transform
    slopes = (b2 - b1) / (a2 - a1)
RuntimeWarning: invalid value encountered in divide

<The warning is repeated as many times as there are processes.>

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[35060,1],1]
  Exit code:    1
--------------------------------------------------------------------------

<Execution terminates>

The debug.log only contains:

20:37:24 [ultranest] [DEBUG] ReactiveNestedSampler: dims=16+0, resume=True, log_dir=/home/tgebhard/projects/fm4ar/experiments/nested-sampling/ultranest/sigma-0.125754__R-400__4k, backend=hdf5, vectorized=False, nbootstraps=30, ndraw=128..65536
20:37:30 [ultranest] [INFO] Sampling 64 live points from prior ...
20:37:43 [ultranest] [DEBUG] Found plateau of 8/64 initial points at L=-1.48792e+09. Avoid this by a continuously increasing loglikelihood towards good regions.
20:37:43 [ultranest] [INFO] Widening roots to 71 live points (have 64 already) ...
20:37:43 [ultranest] [INFO] Sampling 7 live points from prior ...

The above traceback was obtained from running with mpiexec, but running the wrapper script directly (i.e., without parallelization) makes no difference: it crashes with the same error.

Changing the number of live points also seems to have no real effect (I tried various values between 1 and 10k). Moreover, it does not make any difference if I try to resume a (crashed) run or start with a clean working directory.

My own attempts to understand this have led me to the warning_errors=True flag for Cython in mlfriends.pyx: I suspected that this might turn the RuntimeWarning into a proper error, which then crashes the script. However, when I remove this from the *.pyx file and reinstall UltraNest from source, the problem still persists (I don't have much experience with Cython, though, so maybe I did something wrong here?).

FWIW, running the exact same code on a smaller instance of my problem (which I obtain by fixing some of my target parameters to their "true" values) works without any issues. Also, any likelihood plateaus that the sampler encounters are "real" in the sense that I am not enforcing any constraints on the parameters by returning very low likelihoods for illegal combinations.

Again, thanks a lot in advance for any help with this! 🙂

JohannesBuchner commented 6 months ago

Can you edit ultranest/integrator.py line 1559 to print out active_u and active_v there before it crashes?

Can it be that you have a weird transform which collapses parameters onto the same value (e.g., rounding)?

JohannesBuchner commented 6 months ago

warning_errors=True is related to compilation warnings, I think.

timothygebhard commented 6 months ago

> Can you edit ultranest/integrator.py line 1559 to print out active_u and active_v there before it crashes?

Here is the output:

active_u: [[0.00695213 0.5107473  0.417411   0.22210781 0.11986537 0.33761517
  0.9429097  0.32320293 0.51879062 0.70301896 0.3636296  0.97178208
  0.96244729 0.2517823  0.49724851 0.30087831]
 [0.00695213 0.5107473  0.417411   0.22210781 0.11986537 0.33761517
  0.9429097  0.32320293 0.51879062 0.70301896 0.3636296  0.97178208
  0.96244729 0.2517823  0.49724851 0.30087831]
 [0.00695213 0.5107473  0.417411   0.22210781 0.11986537 0.33761517
  0.9429097  0.32320293 0.51879062 0.70301896 0.3636296  0.97178208
  0.96244729 0.2517823  0.49724851 0.30087831]
 [0.00695213 0.5107473  0.417411   0.22210781 0.11986537 0.33761517
  0.9429097  0.32320293 0.51879062 0.70301896 0.3636296  0.97178208
  0.96244729 0.2517823  0.49724851 0.30087831]
 [0.00695213 0.5107473  0.417411   0.22210781 0.11986537 0.33761517
  0.9429097  0.32320293 0.51879062 0.70301896 0.3636296  0.97178208
  0.96244729 0.2517823  0.49724851 0.30087831]
 [0.00695213 0.5107473  0.417411   0.22210781 0.11986537 0.33761517
  0.9429097  0.32320293 0.51879062 0.70301896 0.3636296  0.97178208
  0.96244729 0.2517823  0.49724851 0.30087831]
 [0.00695213 0.5107473  0.417411   0.22210781 0.11986537 0.33761517
  0.9429097  0.32320293 0.51879062 0.70301896 0.3636296  0.97178208
  0.96244729 0.2517823  0.49724851 0.30087831]]

active_v: [[ 1.10428196e-01  3.22419077e-02 -2.24330097e+00 -1.56704423e+00
  -1.90444429e+00  3.37615171e+00  1.25432776e+01  1.68024572e+00
   3.81576718e+00  1.67332085e+00  1.02725920e+03  9.71782083e-01
   9.62447295e-01  2.51782296e-01  1.49724851e+00  3.00878310e-01]
 [ 1.10428196e-01  3.22419077e-02 -2.24330097e+00 -1.56704423e+00
  -1.90444429e+00  3.37615171e+00  1.25432776e+01  1.68024572e+00
   3.81576718e+00  1.67332085e+00  1.02725920e+03  9.71782083e-01
   9.62447295e-01  2.51782296e-01  1.49724851e+00  3.00878310e-01]
 [ 1.10428196e-01  3.22419077e-02 -2.24330097e+00 -1.56704423e+00
  -1.90444429e+00  3.37615171e+00  1.25432776e+01  1.68024572e+00
   3.81576718e+00  1.67332085e+00  1.02725920e+03  9.71782083e-01
   9.62447295e-01  2.51782296e-01  1.49724851e+00  3.00878310e-01]
 [ 1.10428196e-01  3.22419077e-02 -2.24330097e+00 -1.56704423e+00
  -1.90444429e+00  3.37615171e+00  1.25432776e+01  1.68024572e+00
   3.81576718e+00  1.67332085e+00  1.02725920e+03  9.71782083e-01
   9.62447295e-01  2.51782296e-01  1.49724851e+00  3.00878310e-01]
 [ 1.10428196e-01  3.22419077e-02 -2.24330097e+00 -1.56704423e+00
  -1.90444429e+00  3.37615171e+00  1.25432776e+01  1.68024572e+00
   3.81576718e+00  1.67332085e+00  1.02725920e+03  9.71782083e-01
   9.62447295e-01  2.51782296e-01  1.49724851e+00  3.00878310e-01]
 [ 1.10428196e-01  3.22419077e-02 -2.24330097e+00 -1.56704423e+00
  -1.90444429e+00  3.37615171e+00  1.25432776e+01  1.68024572e+00
   3.81576718e+00  1.67332085e+00  1.02725920e+03  9.71782083e-01
   9.62447295e-01  2.51782296e-01  1.49724851e+00  3.00878310e-01]
 [ 1.10428196e-01  3.22419077e-02 -2.24330097e+00 -1.56704423e+00
  -1.90444429e+00  3.37615171e+00  1.25432776e+01  1.68024572e+00
   3.81576718e+00  1.67332085e+00  1.02725920e+03  9.71782083e-01
   9.62447295e-01  2.51782296e-01  1.49724851e+00  3.00878310e-01]]

> Can it be that you have a weird transform which collapses parameters onto the same value (e.g., rounding)?

I don't think so. My prior is uniform and the parameters are fully independent, so mapping from a unit hypercube to the prior space is about as simple as it gets. No rounding either. Or am I misunderstanding which transform you are talking about?

> warning_errors=True is related to compilation warnings, I think.

Oh right, that makes sense!

JohannesBuchner commented 6 months ago

All of these are the same live point. Something went very wrong here.
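Why identical live points trigger this warning can be reproduced in isolation. The sketch below is not UltraNest code. It only reuses the expression from the traceback, `slopes = (b2 - b1) / (a2 - a1)`, on two hand-made identical points (the first three coordinates of the rows printed above), so both numerator and denominator are zero:

```python
import warnings

import numpy as np

# Two identical points in unit-cube space (a1, a2) and in transformed
# space (b1, b2), mimicking the duplicated rows of active_u / active_v.
a1 = np.array([0.00695213, 0.5107473, 0.417411])
a2 = a1.copy()
b1 = np.array([0.110428196, 0.0322419077, -2.24330097])
b2 = b1.copy()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Same expression as in the traceback; here it evaluates 0 / 0.
    slopes = (b2 - b1) / (a2 - a1)

print(slopes)                             # every entry is NaN
print([str(w.message) for w in caught])   # the RuntimeWarning text
```

The exact wording of the warning depends on the NumPy version, so no expected message is shown here.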

JohannesBuchner commented 6 months ago

Maybe print out in your transform and likelihood which points are sampled.
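One way to follow this suggestion without touching the model code is to wrap both callables in a small logging helper. This is only a sketch: `logged`, `prior_transform` and `log_likelihood` are hypothetical names standing in for the real functions, which the thread does not show.

```python
import numpy as np


def logged(name, func):
    """Wrap func so that every input and output is printed."""
    def wrapper(x):
        result = func(x)
        print(f"{name}: in={np.asarray(x)} out={result}", flush=True)
        return result
    return wrapper


# Hypothetical stand-ins for the real transform and likelihood.
def prior_transform(cube):
    # Uniform prior on [-5, 5] for every parameter.
    return 10.0 * cube - 5.0


def log_likelihood(params):
    # Standard normal log-likelihood, up to a constant.
    return -0.5 * float(np.sum(params ** 2))


transform_dbg = logged("transform", prior_transform)
loglike_dbg = logged("loglike", log_likelihood)

# Stand-in for a single sampler call with one point from the unit cube.
u = np.random.uniform(size=2)
value = loglike_dbg(transform_dbg(u))
```

Passing `transform_dbg` and `loglike_dbg` to the sampler instead of the originals should make repeated points visible as identical `in=` lines. With vectorized=False, as in the debug.log above, each call receives a single point.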

timothygebhard commented 6 months ago

I think I figured it out: It looks like I accidentally placed a np.random.seed() call[^1] at a location in my code that caused all generated live points to be the same 🤦‍♂️ After fixing this, the issue described above is resolved. Unfortunately, the sampler still doesn't run properly, but I think the remaining problem is related to my other issue about limiting the runtime of the sampler...

[^1]: Side note: What is actually the recommended way for setting a seed for UltraNest? I noticed neither the sampler class itself nor the run() command take a seed argument, which is what caused me to resort to setting a global seed for numpy to begin with.
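The thread does not show where the `np.random.seed()` call was placed, so the following is a hypothetical reconstruction of the general pitfall rather than the actual code: resetting the global seed inside a function that is called repeatedly makes every draw identical.

```python
import numpy as np


def draw_point_reseeding(ndim, seed=0):
    # Bug pattern: the global seed is reset on every call,
    # so every call returns the same "random" point.
    np.random.seed(seed)
    return np.random.uniform(size=ndim)


def draw_point(ndim):
    # No reseeding: successive calls advance the global random state.
    return np.random.uniform(size=ndim)


same = np.array([draw_point_reseeding(3) for _ in range(4)])

np.random.seed(0)  # seed once, up front
different = np.array([draw_point(3) for _ in range(4)])

# Count the distinct rows in each batch of four points.
n_unique_same = np.unique(same, axis=0).shape[0]
n_unique_different = np.unique(different, axis=0).shape[0]
print(n_unique_same, n_unique_different)
```

The first batch collapses to a single distinct row, which is the same pattern as the repeated rows in active_u above.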

JohannesBuchner commented 6 months ago

You should set the seed at the beginning, before instantiating the sampler. If you are using MPI, ultranest assigns a seed to non-zero rank processes.
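A minimal sketch of that advice, assuming a toy two-parameter problem: `prior_transform`, `log_likelihood`, `run_seeded` and the parameter names are hypothetical, and the UltraNest import sits inside the function only so that the helpers can be tried without the package installed.

```python
import numpy as np


def prior_transform(cube):
    # Hypothetical uniform prior on [-5, 5] for every parameter.
    return 10.0 * cube - 5.0


def log_likelihood(params):
    # Hypothetical toy likelihood: standard normal, up to a constant.
    return -0.5 * float(np.sum(params ** 2))


def run_seeded(seed=42):
    # Imported here so the helpers above work without UltraNest.
    from ultranest import ReactiveNestedSampler

    # Seed the global NumPy state once, before the sampler is created,
    # and nowhere else (in particular not inside transform/likelihood).
    np.random.seed(seed)
    sampler = ReactiveNestedSampler(
        ["a", "b"], log_likelihood, transform=prior_transform
    )
    return sampler.run(min_num_live_points=64)
```

Calling `run_seeded()` starts a full sampling run. How seeds are handled on non-zero MPI ranks is left to UltraNest, as described above.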