JohannesBuchner / UltraNest

Fit and compare complex models reliably and rapidly. Advanced nested sampling.
https://johannesbuchner.github.io/UltraNest/

Runaway widening of roots #81

Closed matteobachetti closed 1 year ago

matteobachetti commented 1 year ago

Description

We implemented Bexvar in Stingray some months ago, as you might remember. When testing ultranest-dependent code, we have a case where we feed it bad data and expect a warning. It used to just exit after a few seconds with gibberish results, which was pretty minor. Now it enters an apparently infinite loop, widening the roots more and more until the CI run times out:

[ultranest] Sampling 400 live points from prior ...
[ultranest] Widening roots to 799 live points (have 400 already) ...
[ultranest] Sampling 399 live points from prior ...
[ultranest] Widening roots to 1597 live points (have 799 already) ...
[ultranest] Sampling 798 live points from prior ...
[ultranest] Widening roots to 3193 live points (have 1597 already) ...
[ultranest] Sampling 1596 live points from prior ...
[ultranest] Widening roots to 6385 live points (have 3193 already) ...
[ultranest] Sampling 3192 live points from prior ...
[ultranest] Widening roots to 12769 live points (have 6385 already) ...
[ultranest] Sampling 6384 live points from prior ...
[ultranest] Widening roots to 25537 live points (have 12769 already) ...
[ultranest] Sampling 12768 live points from prior ...
[ultranest] Widening roots to 51073 live points (have 25537 already) ...
[ultranest] Sampling 25536 live points from prior ...
[ultranest] Widening roots to 102145 live points (have 51073 already) ...
[ultranest] Sampling 51072 live points from prior ...
[ultranest] Widening roots to 204289 live points (have 102145 already) ...
[ultranest] Sampling 102144 live points from prior ...
[ultranest] Widening roots to 408577 live points (have 204289 already) ...
[ultranest] Sampling 204288 live points from prior ...
[ultranest] Widening roots to 817153 live points (have 408577 already) ...
[ultranest] Sampling 408576 live points from prior ...
[ultranest] Widening roots to 1634305 live points (have 817153 already) ...
[ultranest] Sampling 817152 live points from prior ...
[ultranest] Widening roots to 3268609 live points (have 1634305 already) ...
[ultranest] Sampling 1634304 live points from prior ...
[ultranest] Widening roots to 6537217 live points (have 3268609 already) ...
[ultranest] Sampling 3268608 live points from prior ...
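
The numbers in this log follow a simple pattern: each widening step goes from n live points to 2n - 1, so the population roughly doubles every round. A minimal sketch that reproduces the sequence is given here; the growth rule and the function name are inferred from the log output only and are not taken from UltraNest's source code.

```python
def widening_targets(n_start, steps):
    """Return successive live-point targets, assuming each step goes from n to 2n - 1.

    The rule is inferred from the log output above; it is an assumption,
    not UltraNest's actual implementation.
    """
    targets = []
    n = n_start
    for _ in range(steps):
        n = 2 * n - 1
        targets.append(n)
    return targets


# The log shows 14 widening steps starting from 400 live points.
targets = widening_targets(400, 14)
print(targets[:3])  # [799, 1597, 3193]
print(targets[-1])  # 6537217

# Equivalent closed form: n_k = (n_0 - 1) * 2**k + 1
closed_form = [(400 - 1) * 2**k + 1 for k in range(1, 15)]
print(targets == closed_form)  # True
```

Under this rule the number of prior samples drawn grows exponentially with the number of widening steps, which is consistent with the CI timeout.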

What I Did

I think that the only thing I did was feed non-integer counts to bexvar. Again, the purpose was to fail gracefully.

JohannesBuchner commented 1 year ago

Could you please attach the debug.log?

I think this could be caused by a likelihood plateau at the beginning of the run.

But I agree, this is not ideal behaviour. I should probably add a maximum number of roots to expand to.

JohannesBuchner commented 1 year ago

I cannot reproduce your bug with the latest stingray repo and ultranest 3.5.7 from PyPI.

I also tried reverting the commit c140664223b1211c7730d0b375fc83f0757d41a3, and going back to this version: https://github.com/StingraySoftware/stingray/commit/8ad7272fcd2a155bd6d2eeabe68be58bdec92742

Please let me know how I can reproduce the runaway behaviour.

matteobachetti commented 1 year ago

@JohannesBuchner I would also have gone to commit 8ad7272 of stingray. Up to that point, a simple pytest test_bexvar.py could have reproduced the result. Have you made any new modifications to UltraNest that might have fixed the behavior?

JohannesBuchner commented 1 year ago

No

jpl-jengelke commented 1 year ago

We are also experiencing this issue with EXOTIC, using UltraNest 3.5.7.

To wit:

[ultranest] Widening roots to 6537217 live points (have 3268609 already) ...
[ultranest] Sampling 3268608 live points from prior ...

It does an initial sampling of the parameter space to estimate where to do a constrained search but the number of live points gets so large that it needs more initial samples than it usually takes to converge. ... I've been noticing some TESS light curves requiring over a million function calls when it only takes ~10,000 to converge. ...

This is perhaps okay for a single reduction but when running against massive data sets, it brings our applications to their knees. ⚔️

Version 3.5.6 seems to be working fine.

Discovered by @pearsonkyle in EXOTIC. Please contact him for more info.

JohannesBuchner commented 1 year ago

Context

This happens when a large fraction of the prior parameter space has the same loglikelihood, i.e., a plateau. Such plateaus need to be handled in a special way in nested sampling, otherwise you get biases, as this paper https://arxiv.org/abs/2005.08602 pointed out. This paper https://arxiv.org/abs/2010.13884 discusses a strategy, which is implemented in ultranest. In particular, the live points on the plateau need to be discarded together, without replacement, until the plateau is crossed. But this reduces the number of live points, and the subsequent run would have few live points available, making it inefficient and probably also yielding poor posteriors. Hence the widening of the initial live point population, in the hope that a reasonable number remains.
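
To get a feel for how many live points a plateau costs, one can draw points from the prior and count how many tie for the lowest loglikelihood. The sketch below is only an illustration: the helper names, the uniform prior on the unit square and the toy Gaussian likelihood are assumptions, and this is not how ultranest implements its plateau handling.

```python
import random


def loglike_with_plateau(params):
    """Toy likelihood: constant penalty when the sum of the parameters exceeds 1."""
    if params[0] + params[1] > 1:
        return -1e300  # identical value everywhere in the forbidden region
    # Hypothetical Gaussian centred at (0.3, 0.2), for illustration only.
    return -0.5 * ((params[0] - 0.3) ** 2 + (params[1] - 0.2) ** 2) / 0.1 ** 2


def plateau_fraction(loglike, n_live, seed=0):
    """Fraction of uniform prior draws that tie for the lowest log-likelihood."""
    rng = random.Random(seed)
    values = [loglike((rng.random(), rng.random())) for _ in range(n_live)]
    lowest = min(values)
    return sum(1 for v in values if v == lowest) / n_live


fraction = plateau_fraction(loglike_with_plateau, n_live=400)
# Points on the plateau are discarded together, so roughly this many remain:
remaining = round(400 * (1 - fraction))
print(f"plateau fraction: {fraction:.2f}, live points left: {remaining}")
```

With this toy setup about half of the prior lies on the plateau, so about half of the initial live points would be lost before the run properly starts.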

Solution

You can avoid this by defining a log-likelihood that does not have plateaus. Probably you are returning a constant low value when the parameters are problematic/unphysical. Instead, return a value that increases towards the good region.

For example, let's say you have two parameters where the sum must be below 1. Replace this:

if params[0] + params[1] > 1:
    return -1e300

with:

if params[0] + params[1] > 1:
    return -1e300 * (params[0] + params[1])
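
Put into a complete function, the change might look like the sketch below. The Gaussian term and the name `loglike` are placeholders for the actual model (assumptions for illustration); only the penalty branch comes from the suggestion above.

```python
def loglike(params):
    """Log-likelihood with a sloped penalty instead of a flat one."""
    total = params[0] + params[1]
    if total > 1:
        # Larger violations get lower values, so the penalty rises
        # towards the allowed region instead of forming a plateau.
        return -1e300 * total
    # Placeholder model term (an assumption; replace with your own likelihood).
    return -0.5 * ((params[0] - 0.3) ** 2 + (params[1] - 0.2) ** 2) / 0.1 ** 2


far = loglike((0.9, 0.9))     # sum = 1.8, strongly violates the constraint
near = loglike((0.6, 0.5))    # sum = 1.1, barely violates it
inside = loglike((0.3, 0.2))  # allowed region
print(far < near < inside)    # True
```

Because different points in the forbidden region now get different values, they no longer tie, and the sampler can order them and move towards the allowed region.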

Mitigation

There are probably likelihoods where all values are identical, for example a no-data case. We should probably put in a limit to the widening as an additional parameter, with a clear warning and instructions on how to improve things (like the above). Maybe warn at 100,000 and stop trying at 500,000?
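
Such a limit could work along these lines. This is a hypothetical sketch, not ultranest code: the function name, the `plateau_persists` callback, the 2n - 1 growth rule (inferred from the log in the issue description) and the return convention are all assumptions; the thresholds are the ones floated above.

```python
import warnings


def widen_with_limit(n_live, plateau_persists, num_warn=100_000, num_max=500_000):
    """Widen the live-point population while a plateau persists, up to a cap.

    plateau_persists: callable taking the current number of live points and
    returning True if too many of them still share the same log-likelihood.
    """
    warned = False
    while plateau_persists(n_live):
        target = 2 * n_live - 1  # assumed growth rule, inferred from the log
        if target > num_max:
            break  # stop widening and continue with what we have
        if target > num_warn and not warned:
            warnings.warn(
                "Likelihood plateau detected; consider a log-likelihood "
                "that slopes towards the good region."
            )
            warned = True
        n_live = target
    return n_live


# A likelihood that is constant everywhere never leaves the plateau:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    final = widen_with_limit(400, plateau_persists=lambda n: True)
print(final, len(caught))  # 408577 1
```

In this sketch the loop ends at the last target below the cap and warns once, so a no-data likelihood terminates instead of doubling indefinitely.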

JohannesBuchner commented 1 year ago

Hi all, could you please have a look and test this pull request: https://github.com/JohannesBuchner/UltraNest/pull/96

If it is suitable and works (adds a useful print to the output, avoids infinite looping), I would merge it and make a new release.

JohannesBuchner commented 1 year ago

This should work now. There are two unit tests, a warning when the run-away seems to occur, and here is how to configure it:

sampler.run(....,
    widen_before_initial_plateau_num_warn=10000,
    widen_before_initial_plateau_num_max=50000,
)

I created a release (3.6.0), please test it and let me know if it works for you.

matteobachetti commented 1 year ago

Hi @JohannesBuchner, sorry for missing the previous message, and thanks for the change! I will update ultranest and let you know if the problem appears again.