JohannesBuchner / UltraNest

Fit and compare complex models reliably and rapidly. Advanced nested sampling.
https://johannesbuchner.github.io/UltraNest/

Upon running with 'resume', Ultranest terminates the run with the message 'No changes made. Probably the strategy was to explore in the remainder, but it is irrelevant already; try decreasing frac_remain.' #80

garvitagarwal290 opened this issue 1 year ago

garvitagarwal290 commented 1 year ago

Description

I have time series data that I am fitting with models of 9 and 15 parameters. The fitting works fine for most of the data, but for a large portion of it the run continues for 96 hours, at which point it hits the wall-time limit of the HPC I am using. I then rerun the fitting of these time series using the 'resume' feature. The program runs for about 10 minutes but then terminates (with Exit_status = 0), and the final output of the run looks like the following.

[ultranest] Likelihood function evaluations: 428273387
[ultranest] Writing samples and results to disk ...
[ultranest] Writing samples and results to disk ... done
[ultranest] No changes made. Probably the strategy was to explore in the remainder, but it is irrelevant already; try decreasing frac_remain.
[ultranest] done iterating.

logZ = -21590.640 +- 1.022
  single instance: logZ = -21590.640 +- 0.235
  bootstrapped   : logZ = -21602.796 +- 0.751
  tail           : logZ = +- 0.693
insert order U test : converged: True correlation: inf iterations

    per_bary            : 56.131191572319658│                   ▇                   │57.231191572319652    56.681191572319669 +- 0.000000000000014
    a_bary              : 44.586│                   ▇                   │45.686    45.136 +- 0.000
    r_planet            : 0.065 │                   ▇                   │0.183     0.125 +- 0.000
    b_bary              : 0.000001000000000000│         ▇                             │0.731980832173530938    0.181980832173530810 +- 0.000000000000000056
    ecc_bary            : 0.073 │                       ▇               │1.000     0.623 +- 0.000
    w_bary              : 331.093│                   ▇                   │332.193    331.643 +- 0.000
    t0_bary_offset      : -0.050│                   ▇                   │0.050     -0.001 +- 0.000
    M_planet            305550513862216355811887677440 +- 35184372088832
    r_moon              : 0.000000123800000000│                      ▇                │0.123799876200000006    0.069941430543806068 +- 0.000000000000000014
    per_moon            : 0.612 │                   ▇                   │1.712     1.162 +- 0.000
    tau_moon            : 0.0000010000000000000│   ▇                                   │0.5984283447503323528    0.0484283447503322806 +- 0.0000000000000000069
    Omega_moon          : 211.584305462244203│                   ▇                   │212.684305462244225    212.134305462244271 +- 0.000000000000057
    i_moon              : 96.861177229315487│                   ▇                   │97.961177229315510    97.411177229315484 +- 0.000000000000014
    M_moon              577461808577478822312542208 +- 343597383680
    q1                  : 0.084 │                       ▇               │1.000     0.634 +- 0.000
    q2                  : 0.0000010000000000000│  ▇                                    │0.5840102855886486477    0.0340102855886486102 +- 0.0000000000000000069

I just wanted to understand what this means. Does it mean that the fitting has already converged and I can use the model parameter estimates? If yes, then why did the program keep running for 96 hours? If no, then how can I avoid this situation? By decreasing frac_remain as suggested in the output above?

PS: This doesn't happen every time I use the resume feature. In many cases, the fitting resumes okay, runs for several hours, and terminates successfully and normally.

JohannesBuchner commented 1 year ago

Yes, it looks like the resumed run thinks it has converged. This is also subject to the random numbers used to estimate the volume shrinkage in nested sampling.

Since your posterior uncertainties are zero or extremely small, it looks like you have a problem with your likelihood being extremely spiked. Probably you are underfitting the data, i.e., the model is wrong.

Maybe add a term and parameter that adds extra model/data uncertainty. This should also help convergence speed.
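One common way to do this (a minimal sketch, not necessarily the poster's exact model) is a "jitter" parameter added in quadrature to the reported uncertainties inside a Gaussian log-likelihood. Here `model_flux`, `t`, `flux`, `flux_err`, and `log_jitter` are all illustrative placeholders for the user's own transit model and data:

```python
import numpy as np

# Sketch: Gaussian log-likelihood with an extra scatter ("jitter") parameter
# added in quadrature to the reported uncertainties. model_flux, t, flux and
# flux_err are placeholders for the user's model and data arrays.
def log_likelihood(params):
    *model_params, log_jitter = params            # last parameter: log10 of extra scatter
    model = model_flux(t, model_params)           # hypothetical model prediction
    sigma2 = flux_err**2 + (10.0**log_jitter)**2  # inflate the variance with the jitter term
    return -0.5 * np.sum((flux - model)**2 / sigma2 + np.log(2 * np.pi * sigma2))
```

The extra parameter soaks up any variance the model cannot reproduce, which broadens the likelihood peak and typically speeds up convergence.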

JohannesBuchner commented 1 year ago

The "No changes made. Probably the strategy was to explore in the remainder, but it is irrelevant already; try decreasing frac_remain." can occur when max_num_improvement_loops is nonzero, and is safe to ignore. It means that the first run_iter call may have thought it would be a good idea to improve at a likelihood interval at the very end of the run, but the next run_iter call which executes this strategy sees when it arrives there that frac_remain is already reached and it is time to close the run (and not sample more points).

JohannesBuchner commented 1 year ago

The latest ultranest version is 3.5.7 btw

garvitagarwal290 commented 1 year ago

Hi. About your comment that we might be underfitting the data: we actually have many cases where the model being fit is the same model the data were generated from, i.e., the model is the correct one. In these cases we ought to get convergence, but we see the behavior above. We have also seen that if we use a subset of the time series (without changing the model), the problem goes away and convergence happens normally, with posteriors that are not delta functions. In addition, for some time series the convergence happened normally, but after modifying the uncertainties/noise of the data we again start seeing this behavior. Can there be some source of this problem other than our model choice?

JohannesBuchner commented 1 year ago

I think you have to take a closer look at your likelihood function. Take one of your problematic cases and find the point of maximum likelihood (probably also the maximum posterior). Then modify one of the parameter values in very small steps. Your likelihood seems to be extremely sensitive to slight modifications away from that peak, so look inside your likelihood function to see which data points or uncertainties cause this extreme sensitivity.
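A rough sketch of such a probe, assuming a best-fit parameter vector `p_best` and a `log_likelihood` function are available (both placeholders for the user's own objects):

```python
import numpy as np

# Nudge one parameter away from the best-fit point in increasing steps and
# watch how quickly the log-likelihood drops. A huge drop for a tiny step
# indicates an extremely spiked likelihood.
i = 0  # index of the parameter to perturb
for step in (1e-8, 1e-6, 1e-4, 1e-2):
    p = np.array(p_best, dtype=float)
    p[i] += step
    print(f"delta={step:g}  dlogL={log_likelihood(p) - log_likelihood(p_best):.3f}")
```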