JohannesBuchner / UltraNest

Fit and compare complex models reliably and rapidly. Advanced nested sampling.
https://johannesbuchner.github.io/UltraNest/

4.1.6 performs worse #124

Closed Joshuaalbert closed 3 months ago

Joshuaalbert commented 5 months ago

@JohannesBuchner while doing a cross comparison of multiple NS implementations, I found that upgrading from 3.5 to 4.1.6 decreased performance and introduced strange performance scaling: e.g. max_num_improvement_loops=0 is sometimes more efficient than max_num_improvement_loops=-1, and sometimes this relation switches as min_num_live_points increases. Also, Static performed better than Reactive in 3.5. Not sure if these anecdotal remarks are helpful, but I would be happy to call if you think that would be useful.

JohannesBuchner commented 5 months ago

Thank you for this very useful feedback. I guess you are running without a step sampler. The main change that occurred there is the introduction of MaxPrincipleGapAffineLayer. Can you please try replacing MaxPrincipleGapAffineLayer with AffineLayer in ultranest/integrator.py and see if that is the change that makes the difference? Are you looking at one inference problem in particular, or several?

Joshuaalbert commented 5 months ago

My pleasure. I don't have access to that setting from the end-user side. We're integrating ultranest into bilby and then running it on a set of problems (https://github.com/Joshuaalbert/jaxns-cosmology/blob/main/jaxns_cosmology/models/__init__.py). The integration is here: https://github.com/Joshuaalbert/jaxns-cosmology/blob/main/jaxns_cosmology/ultranest_sampler.py#L219. I inherited this implementation from Roberto Ruiz and touched it up in response to your advice. The grid of hyper-parameters searched over for ultranest is here: https://github.com/Joshuaalbert/jaxns-cosmology/blob/main/experiment/main.py#L124

JohannesBuchner commented 5 months ago

Can you do the following?

sampler = ReactiveNestedSampler(...)
sampler.transform_layer_class = ultranest.mlfriends.AffineLayer
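
For context, a minimal end-to-end sketch of this workaround might look as follows (the likelihood, transform, parameter names and run arguments here are placeholders, not the actual bilby integration):

import numpy as np
import ultranest
import ultranest.mlfriends

def loglike(theta):
    # placeholder likelihood: standard Gaussian
    return -0.5 * np.sum(theta ** 2)

def transform(cube):
    # placeholder prior transform: unit cube -> [-5, 5]
    return cube * 10 - 5

sampler = ultranest.ReactiveNestedSampler(["a", "b"], loglike, transform)
# use AffineLayer instead of the new MaxPrincipleGapAffineLayer default
sampler.transform_layer_class = ultranest.mlfriends.AffineLayer
result = sampler.run(min_num_live_points=150, max_num_improvement_loops=0)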

Joshuaalbert commented 5 months ago

I'll try that out.

Joshuaalbert commented 5 months ago

I see that more likelihood evaluations are used when max_num_improvement_loops=-1:

{
  "model_name": "CMB",
  "params": {
    "min_num_live_points": 150,
    "max_ncalls": 50000000,
    "max_num_improvement_loops": -1
  },
  "likelihood_evals": 156466
}
{
  "model_name": "CMB",
  "params": {
    "min_num_live_points": 150,
    "max_ncalls": 50000000,
    "max_num_improvement_loops": 0
  },
  "likelihood_evals": 63019
}

JohannesBuchner commented 5 months ago

Well, with the higher number of evaluations you are also supposed to get a refined ln(Z) estimate and posterior distribution.

JohannesBuchner commented 5 months ago

For a fair comparison across samplers you should compare cost for the same quality, which may be ln(Z) scatter around the true value, or distribution of the posterior z-score at the true input parameter value (GW papers do this a lot these days).
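
As a rough illustration of the posterior z-score idea (the helper below and the assumption of equally weighted posterior samples are mine, not a prescription from any particular paper):

import numpy as np

def posterior_zscore(samples, theta_true):
    # samples: equally weighted posterior samples, shape (n_samples, n_dim)
    # theta_true: the true/injected parameter vector
    mean = samples.mean(axis=0)
    std = samples.std(axis=0)
    return (theta_true - mean) / std

Across many simulated datasets, the spread of these z-scores (or of ln(Z) around its true value) characterises quality at a given cost.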

Joshuaalbert commented 5 months ago

It's not an evidence comparison that is being done, but rather a posterior inference test. Also, we're assuming the default stopping conditions for each sampler, as they are all different.

JohannesBuchner commented 5 months ago

I think what you see is expected: you should see higher ESS for higher computational cost runs, and a posterior comparison across samplers should consider this.
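
For reference, one common way to quantify the effective sample size of a weighted posterior sample is the Kish estimate; this helper is a generic sketch, not an UltraNest API:

import numpy as np

def kish_ess(weights):
    # Kish effective sample size: (sum w)^2 / sum(w^2)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)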

While max_num_improvement_loops!=0 with a small number of nlive could dynamically improve the run where needed, I often run with max_num_improvement_loops=0 and a decent number of nlive, because having more live points in each iteration is directly related to the ln(Z) errors and is more robust for exploring (multiple) modes than widening nlive only around the posterior bulk.

Your likelihood setup makes sense to me for comparing algorithmic performance. However, I do want to point out that UltraNest could run much faster in terms of wall-clock time if you provide a vectorized likelihood function (and regulate it with the ndraw_min and ndraw_max parameters). This would be quite efficient with JAX-based likelihoods, and we use this very effectively for neural-network-emulator-based fitting models.
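
A minimal sketch of what such a vectorized setup could look like, assuming a JAX likelihood that evaluates a whole batch of points at once (the likelihood, prior transform and ndraw values below are illustrative placeholders):

import jax
import jax.numpy as jnp
import numpy as np
import ultranest

@jax.jit
def loglike_batch(theta):
    # theta has shape (n_points, n_dim); return one log-likelihood per row
    return -0.5 * jnp.sum(theta ** 2, axis=1)

def loglike(theta):
    # UltraNest passes a 2-d array of points when vectorized=True
    return np.asarray(loglike_batch(jnp.asarray(theta)))

def transform(cube):
    # cube has shape (n_points, n_dim); map the unit cube to [-5, 5]
    return cube * 10 - 5

param_names = ["x%d" % i for i in range(10)]
sampler = ultranest.ReactiveNestedSampler(
    param_names, loglike, transform,
    vectorized=True, ndraw_min=512, ndraw_max=16384)
result = sampler.run()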

Please let me know if the MaxPrincipleGapAffineLayer is problematic here and for which of the five problems, because then I would change the default.

Joshuaalbert commented 5 months ago

AffineLayer grid search is almost over, but I can say this already:

Model            AffineLayer          MaxPrincipleGapAffineLayer
CMB              Works                Works
EggBox10D        Fails*               Fails*
MSSM7            Fails*               Sometimes Fails*
Rosenbrock10D    Fails*               Fails*
SpikeSlab10D     Fails to converge    Fails to converge

Each model is tried over a grid of hyper-parameters, and we look at the visual quality and completeness of the corner plots to assess convergence. We are generous in that a noisy but correct shape still counts as converged.

Fails* = does not complete within 1 hour or 50M likelihood evaluations
Fails to converge = does not converge to the correct solution
Works = finished and converged in all tried hyper-parameter settings

A bit about our method:

We are assessing the typical path of a scientist who would want to use NS as part of their analysis, meaning that we designed a comparison protocol that factors in the typical constraints of the average scientist: only exploring the most easily understood hyper-parameters, leaving the rest at their defaults (including stopping conditions), and setting computational resource limits in terms of wall time and likelihood evaluations (so that similarly expensive likelihoods translate into comparable wall times). We established the grid searches and computational limits ahead of time, and then ran everything. Then, for fairness of comparison, when an NS implementation has some issue we contact the author and make an effort to make a change (within the parameters of the protocol, e.g. small hyper-parameter changes, version updates).

At this point it looks like we've got the most out of ultranest within this protocol, and I'll take the best I've been able to attain (using the latest version and AffineLayer). I searched over min_num_live_points in {25D, 50D, 100D} and max_num_improvement_loops in {0, -1}.

JohannesBuchner commented 5 months ago

Thank you for sharing these results. Just for clarification, are the results for 3.5 better (i.e. more often completed)?

Joshuaalbert commented 5 months ago

It's the same as the current version with AffineLayer. Though I also searched over the Reactive and Static samplers, the runs that I saved used the fewest live points, and those came from the Static sampler. But I do recall that the Reactive sampler converged on the same runs, just using more evals. So I am fairly confident that 3.5's Reactive sampler was about the same as 4.1.6 with AffineLayer.

JohannesBuchner commented 5 months ago

Please be sure to label the results in your comparison as "MLFriends" or "UltraNest/MLFriends", as that is the algorithm used here. UltraNest can use several algorithms, including slice sampling, which is better at high dimensions and difficult curvatures.

I would also recommend adding an ultranest configuration with a step sampler, as laid out in the ultranest tutorials. A setup I now often use is:

# warm-up pass with the default region sampling, capped at 50000 likelihood calls
sampler.run(max_ncalls=50000, **run_kwargs)
# then switch to slice sampling for the remainder of the run
sampler.stepsampler = ultranest.stepsampler.SliceSampler(
    nsteps=3*ndim, generate_direction=ultranest.stepsampler.generate_mixture_random_direction,
    adaptive_nsteps='move-distance', region_filter=False)
sampler.run(**run_kwargs)

For eggbox: In 2d (18 modes), MLFriends samples this very efficiently. In 3d (more than 100 modes), a reactive run realises it needs more live points per cluster. But I am not surprised that in 10d, which is extremely highly multimodal, the shape is explored very inefficiently by this very cautious algorithm. The spatial clustering won't be able to break apart the clusters. I am not sure for which astrophysical application millions of modes are relevant. Extending a 2d eggbox to a 10d problem with a Gaussian or other monomodal distribution in the other dimensions seems more reasonable to me.

For spikeslab, I can reproduce that 400 live points finds the slab but misses the spike. For 2000 live points, the spike is found in 4, 6, 8 dimensions, but not in 10. If you set a maximum number of likelihood calls, then the spike may be noticed but its full weight may not be realised by sampling to its peak; ultranest will give you what it knows so far.

For rosenbrock, I can also reproduce that MLFriends is inefficient. I think what happens here is that it realises that the likelihood contour is not ellipsoidal, and then proceeds very cautiously. The efficiency at the same dimension with a Gaussian would be much higher.
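
For what it's worth, the hybrid test problem suggested above (a 2d eggbox extended with a monomodal Gaussian in the remaining dimensions) could be sketched roughly as follows; the eggbox parametrisation and prior ranges are my assumptions, following the common MultiNest-style form:

import numpy as np

def loglike(theta):
    # 2d eggbox in the first two dimensions, standard Gaussian in the rest
    x, y = theta[0], theta[1]
    logl_eggbox = (2.0 + np.cos(x / 2.0) * np.cos(y / 2.0)) ** 5.0
    logl_gauss = -0.5 * np.sum(theta[2:] ** 2)
    return logl_eggbox + logl_gauss

def transform(cube):
    # first two dimensions on [0, 10*pi], the remaining ones on [-5, 5]
    params = cube.copy()
    params[:2] = cube[:2] * 10 * np.pi
    params[2:] = cube[2:] * 10 - 5
    return params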

Joshuaalbert commented 5 months ago

I will.

JohannesBuchner commented 5 months ago

The performance regression should be resolved with 4.2.0.

For details, see https://github.com/JohannesBuchner/UltraNest/commit/cd5b02ee5cb14af38fd10c3633809338e9272dbc

JohannesBuchner commented 3 months ago

Please reopen if this issue still exists.

Joshuaalbert commented 3 months ago

@JohannesBuchner which citation would you like us to use for UltraNest/MLFriends?

JohannesBuchner commented 3 months ago

see https://johannesbuchner.github.io/UltraNest/issues.html#how-should-i-cite-ultranest