jacopok opened this issue 5 days ago
If changes are needed to address this, I am happy to work on them in a PR!
Yeah, I found something similar, but am not sure why it occurs.
The code is here: https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/hotstart.py#L346 which sets up the auxiliary transform functions. Maybe you can take this function apart?
Maybe the `compute_quantile_intervals_refined` function needs some explanation: it creates a sequence of intervals with their weights. This is how the parameter space is squeezed. In principle, any such interval and weight should give the same result, as the squeezing and reweighting should cancel each other out. Obviously, that does not seem to hold in practice.
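As a quick sanity check of that cancellation, here is a minimal Monte Carlo sketch (my own toy example, not the library code): restricting the sampling to an interval and reweighting by the interval length leaves the evidence estimate unchanged, as long as the interval contains essentially all of the likelihood mass.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1e-2

def like(x):
    # narrow Gaussian likelihood centred at 0.5; with a uniform prior on [0, 1], Z ~= 1
    return np.exp(-0.5 * ((x - 0.5) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

n = 200_000

# original problem: uniform prior on the unit interval
Z_plain = like(rng.uniform(0.0, 1.0, n)).mean()

# squeezed problem: sample only inside [xlo, xhi], reweight by the interval length
xlo, xhi = 0.4, 0.6                       # interval containing essentially all the mass
Z_squeezed = like(rng.uniform(xlo, xhi, n)).mean() * (xhi - xlo)

print(Z_plain, Z_squeezed)                # both close to 1, up to Monte Carlo noise
```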
The efficiency can go down if the parameter-space squeezing creates holes, so that all the corners (of which there are many in high dimensions) need to be explored. Typically, elliptical likelihood contours are the most efficient to sample.
I'll delve into it! If you have any notes written down on the details of the mathematical formulation of the coordinate change it'd be useful, else I'll just reverse-engineer it from the code.
It is just that a zoom interval from `xlo` to `xhi` is downweighted by `1/(xhi - xlo)`. A list of intervals is first derived from marginal quantiles. An auxiliary variable t is introduced to interpolate an interval of interest. My first attempt used a Student-t distribution (`get_extended_auxiliary_problem`), but that also showed issues.
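For reference, here is a rough sketch of how a list of nested intervals and weights could be derived from marginal quantiles of the guess samples; this only illustrates the idea and is not the actual logic of `compute_quantile_intervals_refined`.

```python
import numpy as np

rng = np.random.default_rng(1)
# 1D marginal of the posterior guess, in unit-cube coordinates (illustrative)
guess = rng.normal(0.5, 1e-2, size=4000)

# nested intervals from pairs of marginal quantiles, widest to narrowest
tail_probs = [0.001, 0.01, 0.1, 0.25]
intervals = np.array([[np.quantile(guess, p), np.quantile(guess, 1 - p)]
                      for p in tail_probs])
weights = intervals[:, 1] - intervals[:, 0]   # each zoom interval is reweighted by its length

for (xlo, xhi), w in zip(intervals, weights):
    print(f"[{xlo:.4f}, {xhi:.4f}]  weight {w:.4f}")
# an auxiliary variable t then selects/interpolates which of these intervals is applied
```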
I think I understand the issue now.
This is what the transformation looks like, as a function of the coordinates $v$ and $t$: it "expands" the region of interest as it should.
Why is sampling from this not faster? Plotting the loglikelihood shows the problem:
Due to the aux volume correction, the maximum likelihood is attained at $t=1$. Nested sampling therefore quickly reaches the region of small $t$ and moderately high likelihood, but then it has to iterate until it finds the peak at $t=1$, meaning it still needs to reach the same level of prior compression as in the original problem, negating the benefits of the transformation!
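To put numbers on this, here is a small sketch with an interpolated zoom. The parameterization is my own choice (with $t=1$ recovering the original prior), not necessarily the exact convention in `hotstart.py`, but it shows the effect: the maximum of the aux likelihood over $(v, t)$ sits at the no-zoom end, where the volume-correction factor is largest.

```python
import numpy as np

sigma = 1e-3
def loglike(x):
    # original narrow Gaussian likelihood on the unit interval
    return -0.5 * ((x - 0.5) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# hypothetical zoom interval around the guess
xlo, xhi = 0.495, 0.505
w = xhi - xlo

# interpolated transform: t=0 -> fully zoomed into [xlo, xhi], t=1 -> original prior
# x(v, t) = t * v + (1 - t) * (xlo + v * w), so dx/dv = t + (1 - t) * w
def log_jacobian(t):
    return np.log(t + (1 - t) * w)        # the aux volume correction

for t in np.linspace(0, 1, 6):
    # the peak x = 0.5 is reachable at every t, so max over v of logL is loglike(0.5)
    print(f"t={t:.1f}  max log L_aux = {loglike(0.5) + log_jacobian(t):+.2f}")
# the maximum grows monotonically with t and is attained at t=1 (no zoom),
# so the sampler still has to compress all the way down to the original narrow peak
```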
I think this can be solved by implementing a fixed prior transformation with no auxiliary variable, something like this:
A spline representation of the downsampled CDF of the posterior guess is easy to sample from and differentiate to compute the volume compression.
I am working out the details (e.g. I'd like to ensure the CDF's slope is never lower than the slope outside the "compression region") and will make a PR shortly.
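Here is a minimal sketch of that kind of fixed transform, under my own assumptions (1D marginal guess samples already in unit-cube coordinates, a monotone PCHIP interpolant of a downsampled CDF mixed with the identity so its slope never drops below a floor); the names and the mixing fraction are illustrative, not what will go into the PR.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

rng = np.random.default_rng(2)
guess = rng.normal(0.5, 1e-2, size=4000)       # 1D marginal posterior guess (unit cube)

# downsampled CDF nodes: a handful of quantiles, padded with the cube edges
q = np.linspace(0.0, 1.0, 17)
x_nodes = np.concatenate([[0.0], np.quantile(guess, q), [1.0]])
F_nodes = np.concatenate([[0.0], q, [1.0]])

# mix with the identity so the CDF slope is at least `eps` everywhere
# (i.e. some prior mass is always reserved for the full original prior)
eps = 0.01
F_mix = (1 - eps) * F_nodes + eps * x_nodes

# prior transform T ~ F_mix^{-1}: a monotone interpolant through the swapped nodes
T = PchipInterpolator(F_mix, x_nodes)
dT = T.derivative()                            # dx/du, the volume compression

def transform(u):
    return T(u)

def corrected_loglike(loglike, u):
    # adding log|dx/du| keeps the evidence unchanged: Z = int L(T(u)) T'(u) du
    return loglike(T(u)) + np.log(dT(u))
```

Sampling $u$ uniformly and mapping it through `transform` concentrates points near the guess, while the `log(dT(u))` term compensates for the compression so the evidence is preserved.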
Preliminary results, sampling a Gaussian with $\sigma = 10^{-6}$ along each axis in 6D, 200 live points, `frac_remain=1e-2`:
UltraNest 4.3.4:

| Scenario | time (s) | logZ | error (sigmas) |
|---|---|---|---|
| Regular sampling | 88.10 | -0.00 ± 0.64 | 0.00 |
| Correct standard deviation | 8.46 | -0.01 ± 0.10 | 0.09 |
| Overestimated standard deviation | 4.97 | 0.11 ± 0.14 | 0.82 |
| Way overestimated standard deviation | 7.31 | 0.13 ± 0.33 | 0.39 |
Thanks for looking into this.
My thought was that even with a "wrong" transformation, the sampler could zoom back out to the original prior. But yes, the transform should ideally be built a bit broader than the expected posterior, so that nested sampling is actually zooming in (which is efficient) and does not need to zoom out and navigate the difficult funnel. That is, the modified likelihood should rise monotonically to the peak, like the original likelihood does.
The current `compute_quantile_intervals_refined` function doesn't do a good job of that.
In addition, there is the problem that the parameter space is currently cut off by the zooming, which can lose posterior mass. Probably the transform should reserve 1% for the original full prior space and 99% for the zoomed space, or something like that.
> Thanks for looking into this.
>
> My thought was that even with a "wrong" transformation, the sampler could zoom back out to the original prior. But yes, the transform should ideally be built a bit broader than the expected posterior, so that nested sampling is actually zooming in (which is efficient) and does not need to zoom out and navigate the difficult funnel. That is, the modified likelihood should rise monotonically to the peak, like the original likelihood does.
I agree with these considerations, but I do not think that is the problem with the current implementation: inference is slow regardless of the breadth of the transform, since the `aux_likelihood` always peaks at (its untransformed peak, $t=1$), and the prior volume compression needed to reach that region is the same as in the original problem.
> The current `compute_quantile_intervals_refined` function doesn't do a good job of that.
>
> In addition, there is the problem that the parameter space is currently cut off by the zooming, which can lose posterior mass. Probably the transform should reserve 1% for the original full prior space and 99% for the zoomed space, or something like that.
Indeed, in the implementation I'm drafting (sketched in the figure in my previous comment), some space is reserved for the original prior - I think even 50% is fine, since it should only take $O(n_{\text{live}} \log 2)$ iterations to traverse, which is not that many.
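For a concrete number: each nested-sampling iteration removes roughly $1/n_{\text{live}}$ of the remaining log prior volume, so traversing a factor of two in volume takes about $n_{\text{live}} \ln 2 \approx 139$ iterations with 200 live points.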
I am also a bit worried about the rectangles, as they introduce many corners, and therefore the modified likelihood contours become non-elliptical and highly inefficient to sample. Would applying your curve in a spherically symmetric way be feasible? Applying an affine transformation could account for the scale in each variable -- I think the code for this is already there for the Student-t case.
I agree that corners are to be avoided, and indeed I iterated towards a smoother parameterization: with the interpolator I'm using now in the PR, the derivative of the deformation map (whose log is shown as a dashed line) is constrained to be smooth. I'm thinking of implementing a refinement algorithm that starts from the full guess CDF sampled at all the posterior points and (greedily?) removes sampling points until the transform is "smooth enough".
I don't quite understand the points about symmetry and the multivariate Student-t distribution. The approach I'm proposing is essentially non-parametric: the marginal posterior guesses are allowed to be skewed or even multimodal - this seems like a good feature to have, as long as the transform remains smooth enough, right?
Summary
I have been experimenting with warm starting on a toy problem, and I found some strange behaviour. The sampling seems to reach the correct region quickly, as expected, but then it takes a long time to converge to the right value of $Z$. Also, the estimate of $Z$ is sometimes biased (the true value lies several sigmas from the estimate).
Description
The toy problem is an $n$-dimensional Gaussian likelihood with mean $\mu = 0.5$ on every axis and a small width $\sigma \sim 10^{-2}$. The prior transform is the identity on the unit cube. The evidence is then expected to be $Z \approx 1$ (to a very good approximation).
The point I'm making here also shows in 1D, but the script is versatile: it can run in higher dimensions if desired. The run times are reported for a 3D case, the trace plot is in 1D.
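For reference, a minimal sketch of the toy setup (my own stripped-down version; the full script is referenced further down, and the UltraNest calls in the comment are only indicative):

```python
import numpy as np

ndim, mu, sigma = 3, 0.5, 1e-2
lognorm = -ndim * np.log(sigma * np.sqrt(2 * np.pi))

def loglike(theta):
    # n-dimensional Gaussian centred at 0.5 with width sigma on each axis
    return lognorm - 0.5 * np.sum(((theta - mu) / sigma) ** 2)

def prior_transform(u):
    # identity on the unit cube, so the evidence is Z ~= 1 (log Z ~= 0)
    return u

# sampler = ultranest.ReactiveNestedSampler([f"x{i}" for i in range(ndim)],
#                                           loglike, transform=prior_transform)
# result = sampler.run(min_num_live_points=200, frac_remain=1e-3)
```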
I am comparing a regular NS run to runs done with auxiliary likelihoods and priors obtained with `get_auxiliary_contbox_parameterization`, with the contours obtained from the Gaussian "guesses" described below.

In the language of the SuperNest paper by Petrosyan and Handley, the KL divergence between the original prior and the posterior is (in the $\sigma \ll 1$ approximation) $$\mathcal{D}_{\pi}(\mathcal{P}) \approx -\frac{1}{2} (1+\log(2\pi)) - \log \sigma$$ per dimension, which comes out to about 3.2 nats for $\sigma=10^{-2}$. With the guesses, on the other hand, we are going from a Gaussian modified prior with width $k \sigma$ to a Gaussian posterior with width $\sigma$, therefore (the same result as here, but in nats) $$\mathcal{D}_{\tilde{\pi}}(\mathcal{P}) = \log k + \frac{1}{2} \left( \frac{1}{k^{2}} - 1 \right)$$
The examples I'm considering are $k = [0.5, 1, 2]$, with corresponding distances $[0.81, 0, 0.32]$ nats. The prior being equal to the posterior is a degenerate case, of course, but this still indicates that we should expect a good speed up!
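The numbers can be checked directly from the two expressions above:

```python
import numpy as np

sigma = 1e-2

# KL divergence from the uniform unit prior to the Gaussian posterior, per dimension
d_prior = -0.5 * (1 + np.log(2 * np.pi)) - np.log(sigma)
print(f"uniform prior -> posterior: {d_prior:.2f} nats")          # ~3.19

# KL divergence from a Gaussian guess prior of width k*sigma to the posterior of width sigma
for k in (0.5, 1.0, 2.0):
    d_guess = np.log(k) + 0.5 * (1 / k**2 - 1)
    print(f"k = {k}: {d_guess:.2f} nats")                          # ~0.81, 0.00, 0.32
```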
Instead, when I run with the auxiliary sampler, the time performance is sometimes worse.
Also, although the evidence errors are indeed smaller (as they should be), in the case of a too-thin prior the evidence is underestimated (and the error is not correctly estimated).
The script is as shown below.
This is what happens when `frac_remain` is set to a low number ($10^{-3}$) in all cases; if it is set higher ($0.5$), things are closer to the expectations:

However, this does not clear up the issue: why does the sampler "get stuck" in the same region, making very slow progress on the last contributions to the integral? Here is a trace plot for the same problem with `frac_remain=1e-3`, but in 1D and in the correctly estimated standard deviation case. What is going on? Why are the points getting more spread out?