jacopok opened this issue 5 days ago
If changes are needed to address this, I am happy to work on them in a PR!
Yeah, I found something similar, but am not sure why it occurs.
The code is here: https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/hotstart.py#L346 which sets up the auxiliary transform functions. Maybe you can take this function apart?
Maybe the `compute_quantile_intervals_refined` function needs some explanation: it creates a sequence of intervals with their weights. This is how the parameter space is squeezed. In principle, any such interval and weight should give the same result, as the squeezing and reweighting should cancel each other out. Obviously, that does not seem to hold in practice.
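As a quick sanity check of that cancellation, here is a minimal Monte Carlo sketch (my own toy example, not the library code): restricting the sampling to an interval and reweighting by the interval length leaves the evidence estimate unchanged, as long as the interval contains essentially all of the likelihood mass.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1e-2

def like(x):
    # narrow Gaussian likelihood centred at 0.5; with a uniform prior on [0, 1], Z ~= 1
    return np.exp(-0.5 * ((x - 0.5) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

n = 200_000

# original problem: uniform prior on the unit interval
Z_plain = like(rng.uniform(0.0, 1.0, n)).mean()

# squeezed problem: sample only inside [xlo, xhi], reweight by the interval length
xlo, xhi = 0.4, 0.6                       # interval containing essentially all the mass
Z_squeezed = like(rng.uniform(xlo, xhi, n)).mean() * (xhi - xlo)

print(Z_plain, Z_squeezed)                # both close to 1, up to Monte Carlo noise
```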
The efficiency can go down if the parameter-space squeezing creates holes, so that all the corners (of which there are many in high dimensions) need to be explored. Typically, elliptical likelihood contours are the most efficient to sample.
I'll delve into it! If you have any notes written down on the details of the mathematical formulation of the coordinate change it'd be useful, else I'll just reverse-engineer it from the code.
It is just that a zoom interval from `xlo` to `xhi` is downweighted by `1/(xhi - xlo)`. A list of intervals is first derived from marginal quantiles. An auxiliary variable t is introduced to interpolate an interval of interest. My first attempt used a Student-t distribution (`get_extended_auxiliary_problem`), but that also showed issues.
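For reference, here is a rough sketch of how a list of nested intervals and weights could be derived from marginal quantiles of the guess samples; this only illustrates the idea and is not the actual logic of `compute_quantile_intervals_refined`.

```python
import numpy as np

rng = np.random.default_rng(1)
# 1D marginal of the posterior guess, in unit-cube coordinates (illustrative)
guess = rng.normal(0.5, 1e-2, size=4000)

# nested intervals from pairs of marginal quantiles, widest to narrowest
tail_probs = [0.001, 0.01, 0.1, 0.25]
intervals = np.array([[np.quantile(guess, p), np.quantile(guess, 1 - p)]
                      for p in tail_probs])
weights = intervals[:, 1] - intervals[:, 0]   # each zoom interval is reweighted by its length

for (xlo, xhi), w in zip(intervals, weights):
    print(f"[{xlo:.4f}, {xhi:.4f}]  weight {w:.4f}")
# an auxiliary variable t then selects/interpolates which of these intervals is applied
```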
I think I understand the issue now.
This is what the transformation looks like, as a function of the coordinates $v$ and $t$: it "expands" the region of interest as it should.
Why is sampling from this not faster? Plotting the loglikelihood shows the problem:
Due to the aux volume correction, the maximum likelihood is attained at $t=1$. Nested sampling therefore quickly reaches the region of small $t$ and moderately high likelihood, but then it has to iterate until it finds the peak at $t=1$, meaning it still needs to reach the same level of prior compression as in the original problem, negating the benefits of the transformation!
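To put numbers on this, here is a small sketch with an interpolated zoom. The parameterization is my own choice (with $t=1$ recovering the original prior), not necessarily the exact convention in `hotstart.py`, but it shows the effect: the maximum of the aux likelihood over $(v, t)$ sits at the no-zoom end, where the volume-correction factor is largest.

```python
import numpy as np

sigma = 1e-3
def loglike(x):
    # original narrow Gaussian likelihood on the unit interval
    return -0.5 * ((x - 0.5) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# hypothetical zoom interval around the guess
xlo, xhi = 0.495, 0.505
w = xhi - xlo

# interpolated transform: t=0 -> fully zoomed into [xlo, xhi], t=1 -> original prior
# x(v, t) = t * v + (1 - t) * (xlo + v * w), so dx/dv = t + (1 - t) * w
def log_jacobian(t):
    return np.log(t + (1 - t) * w)        # the aux volume correction

for t in np.linspace(0, 1, 6):
    # the peak x = 0.5 is reachable at every t, so max over v of logL is loglike(0.5)
    print(f"t={t:.1f}  max log L_aux = {loglike(0.5) + log_jacobian(t):+.2f}")
# the maximum grows monotonically with t and is attained at t=1 (no zoom),
# so the sampler still has to compress all the way down to the original narrow peak
```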
I think this can be solved by implementing a fixed prior transformation with no auxiliary variable, something like this:
A spline representation of the downsampled CDF of the posterior guess is easy to sample from and differentiate to compute the volume compression.
I am working out the details (e.g. I'd like to ensure the CDF's slope is never lower than the slope outside the "compression region") and will make a PR shortly.
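Here is a minimal sketch of that kind of fixed transform, under my own assumptions (1D marginal guess samples already in unit-cube coordinates, a monotone PCHIP interpolant of a downsampled CDF mixed with the identity so its slope never drops below a floor); the names and the mixing fraction are illustrative, not what will go into the PR.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

rng = np.random.default_rng(2)
guess = rng.normal(0.5, 1e-2, size=4000)       # 1D marginal posterior guess (unit cube)

# downsampled CDF nodes: a handful of quantiles, padded with the cube edges
q = np.linspace(0.0, 1.0, 17)
x_nodes = np.concatenate([[0.0], np.quantile(guess, q), [1.0]])
F_nodes = np.concatenate([[0.0], q, [1.0]])

# mix with the identity so the CDF slope is at least `eps` everywhere
# (i.e. some prior mass is always reserved for the full original prior)
eps = 0.01
F_mix = (1 - eps) * F_nodes + eps * x_nodes

# prior transform T ~ F_mix^{-1}: a monotone interpolant through the swapped nodes
T = PchipInterpolator(F_mix, x_nodes)
dT = T.derivative()                            # dx/du, the volume compression

def transform(u):
    return T(u)

def corrected_loglike(loglike, u):
    # adding log|dx/du| keeps the evidence unchanged: Z = int L(T(u)) T'(u) du
    return loglike(T(u)) + np.log(dT(u))
```

Sampling $u$ uniformly and mapping it through `transform` concentrates points near the guess, while the `log(dT(u))` term compensates for the compression so the evidence is preserved.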
Preliminary results, sampling a Gaussian with $\sigma = 10^{-6}$ along each axis in 6D, 200 live points, `frac_remain=1e-2`:
UltraNest 4.3.4:

| Scenario | time (s) | logZ | error (sigmas) |
|---|---|---|---|
| Regular sampling | 88.10 | -0.00 ± 0.64 | 0.00 |
| Correct standard deviation | 8.46 | -0.01 ± 0.10 | 0.09 |
| Overestimated standard deviation | 4.97 | 0.11 ± 0.14 | 0.82 |
| Way overestimated standard deviation | 7.31 | 0.13 ± 0.33 | 0.39 |
Thanks for looking into this.
My thought was that even with a "wrong" transformation, the sampler could zoom back out to the original prior. But yes, the transform should ideally be built a bit broader than the expected posterior, so that nested sampling is actually zooming in (which is efficient) and does not need to zoom out and navigate the difficult funnel. That is, the modified likelihood should rise monotonically to the peak, like the original likelihood does.
The current `compute_quantile_intervals_refined` function doesn't do a good job of that.
In addition, there is the problem that the parameter space is currently cut off by the zooming, which can lose posterior mass. Probably the transform should reserve 1% for the original full prior space and 99% for the zoomed space, or something like that.
> Thanks for looking into this.
>
> My thought was that even with a "wrong" transformation, the sampler could zoom back out to the original prior. But yes, the transform should ideally be built a bit broader than the expected posterior, so that nested sampling is actually zooming in (which is efficient) and does not need to zoom out and navigate the difficult funnel. That is, the modified likelihood should rise monotonically to the peak, like the original likelihood does.
I agree with these considerations, but I do not think that is the problem with the current implementation: inference is slow regardless of the breadth of the transform, since the `aux_likelihood` always peaks at (its untransformed peak, $t=1$), and the prior volume compression needed to reach that region is the same as in the original problem.
> The current `compute_quantile_intervals_refined` function doesn't do a good job of that.
>
> In addition, there is the problem that the parameter space is currently cut off by the zooming, which can lose posterior mass. Probably the transform should reserve 1% for the original full prior space and 99% for the zoomed space, or something like that.
Indeed, in the implementation I'm drafting (sketched in the figure in my previous comment), some space is reserved for the original prior - I think even 50% is fine, since it should only take $O(n_{\text{live}} \log 2)$ iterations to traverse, which is not that many.
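For a concrete number: each nested-sampling iteration removes roughly $1/n_{\text{live}}$ of the remaining log prior volume, so traversing a factor of two in volume takes about $n_{\text{live}} \ln 2 \approx 139$ iterations with 200 live points.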
I am also a bit worried about the rectangles, as they introduce many corners, and therefore the modified likelihood contours become non-elliptical and highly inefficient to sample. Would applying your curve in a spherically symmetric way be feasible? Applying an affine transformation could account for the scale in each variable -- I think the code for this is already there for the Student-t case.
I agree that corners are to be avoided, and indeed I iterated towards a smoother parameterization: with the interpolator I'm using now in the PR, the derivative of the deformation map (whose log is shown as a dashed line) is constrained to be smooth. I'm thinking of implementing a refinement algorithm that starts from the full guess CDF sampled at all the posterior points and (greedily?) removes sampling points until the transform is "smooth enough".
I don't quite understand the points about symmetry and the multivariate Student-t distribution. The approach I'm proposing is essentially non-parametric: the marginal posterior guesses are allowed to be skewed or even multimodal - this seems like a good feature to have, as long as the transform remains smooth enough, right?
Summary
I have been experimenting with warm starting on a toy problem, and I found some strange behaviour. The sampling seems to reach the correct region quickly, as expected, but then it takes a long time to converge to the right value of $Z$. Also, the estimate of $Z$ is sometimes biased (the true value lies several sigmas from the estimate).
Description
The toy problem is an $n$-dimensional Gaussian likelihood with mean $\mu = 0.5$ on every axis and a small width $\sigma \sim 10^{-2}$. The prior transform is the identity on the unit cube. The evidence is then expected to be $Z \approx 1$ (to a very good approximation).
The point I'm making here also shows in 1D, but the script is versatile: it can run in higher dimensions if desired. The run times are reported for a 3D case, the trace plot is in 1D.
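For reference, a minimal sketch of the toy setup (my own stripped-down version; the full script is referenced further down, and the UltraNest calls in the comment are only indicative):

```python
import numpy as np

ndim, mu, sigma = 3, 0.5, 1e-2
lognorm = -ndim * np.log(sigma * np.sqrt(2 * np.pi))

def loglike(theta):
    # n-dimensional Gaussian centred at 0.5 with width sigma on each axis
    return lognorm - 0.5 * np.sum(((theta - mu) / sigma) ** 2)

def prior_transform(u):
    # identity on the unit cube, so the evidence is Z ~= 1 (log Z ~= 0)
    return u

# sampler = ultranest.ReactiveNestedSampler([f"x{i}" for i in range(ndim)],
#                                           loglike, transform=prior_transform)
# result = sampler.run(min_num_live_points=200, frac_remain=1e-3)
```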
I am comparing a regular NS run to runs done with auxiliary likelihoods and priors obtained with `get_auxiliary_contbox_parameterization`, with the contours obtained from the Gaussian "guesses" described below.

In the language of the SuperNest paper by Petrosyan and Handley, the KL divergence between the original prior and the posterior is (in the $\sigma \ll 1$ approximation) $$\mathcal{D}_{\pi}(\mathcal{P}) \approx -\frac{1}{2} (1+\log(2\pi)) - \log \sigma$$ per dimension, which comes out to about 3.2 nats for $\sigma=10^{-2}$. With the guesses, on the other hand, we are going from a Gaussian modified prior with width $k \sigma$ to a Gaussian posterior with width $\sigma$, therefore (the same result as here, but in nats) $$\mathcal{D}_{\tilde{\pi}}(\mathcal{P}) = \log k + \frac{1}{2} \left( \frac{1}{k^{2}} - 1 \right)$$
The examples I'm considering are $k = [0.5, 1, 2]$, with corresponding distances $[0.81, 0, 0.32]$ nats. The prior being equal to the posterior is a degenerate case, of course, but this still indicates that we should expect a good speed up!
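The numbers can be checked directly from the two expressions above:

```python
import numpy as np

sigma = 1e-2

# KL divergence from the uniform unit prior to the Gaussian posterior, per dimension
d_prior = -0.5 * (1 + np.log(2 * np.pi)) - np.log(sigma)
print(f"uniform prior -> posterior: {d_prior:.2f} nats")          # ~3.19

# KL divergence from a Gaussian guess prior of width k*sigma to the posterior of width sigma
for k in (0.5, 1.0, 2.0):
    d_guess = np.log(k) + 0.5 * (1 / k**2 - 1)
    print(f"k = {k}: {d_guess:.2f} nats")                          # ~0.81, 0.00, 0.32
```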
Instead, when I run with the auxiliary sampler, the time performance is sometimes worse.
Also, although the evidence errors are indeed smaller (as they should be), in the case of a too-thin prior the evidence is underestimated (and the error is not correctly estimated).
The script is as shown below.
This is what happens when `frac_remain` is set to a low number ($10^{-3}$) in all cases; if it is set higher ($0.5$), things are closer to the expectations:

However, this does not clear up the issue: why does the sampler "get stuck" in the same region, making very slow progress on the last contributions to the integral? Here is a trace plot for the same problem with `frac_remain=1e-3`, but in 1D and in the correctly estimated standard deviation case. What is going on? Why are the points getting more spread out?