JohannesBuchner / UltraNest

Fit and compare complex models reliably and rapidly. Advanced nested sampling.
https://johannesbuchner.github.io/UltraNest/

[Question] Popstepsampler: Number of walkers and live points #97

Closed njzifjoiez closed 1 year ago

njzifjoiez commented 1 year ago

Description

First, thanks a lot for the amazing piece of code you have put together. I am currently playing with the PopulationSliceSampler, as I have a vectorized likelihood written in JAX, and I would like to understand the parameters of the sampler a bit better. One issue with the JAX likelihood is that I use the JIT, so the number of likelihoods I can compute at a time has to be fixed; otherwise, the code gets recompiled for every new batch shape.

If I fix popsize to that number, the sampler keeps requesting fewer and fewer likelihood evaluations as points get rejected, and I don't know whether there is a way to tweak that. For now, my workaround is to make popsize several times larger than the batch size (say, ten times): even though I still have to pad the parameter sets, I waste fewer evaluations than if popsize were fixed to the batch size.
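For illustration, my padding wrapper is roughly along these lines (a simplified sketch; my_model_loglike and BATCH are placeholders for my actual jitted likelihood and its compiled batch size):

    import jax
    import jax.numpy as jnp
    import numpy as np

    BATCH = 1024  # fixed batch size the jitted function is compiled for (placeholder)

    @jax.jit
    def my_model_loglike(theta):
        # placeholder jitted likelihood; theta has shape (BATCH, ndim)
        return -0.5 * jnp.sum((theta - 0.5) ** 2, axis=1)

    def loglike(theta):
        # pad the incoming batch to a multiple of BATCH so the jitted function
        # always sees the same shape and never recompiles, then loop over chunks
        n, ndim = theta.shape
        pad = (-n) % BATCH
        padded = np.concatenate([theta, np.repeat(theta[-1:], pad, axis=0)], axis=0)
        chunks = padded.reshape(-1, BATCH, ndim)
        out = np.concatenate([np.asarray(my_model_loglike(c)) for c in chunks])
        return out[:n]  # drop the padded evaluations

Each call wastes at most BATCH - 1 padded evaluations, so the larger the requested batch, the smaller the relative waste.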

The number of likelihoods I compute at the same time is of the order of the number of live points I need, so I am wondering how popsize should compare to the number of live points: should it be smaller, as proposed in issue #57, or does my approach make sense?

What I Did

Here is what I run for now, just for test purposes:

    sampler = ReactiveNestedSampler(list(paramnames), loglike, transform=lambda x: x,
                                    log_dir='Ultranest_slice', vectorized=True,
                                    num_test_samples=nlive)
    sampler.stepsampler = ultrapop.PopulationSliceSampler(
        popsize=10 * nlive, nsteps=2 * len(paramnames),
        generate_direction=ultrapop.generate_cube_oriented_direction)
    res = sampler.run(dlogz=0.5 + 0.1 * len(paramnames), min_num_live_points=nlive, max_iters=max_it)

where nlive is the number of likelihoods I compute at the same time.

JohannesBuchner commented 1 year ago

Sounds reasonable to me, I guess you could also play with a combination of padding and looping.

I think you should run a few tests with varying popsize and nsteps, because there is a subtlety with PopulationSliceSampler: since the number of evaluations needed to finish expanding to each side and to finish shrinking is unpredictable, the number of iterations until a sample has completed nsteps is out of sync across the population. The current implementation tries to keep track of that and return the samples in the order they would have been returned by a simple, single slice sampler. This is not quite the same, though, because the likelihood threshold evolves in the meantime. I think there are still some lingering issues that need to be debugged, and your help would be appreciated.

I think one could write another population step sampler, but much simpler, based on a Gaussian random walk Metropolis proposal. That case is much simpler, because each walker makes exactly one proposal per step, so all walkers stay in sync and the number of likelihood evaluations per step never exceeds popsize.
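To sketch the idea (just an illustration, not an actual implementation): every walker makes one symmetric Gaussian proposal per step, and a proposal is accepted whenever it stays inside the unit cube and above the current likelihood threshold.

    import numpy as np

    def population_random_walk_step(u, logl, loglike, Lmin, scale, rng=np.random):
        """One vectorized random-walk Metropolis step for a population of walkers.

        u: (popsize, ndim) positions in the unit cube; logl: (popsize,) their log-likelihoods;
        Lmin: current likelihood threshold; scale: proposal standard deviation."""
        v = u + rng.normal(0.0, scale, size=u.shape)   # one proposal per walker
        inside = np.all((v > 0) & (v < 1), axis=1)     # must stay inside the unit cube
        logl_new = np.full(len(u), -np.inf)
        logl_new[inside] = loglike(v[inside])          # at most popsize evaluations per step
        # the target above Lmin is flat, so the Metropolis ratio is 1 for any valid proposal
        accept = inside & (logl_new > Lmin)
        u[accept] = v[accept]
        logl[accept] = logl_new[accept]
        return u, logl, accept.mean()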

njzifjoiez commented 1 year ago

Yes, I also loop when the number of likelihoods to compute is larger than the fixed batch size.

For sure, I can help with debugging that. I planned to run different tests of popsize and nsteps to see which configuration works best. Are there quantities you would track to check that everything works as expected? I have to admit I don't think I fully understand how the sampler currently works.

I am interested in all these vectorized samplers for my current project and some future ones, to exploit computing power more efficiently. If I understand your idea correctly, it would not be a slice sampler anymore, right? Just a population of walkers doing a Gaussian random walk for a certain number of steps? Do you think it could be improved in higher dimensions by adding some kind of parallel tempering based on the number of steps of the walkers?

JohannesBuchner commented 1 year ago

I created a release (3.6.0), which includes PopulationRandomWalkSampler. Here is an example of how to use it: https://github.com/JohannesBuchner/UltraNest/blob/master/examples/test_popsampler.py#L43
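Roughly, the usage is analogous to your slice-sampler setup above (a sketch; check the linked example for the exact constructor arguments, e.g. the initial scale):

    import ultranest.popstepsampler as ultrapop
    from ultranest import ReactiveNestedSampler

    sampler = ReactiveNestedSampler(list(paramnames), loglike, transform=lambda x: x,
                                    log_dir='Ultranest_rwalk', vectorized=True,
                                    num_test_samples=nlive)
    # popsize proposals are made per step, so the likelihood batch size stays bounded
    sampler.stepsampler = ultrapop.PopulationRandomWalkSampler(
        popsize=nlive, nsteps=10 * len(paramnames),
        generate_direction=ultrapop.generate_random_direction,
        scale=0.1)  # initial proposal scale, adapted during the run
    res = sampler.run(min_num_live_points=nlive)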

Please test it. Here are some hints:

JohannesBuchner commented 1 year ago

To calibrate, try several runs with increasing nsteps. The ln(Z) should become stable at some value. Also observe the scale reported in the logfile (logfile=open('myfilename.log', 'w')).
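For instance, something along these lines (a sketch; each run gets its own log_dir and step-sampler logfile, and the argument names should be checked against the docs):

    for nsteps in [len(paramnames), 2 * len(paramnames), 4 * len(paramnames), 8 * len(paramnames)]:
        sampler = ReactiveNestedSampler(list(paramnames), loglike, transform=lambda x: x,
                                        log_dir='calib_nsteps%d' % nsteps, vectorized=True,
                                        num_test_samples=nlive)
        sampler.stepsampler = ultrapop.PopulationRandomWalkSampler(
            popsize=nlive, nsteps=nsteps,
            generate_direction=ultrapop.generate_random_direction,
            scale=0.1, logfile=open('stepsampler_nsteps%d.log' % nsteps, 'w'))
        result = sampler.run(min_num_live_points=nlive)
        # ln(Z) should stabilise once nsteps is large enough
        print(nsteps, result['logz'], result['logzerr'])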

JohannesBuchner commented 1 year ago

Documentation links for step samplers:

You will still need num_test_samples in the UltraNest initialisation.

This screen recording demonstrates 100,000 model evaluations per second (!) on a 50-d Rosenbrock function: https://asciinema.org/a/AUZnV4yf21ffddKGtoxdH2IgA

njzifjoiez commented 1 year ago

Thanks a lot for this new sampler! I am currently testing it, and it's blazing fast compared to the slice sampler!

lwelzel commented 1 year ago

I'm having a similar problem: I have a vectorized likelihood function with a ReactiveNestedSampler using the PopulationSliceSampler, and I need to keep the number of simultaneous likelihood evaluations strictly below (or at) a specific number so as not to use more GPU memory than is available. In my case that means batches of ~2048. Is there a setting that enforces this?

From what I understand, @njzifjoiez batches the calls inside the likelihood function and then returns the concatenated likelihoods. This seems robust, but it significantly reduces the average number of sampled points per second compared to a likelihood function that doesn't batch.*

    import torch

    def wrapper_log_liklihood(theta):
        # theta: (n, ndim) numpy array of parameter vectors from UltraNest
        theta = torch.tensor(theta.astype("float32"))
        results = torch.zeros(theta.shape[0])
        for i, t in enumerate(torch.split(theta, 2048)):  # split into batches of 2048 samples at a time
            results[int(i * 2048):int(i * 2048 + 2048)] = log_likelihood(t)
        return results.cpu().numpy().astype("float64")

To get around this and keep my likelihood function unbatched, I tried limiting the number of simultaneous function calls with the following settings:

- ReactiveNestedSampler: ndraw_max=2048
- PopulationSliceSampler: popsize=2048
- run() method: min_num_live_points=2048
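In code, roughly (a sketch; paramnames, prior_transform and the step-sampler arguments are placeholders):

    sampler = ReactiveNestedSampler(paramnames, wrapper_log_liklihood,
                                    transform=prior_transform, vectorized=True,
                                    ndraw_max=2048)          # cap draws per batch
    sampler.stepsampler = ultrapop.PopulationSliceSampler(
        popsize=2048, nsteps=2 * len(paramnames),
        generate_direction=ultrapop.generate_cube_oriented_direction)
    result = sampler.run(min_num_live_points=2048)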

The widen_before_initial_plateau_num_warn and widen_before_initial_plateau_num_max options are not implemented for this configuration, but the root-widening functions can still be called, see below. This calls the likelihood function with a larger number of simultaneously evaluated live points than I thought was possible with these settings.

    # ...
    [ultranest] Widening roots to 37568 live points (have 3163 already) ...  # (I set the required dlogz very low for this)
    [ultranest] Sampling 29924 live points from prior ...
    Traceback (most recent call last):
      File "/home/.../ultranest_fitting.py", line 242, in <module>
        sampler.run(
      File "/home/.../site-packages/ultranest/integrator.py", line 2316, in run
        for result in self.run_iter(
      File "/home/.../site-packages/ultranest/integrator.py", line 2752, in run_iter
        self._widen_roots_beyond_initial_plateau(self.min_num_live_points)
      File "/home/.../site-packages/ultranest/integrator.py", line 1401, in _widen_roots_beyond_initial_plateau
        self._widen_roots(nroots_needed)
      File "/home/.../site-packages/ultranest/integrator.py", line 1470, in _widen_roots
        active_logl = self.loglike(active_v)
      File "/home/.../ultranest_fitting.py", line 213, in wrapper_log_liklihood
        out = log_likelihood(theta)
      File "/home/.../site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
    # OOM error from GPU

Do I understand your previous reply correctly, @JohannesBuchner, that the Gaussian random walk sampler is (memory-)safer in this case, in that the number of simultaneous likelihood evaluations is always constant? Is there another way, besides batching, to split up the evaluations for the slice sampling so that the function calls always stay below a specific threshold?

* I am sadly not very familiar with nested sampling, but I can imagine that this is due to the higher CPU load between batches of evaluations, since the integrator has to deal with many more live points.

JohannesBuchner commented 1 year ago

Try avoiding a plateau altogether, i.e., never return exactly the same low value for different points.

see https://github.com/JohannesBuchner/UltraNest/issues/81#issuecomment-1571513539
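For example, if the likelihood currently clips to a constant floor, one way to break the ties is to make the floor weakly parameter-dependent (a sketch, not quoted from the linked issue; model_loglike is a placeholder):

    import numpy as np

    def loglike(params):
        # params: (n, ndim) array of parameter vectors
        logl = model_loglike(params)
        # instead of a constant floor (which creates a plateau of identical values),
        # return a very low value that still varies smoothly with the parameters,
        # so no two points share exactly the same likelihood
        floor = -1e100 * (1.0 + np.abs(params - 0.5).sum(axis=1))
        return np.where(np.isfinite(logl), logl, floor)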

JohannesBuchner commented 1 year ago

At the start of nested sampling, the live points are sampled from the prior directly, not via the PopulationSliceSampler.

JohannesBuchner commented 1 year ago

I am closing this for now, please reopen if there is more that should be done.

njzifjoiez commented 9 months ago

I am writing this here as it is connected to our discussion, but let me know what the best way to proceed would be. I have thought of a kind of slice sampler that calls the likelihood only on fixed-size batches, and I thought you might be interested in it. I have forked the code and implemented it in the branch fixed-size-vectorised-slice-sampler here: https://github.com/njzifjoiez/UltraNest

It is loosely inspired by the elliptical slice sampler presented here (https://ui.adsabs.harvard.edu/abs/2010arXiv1001.0175M/abstract), in particular in how the slice is shrunk. For a single point, the idea is to select a direction and define the slice as a segment of size 2*scale*jitter_scale aligned with that direction and centred on the initial point. Since we are on the unit cube, the slice is also limited by the bounds of the space. In 1D, the interval looks like [x - 2*scale*jitter_scale, x + 2*scale*jitter_scale].

To find the next point, it samples a new point x' uniformly on the slice; if x' fulfils the likelihood threshold, it is accepted; otherwise, the interval shrinks toward the initial point: if x' < x, the 1D interval becomes [x', x + 2*scale*jitter_scale], and if x' > x, it becomes [x - 2*scale*jitter_scale, x']. It keeps shrinking until it finds the next point (I add a limit to that search to avoid an infinite loop, but technically it can always fall back on the initial point). So the shrinking is similar to that of the elliptical slice sampler, which also shrinks toward the initial point, but on an ellipse instead of a straight line.
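In 1D, and ignoring the batching for a moment, the procedure is roughly this (a simplified sketch, not the actual branch code):

    import numpy as np

    def shrinking_slice_step(x, loglike, Lmin, half_width, max_tries=100, rng=np.random):
        # slice centred on the current point x, clipped to the unit interval
        lo, hi = max(0.0, x - half_width), min(1.0, x + half_width)
        for _ in range(max_tries):
            xp = rng.uniform(lo, hi)       # propose uniformly on the current interval
            if loglike(xp) > Lmin:
                return xp                  # above the threshold: accept
            if xp < x:
                lo = xp                    # rejected on the left: shrink toward x
            else:
                hi = xp                    # rejected on the right: shrink toward x
        return x                           # give up and fall back on the initial point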

The advantage of shrinking in that way is that we can precompute likelihood points that may remain valid for more than one step in advance. The idea is to sample several points on the slice at once and read them in order, as if they had been selected one after the other. Say we sample 2 points instead of one: we do the first step as before, and for the second step we check whether the second point lies in the new interval defined by the first step. If not, it is simply discarded; if it does, we just do another step of the sampler. This is equivalent to sampling new points uniformly, because restricting uniform samples on the initial interval to the new interval again gives uniform samples on the new interval.

Ideally, we don't want too many likelihood calls per point to move, because we may waste computing resources. So we start with one likelihood call per point, and as points finish moving, their calls are re-assigned to the points that haven't moved yet, to explore the rest of their slices faster.

Regarding the implementation, the sampler in popstepsampler.py is called PopulationEllipticalSliceSampler, and there is a new Cython function (update_vectorised_slice_sampler) that advances the sampler without the Python overhead. The name of the sampler is badly chosen, but it was to differentiate it from the one already implemented.

I began testing on problems defined in the examples in the code (Gaussian, multishell, eggbox, funnel, Rosenbrock) against the slice sampler and the already implemented population slice sampler (examples/test_PopEllSliceSampler.py). It looks quite promising, especially in terms of sampling speed. My preliminary results on low-dimensional problems (d <= 50) agree with the original slice sampler in terms of log Z, except for the Gaussian, where my sampler gives higher values, with a difference of about 5-10 at the tested dimensionalities. Usually, when it fails, the population slice sampler also fails, at the expense of far more running time and likelihood calls. I would like to run some tests of its robustness, but I am unsure what to look at. For now, I track log Z, the insertion-order U test, the run time and the number of likelihood calls, but maybe there are better-suited metrics.

Also, those test cases seem limited by the sampler speed rather than by the likelihood, so I am unsure how much would change in the likelihood-limited case; it would depend on the gain obtained from vectorisation, because it may need more likelihood calls than the normal slice sampler, for example.

JohannesBuchner commented 9 months ago

It would be best to open a pull request to discuss this. The name is indeed not a good fit, because it is a slice sampler (https://arxiv.org/abs/physics/0009028), not Elliptical Slice Sampling (which is, by the way, restricted to a Gaussian prior).

Maybe when you open it you can already answer the following two questions from a quick glance at the code:

1. I do not quite understand why you can sample the likelihood threshold (within nested sampling's likelihood-restricted prior sampling it is already fixed, and the target to sample is flat).

2. I think you are avoiding the stepping-out procedure, which gives a computational speed-up. But I am not sure whether detailed balance is preserved with a random slice width?