Very high-dimension sampling with decreasing number of live points

gdesvignes commented 2 years ago

UltraNest version: 3.3.3
Python version: 3.10.2
Operating System: CentOS 7.9.2009

Description

I'm trying to model a dataset of a few thousand points and I wrote a likelihood function for GPU in CUDA that uses single precision float. The model has 211 parameters so I'm using a step sampler as shown below with over 2000 live points. Using less live points make the code apparently detect lots of cluster whereas the posterior should be mono-modal. Throughout the run, the number of live points slowly decreases as a large number of plateaux are supposedly found. Until after several hours and a well advanced sampling, only ~350 live points remain, clusters are detected and the code eventually crashes.

Is there a way to force the number of live points to remain fix? Or am I missing something else? Any feedback would be appreciated.

What I Did

sampler = ReactiveNestedSampler(paramnames, model.LogLikelihood_gpu, transform=model.Prior, vectorized=True, log_dir=".", num_test_samples=0, ndraw_min=512)
sampler.stepsampler = stepsampler.RegionSliceSampler(nsteps=200, adaptive_nsteps='move-distance')
sampler.run(min_num_live_points=2048, viz_callback=False)

Output:

[ultranest] Found a lot of clusters: 79 (14 with >1 members)4.1250] | it/evals=306846/450144199 eff=0.0678% N=388
[ultranest] Found a lot of clusters: 108 (19 with >1 members).1250] | it/evals=306933/450281788 eff=0.0678% N=388
[ultranest] Found a lot of clusters: 41 (3 with >1 members)02.9062] | it/evals=307125/450578099 eff=0.0678% N=384
[ultranest] Found a lot of clusters: 47 (4 with >1 members)02.9062] | it/evals=307253/450776283 eff=0.0678% N=383
[ultranest] Found a lot of clusters: 41 (2 with >1 members)63.1016] | it/evals=307585/451267056 eff=0.0678% N=377
[ultranest] Found a lot of clusters: 44 (4 with >1 members)86.0547] | it/evals=307948/451774846 eff=0.0678% N=371
[ultranest] Found a lot of clusters: 68 (11 with >1 members)6.0547] | it/evals=308031/451883729 eff=0.0678% N=371
[ultranest] Found a lot of clusters: 103 (12 with >1 members).0391] | it/evals=308114/451993245 eff=0.0678% N=371
[ultranest] Found a lot of clusters: 134 (27 with >1 members).0391] | it/evals=308197/452095110 eff=0.0678% N=371
[ultranest] Found a lot of clusters: 43 (1 with >1 members)76.6094] | it/evals=308759/452751173 eff=0.0678% N=356
[ultranest] Found a lot of clusters: 37 (3 with >1 members)49.8672] | it/evals=308930/452932084 eff=0.0678% N=354
[ultranest] Found a lot of clusters: 69 (11 with >1 members)9.8672] | it/evals=309009/453014942 eff=0.0678% N=354
Traceback (most recent call last):82369.89 [81876.8125..81949.8672] | it/evals=309042/453049606 eff=0.0678% N=353
  File "/u/gdesvign/precess_cuda.py", line 447, in <module>
    ###sampler.run(min_num_live_points=nlive, viz_callback=False)
  File "/u/gdesvign/conda-envs/pulsar/lib/python3.10/site-packages/ultranest/integrator.py", line 2170, in run
    for result in self.run_iter(
  File "/u/gdesvign/conda-envs/pulsar/lib/python3.10/site-packages/ultranest/integrator.py", line 2382, in run_iter
    region_fresh = self._update_region(
  File "/u/gdesvign/conda-envs/pulsar/lib/python3.10/site-packages/ultranest/integrator.py", line 1827, in _update_region
    _update_region_bootstrap(self.region, nbootstraps, minvol, self.comm if self.use_mpi else None, self.mpi_size)
  File "/u/gdesvign/conda-envs/pulsar/lib/python3.10/site-packages/ultranest/integrator.py", line 390, in _update_region_bootstrap
    raise e
  File "/u/gdesvign/conda-envs/pulsar/lib/python3.10/site-packages/ultranest/integrator.py", line 367, in _update_region_bootstrap
    r, f = region.compute_enlargement(
  File "ultranest/mlfriends.pyx", line 863, in ultranest.mlfriends.MLFriends.compute_enlargement
numpy.linalg.LinAlgError: Distances are not positive

JohannesBuchner commented 2 years ago

This is the behaviour when the live points slowly become linearly dependent and collapse onto a sub-space.

This means you need more steps in the step sampler.

You may also want to try different generate_direction methods, such as unit directions.

gdesvignes commented 2 years ago

Thanks for the quick feedback. After some more testing, I found that the main issue was the likelihood, previously computed in single precision only. Then I doubled the number of steps, but it has become very slow and did not yet converge after a few days although it seems very close to it. In addition, I get a lot of "wandered out of L constraint; resetting". I would need to do some proper benchmarking, and maybe compare against PolyChord, but it seems the code spends a lot of time creating the live points.

Again, thanks for your input and this awesome package.

JohannesBuchner commented 2 years ago

Yes.

Setting region_class=ultranest.mlfriends.RobustEllipsoidRegion should help a lot. This makes only ellipsoids instead of also constructing MLFriends regions (which only help with <20d, and are computationally costly in high-d).

"wandered out of L constraint; resetting" means that a walker is, after a new iteration has increased the likelihood constraint, now outside, and has to be restarted. This only happens when you are running distributed with MPI, which maintains multiple walkers.

Recently, I completed implemention of a much faster, vectorized step sampler. I verified that it works correctly on a 100d gaussian. Perhaps you can try it as well, and give me feedback.

Here is how to run it. First import a few things:

from ultranest import ReactiveNestedSampler
from ultranest.mlfriends import RobustEllipsoidRegion
from ultranest.popstepsampler import PopulationSliceSampler, generate_cube_oriented_direction

Then make the sampler as usual. You need a vectorized likelihood to see speed-ups:

sampler = ReactiveNestedSampler(paramnames, loglike, transform=transform, vectorized=True)

Here is the definition of the step sampler:

sampler.stepsampler = PopulationSliceSampler(
    popsize=popsize, nsteps=nsteps, 
    generate_direction=generate_cube_oriented_direction, log=verbose
)

Choose the population of walkers to maintain, popsize. Perhaps try 40 to start. Choose the number of steps a walker makes. Perhaps start with 3 * ndim.

Then send it off. Here are the arguments I used:

results = sampler.run(
    frac_remain=0.01, update_interval_volume_fraction=0.01, 
    max_num_improvement_loops=0, min_num_live_points=400, 
    viz_callback=None, region_class=RobustEllipsoidRegion
)

A low update_interval_volume_fraction means the region construction only happens very rarely. Probably you want min_num_live_points to be at least a few times the dimensionality, so your 2000 sounds reasonable.

JohannesBuchner commented 2 years ago

The live points decreasing is caused by plateaus, i.e., multiple live points having the same likelihood value; they have to be removed together, and are only replaced by one new live point. See here.

Please investigate why multiple points cause the same likelihood value. If they have the exact same parameter values, the step sampler got stuck / did not make a significant move. If not, the likelihood is not well-defined.

Please reopen if this is still an issue.

JohannesBuchner / UltraNest

Very high-dimension sampling with decreasing number of live points #57

Description

What I Did