JohannesBuchner / UltraNest

Fit and compare complex models reliably and rapidly. Advanced nested sampling.
https://johannesbuchner.github.io/UltraNest/

Efficiency warning when fitting a dataset with a number of duplicates #8

Closed pboorm closed 3 years ago

pboorm commented 4 years ago

Description

I am trying to use bootstrapping with UltraNest to test the effect of outliers on straight-line fits to some data. To do this, the code resamples the original dataset with replacement N times and fits each resample with UltraNest. However, if a large enough fraction of the resampled dataset consists of repeats, the fit doesn't finish, and I get the following warning about efficiency:

UserWarning: Sampling from region seems inefficient. You can try increasing nlive, frac_remain, dlogz, dKL, decrease min_ess). [0/40 accepted, it=2500] warnings.warn("Sampling from region seems inefficient. You can try increasing nlive, frac_remain, dlogz, dKL, decrease min_ess). [%d/%d accepted, it=%d]" % (accepted.sum(), ndraw, nit))

I've tried altering the arguments mentioned in the warning, as well as Lepsilon, but none of the values I've tried fixed the issue. I've also tried narrowing the initial priors on the slope and offset, but the allowed range had to be made so small that it no longer included the maximum-posterior parameters derived from other fits to the data without duplicates.

Capping the maximum number of duplicates allowed in the resampled dataset worked, but this might bias the interpretation of the bootstrap.
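For context on why repeats are unavoidable here: a bootstrap resample of n rows drawn with replacement keeps only about 63% of them unique on average. A minimal numpy sketch (the seed is a hypothetical choice; n_data matches the dataset size in the output below):

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed, for reproducibility
n_data = 26                     # same size as the dataset in the pasted output

# bootstrap resample: draw n_data row indices with replacement
idx = rng.integers(0, n_data, size=n_data)
n_duplicates = n_data - len(np.unique(idx))

# on average only ~63% of rows appear in a resample (1 - (1 - 1/n)^n),
# so roughly a third of the likelihood terms are exact copies of each other
print(n_duplicates, "duplicates")
```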

What I Did

(Here, df is a pandas DataFrame.)

import numpy as np
import scipy.stats
import ultranest

def prior_transform(cube):
    params = cube.copy()
    lo = -10.
    hi = +10.
    params[0] = cube[0] * (hi - lo) + lo
    lo = -50.
    hi = +50.
    params[1] = cube[1] * (hi - lo) + lo
    lo = np.log10(0.001)
    hi = np.log10(10)
    params[2] = 10**(cube[2] * (hi - lo) + lo)
    return params

def log_likelihood(params):
    slope, offset, scatter = params
    y_expected = samples[:,0] * slope + offset
    probs_samples = scipy.stats.norm(y_expected, scatter).pdf(samples[:,1])
    probs_objects = probs_samples.mean(axis=1)
    assert len(probs_objects) == n_data
    loglike = np.log(probs_objects + 1e-100).sum()
    return loglike

n_data = len(df)
print(str(n_data) + " sources")
bs_df = df.sample(n = n_data, replace = True).reset_index(drop = True)
duplicates = len(bs_df) - len(bs_df.drop_duplicates())

print(str(duplicates) + " duplicates")
n_data = len(bs_df)

## this generates a sample of points for every datapoint in bs_df, given xPar and yPar
samples = get_samples(bs_df, xPar, yPar)

sampler = ultranest.ReactiveNestedSampler(parameters, log_likelihood, prior_transform, log_dir = saveLoc)
result = sampler.run(Lepsilon = 0.5)

Output is as follows:
26 sources
INFO:ultranest:PointStore: have 0 items
INFO:ultranest:Sampling 400 live points from prior ...
14 duplicates
Creating directory for new run
DEBUG:ultranest:minimal_widths_sequence: [(-inf, 400.0), (inf, 400.0)]
Z=-74.9(0.34%) | Like=-67.52..-60.34 [-67.5999..-64.7221] | it/evals=2347/6383 eff=39.2278% N=400 0    
DEBUG:ultranest:clustering found some stray points [need_accept=False] (array([1, 2]), array([399,   1]))
Z=-74.0(0.81%) | Like=-66.48..-60.34 [-67.5999..-64.7221] | it/evals=2429/6772 eff=38.1199% N=400 
DEBUG:ultranest:clustering found some stray points [need_accept=False] (array([1, 2]), array([399,   1]))
Z=-70.3(26.93%) | Like=-62.48..-57.80 [-62.4761..-62.4759]*| it/evals=3157/10117 eff=32.4895% N=400 
DEBUG:ultranest:clustering found some stray points [need_accept=False] (array([1, 2]), array([399,   1]))
Z=-69.3(70.66%) | Like=-61.03..-56.61 [-61.0316..-61.0316]*| it/evals=4046/24286 eff=16.9388% N=400 
DEBUG:ultranest:clustering found some stray points [need_accept=False] (array([1, 2, 3]), array([125, 274,   1]))
Z=-69.3(72.70%) | Like=-60.93..-56.61 [-60.9329..-60.9328]*| it/evals=4129/27378 eff=15.3051% N=400 
DEBUG:ultranest:clustering found some stray points [need_accept=False] (array([1, 2, 3]), array([ 95, 304,   1]))
Z=-69.1(87.03%) | Like=-59.17..-53.94 [-59.1669..-59.0719]*| it/evals=5321/212271 eff=2.5114% N=400 
UserWarning: Sampling from region seems inefficient. You can try increasing nlive, frac_remain, dlogz, dKL, decrease min_ess). [0/40 accepted, it=2500]
  warnings.warn("Sampling from region seems inefficient. You can try increasing nlive, frac_remain, dlogz, dKL, decrease min_ess). [%d/%d accepted, it=%d]" % (accepted.sum(), ndraw, nit))
JohannesBuchner commented 4 years ago

Could you provide a data set which reproduces the problem, so that the example is self-contained?

JohannesBuchner commented 4 years ago

The warning is not problematic, it just lets you know that the problem is difficult. Fitting a line with relatively few data points is actually a really hard problem to do correctly, as the distribution tails become important. UltraNest tries hard to do it correctly.

You can also try to use a "step sampler", which can overcome heavy tails and solve high-dimensional problems.

sampler.stepsampler = ultranest.stepsampler.RegionSliceSampler(nsteps=100, adaptive_nsteps='move-distance')

This is mentioned in the high-dim tutorial https://johannesbuchner.github.io/UltraNest/example-sine-highd.html
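For intuition on what a step sampler buys you: instead of rejection-sampling from the whole region, each new live point is produced by a short chain of slice-sampling steps, which copes better with heavy tails. A minimal one-dimensional slice sampler sketch (illustrative only, not UltraNest's implementation):

```python
import numpy as np

def slice_sample_1d(logp, x0, n_samples, w=1.0, seed=0):
    """One-dimensional slice sampler (stepping-out plus shrinkage)."""
    rng = np.random.default_rng(seed)
    x = x0
    out = []
    for _ in range(n_samples):
        # draw a slice height under the density at the current point
        logy = logp(x) + np.log(1.0 - rng.uniform())
        # place an initial bracket of width w around x, then step out
        lo = x - w * rng.uniform()
        hi = lo + w
        while logp(lo) > logy:
            lo -= w
        while logp(hi) > logy:
            hi += w
        # sample uniformly from the bracket, shrinking it on rejection
        while True:
            xp = rng.uniform(lo, hi)
            if logp(xp) > logy:
                x = xp
                break
            if xp < x:
                lo = xp
            else:
                hi = xp
        out.append(x)
    return np.array(out)

# standard normal target: the draws should have mean ~0 and std ~1
draws = slice_sample_1d(lambda t: -0.5 * t * t, 0.0, 4000)
print(draws.mean(), draws.std())
```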

pboorm commented 4 years ago

Thanks a lot for the suggestion! Setting nsteps to a fixed value (instead of using the adaptive_nsteps method) has enabled the difficult fits to finish without an efficiency warning. Linked to this, I have a couple of follow-up queries:

JohannesBuchner commented 4 years ago

If you use adaptive_nsteps='move-distance', the number of steps is varied to adjust to the problem difficulty; nsteps is only the starting value. It may tune itself to very small nsteps values, speeding things up drastically. The initial nsteps should be reasonably high so there are no mistakes in the first few iterations; 100-400 is what I usually use.
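A toy sketch of the kind of adaptation rule being described (an assumption for illustration only; UltraNest's actual 'move-distance' heuristic lives in ultranest.stepsampler and is more refined):

```python
def adapt_nsteps(nsteps, move_distance, region_scale, factor=2, nsteps_max=1000):
    """Toy move-distance rule: grow nsteps when the chain barely moved,
    shrink it when the chain crossed the region comfortably."""
    if move_distance < region_scale:
        return min(nsteps * factor, nsteps_max)
    return max(nsteps // factor, 1)

# starting high (e.g. 100) and letting the rule tune downward on an easy problem
nsteps = 100
for _ in range(5):
    nsteps = adapt_nsteps(nsteps, move_distance=2.0, region_scale=1.0)
print(nsteps)  # -> 3
```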

pboorm commented 4 years ago

Thanks! I've been using the step sampler a lot more, which has helped in the majority of cases.

Attached is a minimal working example with a dataset and a Jupyter notebook to read and fit the data with UltraNest (without and with the step sampler). This particular example dataset does not have a very strong correlation to start with, and likely even less so after bootstrapping with repeats. Would it be detrimental to the final result to set a finite max_ncalls prior to running the sampler?

ultranest_example.zip

JohannesBuchner commented 4 years ago

That should be fine, but you may get only one effective sample (and thus collapsing uncertainties) if it does not reach convergence.
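The "one effective sample" symptom can be made concrete with the Kish effective sample size of the posterior weights (the same quantity that the min_ess argument in the warning thresholds); a minimal numpy sketch:

```python
import numpy as np

def effective_sample_size(logw):
    """Kish effective sample size, 1 / sum(w_i^2), from unnormalized log-weights."""
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)

# converged run: weights roughly even -> ESS near the number of points
ess_even = effective_sample_size(np.zeros(1000))
# truncated run: a few points dominate the weights -> ESS collapses toward 1,
# and the reported uncertainties collapse with it
ess_skewed = effective_sample_size(np.linspace(0.0, -200.0, 1000))
print(ess_even, ess_skewed)
```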

JohannesBuchner commented 4 years ago

I would try frac_remain=0.5

JohannesBuchner commented 3 years ago

I am closing this issue, but please reopen if you still have problems with the latest version. I improved the warning to be more helpful.