ejhigson / dyPolyChord

Super fast dynamic nested sampling with PolyChord (Python, C++ and Fortran likelihoods).
http://dypolychord.readthedocs.io/en/latest/
MIT License

Crashes in process_initial_run #6

Closed: tilmantroester closed this issue 5 years ago

tilmantroester commented 5 years ago

I've been working on writing a CosmoSIS interface for dyPolyChord, but I've been running into problems. I managed to distill it down to the toy problem of a 2D isotropic Gaussian (based on the example in the docs):

import dyPolyChord.python_likelihoods as likelihoods
import dyPolyChord.python_priors as priors
import dyPolyChord.pypolychord_utils
import dyPolyChord

ndim = 2
nderived = 0
likelihood = likelihoods.Gaussian(sigma=1.0, nderived=nderived)
prior = priors.Gaussian(sigma=2.0)

# Make a callable for running PolyChord
my_callable = dyPolyChord.pypolychord_utils.RunPyPolyChord(
    likelihood, prior, ndim, nderived=nderived)

# Specify sampler settings (see run_dynamic_ns.py documentation for more details)
dynamic_goal = 1.0  # whether to maximise parameter estimation or evidence accuracy
ninit = 10          # number of live points to use in the initial exploratory run
nlive_const = 50    # total computational budget matches standard nested sampling with nlive_const live points
settings_dict = {'file_root': 'gaussian',
                 'feedback': 1}

# Run dyPolyChord
dyPolyChord.run_dypolychord(my_callable, dynamic_goal, settings_dict,
                            ninit=ninit, nlive_const=nlive_const)

The error, which occurs most of the time but somewhat randomly, is:

Traceback (most recent call last):
  File "mwe.py", line 29, in <module>
    ninit=ninit, nlive_const=nlive_const)
  File "/Users/yooken/Codes/miniconda/envs/test_cosmosis/lib/python3.7/site-packages/nestcheck-0.2.0-py3.7.egg/nestcheck/io_utils.py", line 28, in wrapper
    return func(*args, **kwargs)
  File "/Users/yooken/Codes/dypolychord/dyPolyChord/run_dynamic_ns.py", line 173, in run_dypolychord
    final_seed=final_seed)
  File "/Users/yooken/Codes/dypolychord/dyPolyChord/run_dynamic_ns.py", line 275, in process_initial_run
    resume_steps < dyn_info['peak_start_ind'])[0][-1]]
IndexError: index -1 is out of bounds for axis 0 with size 0

Sometimes it finishes but gives warnings like

Warning, unable to proceed after      7: failed spawn events
ejhigson commented 5 years ago

@tilmantroester thank you for pointing this out!

This error is caused by trying to start the dynamic nested sampling run by resuming the initial exploratory run whenever the optimal start point for the dynamic run is not at the whole prior. However, we do not save a resume file at every step, so an error is thrown if it tries to resume from before the first resume file was saved. I fixed this in 0dd8636c9865632f1d11527193eda3254685d5ee: the dynamic nested sampling now starts by sampling the whole prior when there is no available resume file close enough to the start.
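The fix is roughly equivalent to the following sketch (the function name and arguments here are illustrative, not the actual code in run_dynamic_ns.py; resume_steps stands for the steps at which resume files were written and peak_start_ind for the optimal start point of the dynamic run):

import numpy as np

def resume_step_or_none(resume_steps, peak_start_ind):
    """Return the last saved resume step strictly before the dynamic run's
    optimal start point, or None if no resume file was saved early enough
    (in which case the dynamic run starts by sampling the whole prior)."""
    resume_steps = np.asarray(resume_steps)
    valid = np.where(resume_steps < peak_start_ind)[0]
    if valid.size == 0:
        return None  # no usable resume file: fall back to the whole prior
    return int(resume_steps[valid[-1]])

# Resume files saved every 100 steps; a peak at step 50 has no usable
# resume file, so we fall back to sampling the whole prior.
print(resume_step_or_none([100, 200, 300], 50))   # None
print(resume_step_or_none([100, 200, 300], 250))  # 200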

Let me know if any problems remain after this fix, and if not I will close the issue.

tilmantroester commented 5 years ago

Apologies for the delay in getting back to this. It was working on a simple toy model, but I've now run it at scale and got the same error message again:

Traceback (most recent call last):
  File "/home/ttroester/Codes/dyPolyChord/dyPolyChord/run_dynamic_ns.py", line 193, in run_dypolychord
    dynamic_goal=dynamic_goal)
  File "/home/ttroester/Codes/dyPolyChord/dyPolyChord/output_processing.py", line 114, in process_dypolychord_run
    run = combine_resumed_dyn_run(init, dyn, dyn_info['resume_ndead'])
  File "/home/ttroester/Codes/dyPolyChord/dyPolyChord/output_processing.py", line 203, in combine_resumed_dyn_run
    nestcheck.ns_run_utils.get_run_threads(init),
  File "/home/ttroester/Codes/nestcheck/nestcheck/ns_run_utils.py", line 152, in get_run_threads
    samples = array_given_run(ns_run)
  File "/home/ttroester/Codes/nestcheck/nestcheck/ns_run_utils.py", line 65, in array_given_run
    samples[-1, 2] = -1  # nlive drops to zero after final point
IndexError: index -1 is out of bounds for axis 0 with size 0
ejhigson commented 5 years ago

@tilmantroester this actually looks like a different problem to me. I think what is happening is that one of the initial ("init") or dynamic ("dyn") runs contains zero samples, which throws an error when they are split into threads here:

  File "/home/ttroester/Codes/dyPolyChord/dyPolyChord/output_processing.py", line 203, in combine_resumed_dyn_run
    nestcheck.ns_run_utils.get_run_threads(init),

Please can you provide an example I can use to replicate the error?

Otherwise, I suggest adding print statements showing properties of "init" and "dyn" before line 200 of output_processing.py, so you can check what it is about these runs that prevents them from being split into threads.
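For example, something along these lines (a minimal sketch; the key names assume nestcheck's standard run dict format, so adjust if yours differs):

import numpy as np

# Possible diagnostics to add before line 200 of output_processing.py,
# where init and dyn are in scope. The keys 'logl' and 'thread_labels'
# assume nestcheck's run dict format.
for name, run in [('init', init), ('dyn', dyn)]:
    print(name, 'keys:', sorted(run.keys()))
    print(name, 'number of samples:', run['logl'].shape[0])
    print(name, 'unique thread labels:', np.unique(run['thread_labels']))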

tilmantroester commented 5 years ago

This happens in a fairly complex pipeline, so producing a simple example that replicates it is going to be difficult. Since I run this on a large number of MPI workers, each of which only gets one CPU, splitting into threads probably doesn't do much at best and might upset the scheduler at worst. Is it possible to disable running on multiple threads?

Are there constraints on n_init, e.g., that it needs to be larger than the number of dimensions of the parameter space?

ejhigson commented 5 years ago

"threads" in get_run_threads refers to splitting up the data into single live point runs ("threads") after PolyChord has finished sampling (not to multi-threading of the computer process).

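To illustrate the idea, here is a rough sketch of that decomposition (assuming nestcheck's run dict format, where 'thread_labels' records which live point produced each dead point; this is not the actual get_run_threads implementation):

import numpy as np

def split_into_threads(run):
    """Group a run's dead points by the live point that produced them.
    This is post-processing of PolyChord's output, not multi-threading."""
    threads = []
    for label in np.unique(run['thread_labels']):
        mask = run['thread_labels'] == label
        threads.append({'logl': run['logl'][mask],
                        'theta': run['theta'][mask]})
    return threads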
The new error is occurring after both the initial and dynamic runs have finished, and it looks to me like either the initial or the dynamic run has no samples in it. I expect it is the dynamic run (if the initial one had no samples then I think it would have thrown an error earlier). If so, the problem is that all your allotted samples are being used on the initial run, so you need to make the total number of samples available to the dynamic run bigger, i.e. increase nlive_const or max_ndead (whichever you use) relative to n_init. What values are you using for these? There is not much I can say without being able to replicate the issue.

In general, having n_init greater than the number of dimensions is a good idea, although if my theory is correct you will also need to increase nlive_const or max_ndead by a larger fraction to avoid this error.
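For illustration, with the settings from the earlier toy example the change would look something like this (the numbers are only an example, not tuned recommendations):

# Illustrative values only: keep nlive_const comfortably larger than ninit
# so a non-zero sample budget is left over for the dynamic run.
ninit = 50         # at least a few times the number of dimensions
nlive_const = 500  # total budget; must leave room beyond the initial run

dyPolyChord.run_dypolychord(my_callable, dynamic_goal, settings_dict,
                            ninit=ninit, nlive_const=nlive_const)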

tilmantroester commented 5 years ago

Increasing n_init and nlive_const indeed seems to prevent the error from occurring.

ejhigson commented 5 years ago

Great! I will close this issue. If this occurs again, try printing out how many points there are in the initial and dynamic runs, then increase n_init and nlive_const to ensure both counts are more than zero.