Is it possible to resume sampling from the last saved checkpoint of a stopped run?

ejhigson / dyPolyChord

Super fast dynamic nested sampling with PolyChord (Python, C++ and Fortran likelihoods).

http://dypolychord.readthedocs.io/en/latest/

MIT License


Issue #7

Closed ajshajib closed 5 years ago

ajshajib commented 5 years ago

For example, if I run the sampler on a cluster with limited walltime and cores per job and the stopping criteria can not be achieved within this limited walltime, can I call the sampler to resume sampling from the last saved .resume file?

ejhigson commented 5 years ago

Hi @ajshajib - it is possible for PolyChord to resume sampling runs (indeed dyPolyChord uses this feature). However, setting up dyPolyChord to pause and resume is a little more complicated, as dyPolyChord uses multiple steps: first doing an initial exploratory run, then analysing it and doing a second dynamic nested sampling run.

I never added the functionality to resume dyPolyChord but it should be possible to build as I think PolyChord runs with a varying number of live points can be resumed - @williamjameshandley is this correct? If you want to then please do implement it and make a pull request. The best option may be to only build functionality to resume after the dynamic run has started (so the process cannot be resumed until after it has finished the initial exploratory run and calculated the numbers of live points), as this will be simpler and should provide most of the benefits.

williamjameshandley commented 5 years ago

> I never added the functionality to resume dyPolyChord but it should be possible to build as I think PolyChord runs with a varying number of live points can be resumed - @williamjameshandley is this correct?

This is correct, since that's how dyPolyChord works (we added this functionality a while back for the dynamic nested sampling paper).

ajshajib commented 5 years ago

@ejhigson, thanks for agreeing to add this feature. Hopefully, it would benefit other users in the future as well.

@williamjameshandley, I have a quick question in the meantime. When I use the emcee sampler with MPI, I would take the number of walkers to be a multiple of the number of cores, so that roughly all the cores are working at any given time. What I understood from your 2015 paper is that this is not necessary for PolyChord, as the slave cores are continuously working to find new live points. Did I understand it right or can there be some efficiency loss in between integer values of nlive/ncore?

williamjameshandley commented 5 years ago

> @williamjameshandley, I have a quick question in the meantime. When I use the emcee sampler with MPI, I would take the number of walkers to be a multiple of the number of cores, so that roughly all the cores are working at any given time. What I understood from your 2015 paper is that this is not necessary for PolyChord, as the slave cores are continuously working to find new live points. Did I understand it right or can there be some efficiency loss in between integer values of nlive/ncore?

The master-slave parallelization is pretty efficient in this regard, so yes, you can use any number of cores/live points, but I would advise keeping nlive > ncores (see figure 5).
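As a quick guard in a job script, the advice above can be turned into a small check. This is only a sketch: `check_nlive` is a hypothetical helper, not part of PolyChord or dyPolyChord, and the core count is assumed to be supplied by the caller (for example from the scheduler's environment).

```python
import warnings


def check_nlive(nlive, ncores):
    """Return True if nlive > ncores; otherwise warn and return False.

    nlive does not need to be a multiple of ncores, it only needs to
    exceed it (hypothetical helper, for illustration only).
    """
    if nlive > ncores:
        return True
    warnings.warn(
        "nlive (%d) should be greater than ncores (%d)" % (nlive, ncores),
        UserWarning,
    )
    return False


# 100 live points on 16 cores is fine even though 100 is not a multiple of 16.
ok = check_nlive(100, 16)
```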

ajshajib commented 5 years ago

@williamjameshandley Thanks a lot for the clarification. Yes, I am keeping nlive > ncores.

ejhigson commented 5 years ago

@ajshajib the new resume_dyn_run setting I added in 404c525d6512e6872f1086036ee098ec89dec0f5 should give you what you need. Let me know if there are any problems; otherwise I will close the issue. Note that this only allows resuming after the initial exploratory run is complete and the dynamic run has started.
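For anyone finding this later, here is a minimal sketch of how a resubmitted cluster job might use the setting. It assumes resume_dyn_run is passed as a keyword argument to run_dypolychord (please check the documentation for your installed version). The `launch` helper and its argument names are hypothetical, and the dyPolyChord function is passed in so the sketch can be tried without the library installed.

```python
def launch(run_dypolychord, run_callable, settings_dict_in,
           dynamic_goal=1.0, resume=False):
    """Start a dyPolyChord run, or resume one whose dynamic run has begun.

    run_dypolychord: normally dyPolyChord.run_dypolychord (passed in here
        so this hypothetical helper does not import the library itself).
    run_callable: the object that runs PolyChord for your likelihood.
    settings_dict_in: PolyChord settings; when resuming these must match
        the settings used when the run was first started.
    """
    # Copy so the caller's dictionary is left untouched.
    settings = dict(settings_dict_in)
    # Assumption: resume_dyn_run is a keyword argument of run_dypolychord.
    return run_dypolychord(run_callable, dynamic_goal, settings,
                           resume_dyn_run=resume)


# Sketch of use in a job script (requires dyPolyChord and PolyChord):
#
#   import dyPolyChord
#   settings = {'file_root': 'my_run', 'base_dir': 'chains', 'seed': 1}
#   launch(dyPolyChord.run_dypolychord, my_callable, settings, resume=True)
```

Using resume=True only makes sense once the initial exploratory run has finished, as noted above; the first submission would use resume=False.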

ajshajib commented 5 years ago

Thanks for adding the feature! I tested and it works perfectly.

ajshajib commented 5 years ago

@ejhigson Just in case, if the sampling stops during the initial exploratory run, can the sampler resume the initial static run from the last resume file, if I include read_resume: True in settings_dict_in to pass to PolyChord sampler from dyPolyChord.run_dynamic_ns.run_dypolychord()?

ejhigson commented 5 years ago

> @ejhigson Just in case, if the sampling stops during the initial exploratory run, can the sampler resume the initial static run from the last resume file, if I include read_resume: True in settings_dict_in to pass to PolyChord sampler from dyPolyChord.run_dynamic_ns.run_dypolychord()?

No, unfortunately this won't work, as during the initial exploratory run dyPolyChord periodically saves files to allow it to resume the initial run at different points. For this to work, dyPolyChord would need to not only resume the initial run but also check that the files it expected to have already created by this point exist with the expected filenames. Currently, if you try this, dyPolyChord will issue a warning and proceed with read_resume=False.

You can try and implement a change if you want and send a pull request but I think it may be a bit fiddly to do it safely so I don't think it is worth it (resuming after the dynamic nested sampling part of the process has started should be more useful).
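To make the current behaviour concrete, the fallback described above can be pictured roughly as follows. This is an illustrative sketch only, not dyPolyChord's actual code; the function name `drop_read_resume` and the warning text are made up for the example.

```python
import warnings


def drop_read_resume(settings_dict_in):
    """Return a copy of the settings with read_resume forced to False.

    Illustrates the described behaviour: a request to resume the initial
    exploratory run is not honoured, so warn and carry on without it.
    """
    settings = dict(settings_dict_in)  # do not modify the caller's dict
    if settings.get("read_resume", False):
        warnings.warn(
            "read_resume=True is not supported for the initial run; "
            "proceeding with read_resume=False",
            UserWarning,
        )
        settings["read_resume"] = False
    return settings


cleaned = drop_read_resume({"file_root": "my_run", "read_resume": True})
```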