brinckmann / montepython_public

Public repository for the Monte Python Code
MIT License

Is there a way to throw out points that segfaulted in CLASS? #295

Closed ash2223 closed 9 months ago

ash2223 commented 1 year ago

Hello! I'm using MontePython to run a modified version of CLASS that segfaults when certain parameters are too high or too low. When it does this, MontePython stops running and prints this output:

[screenshot of the MontePython error output, 2022-09-19]

Is it possible to throw out the rejected point and continue the run? I suspect it involves changing lines 748-764 in sampler.py. I would greatly appreciate some guidance on this. Thank you so much!

-Adam
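One complication worth noting: a segfault in compiled CLASS code kills the whole Python process, so unlike a CosmoComputationError it cannot be caught with an ordinary try/except in sampler.py. A generic way around this (not part of MontePython; `run_point`, `boundary_loglkl`, and the toy likelihood are all illustrative) is to isolate the likelihood call in a child process and treat a crashed child as a rejected point:

```python
# Sketch: surviving a segfault by isolating the (possibly crashing)
# likelihood call in a child process. All names are illustrative.
import multiprocessing as mp
import os
import signal

try:
    mp.set_start_method("fork")  # fork keeps this sketch self-contained
except (RuntimeError, ValueError):
    pass

boundary_loglkl = -1e30  # stand-in for MontePython's rejection value

def _worker(params, queue):
    """Child process: call the (possibly segfaulting) code here."""
    if params.get("bad", False):
        # Simulate a segfault in compiled CLASS code.
        os.kill(os.getpid(), signal.SIGSEGV)
    queue.put(-0.5 * params["x"] ** 2)  # toy log-likelihood

def run_point(params):
    """Parent: treat a crashed child as a rejected point instead of dying."""
    queue = mp.Queue()
    proc = mp.Process(target=_worker, args=(params, queue))
    proc.start()
    proc.join()
    if proc.exitcode != 0:      # child killed, e.g. SIGSEGV gives exitcode -11
        return boundary_loglkl  # reject the point, keep the chain alive
    return queue.get()
```

The per-point process overhead is nontrivial, so this is a last resort compared with preventing the crash inside CLASS itself.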

arsalanadil commented 1 year ago

Hi Adam,

Have you had any luck figuring out this problem? I'm facing a similar issue while using MultiNest, though the problem seems to stem from the sampler bumping into some unreasonable part of parameter space (as reported in https://github.com/brinckmann/montepython_public/issues/166). The run aborts with this segfault after a few hundred steps. I can then resume the run and, sure enough, it continues from where it left off. I was thinking of writing a script that resumes the run whenever it fails, but it would definitely be preferable to handle this from inside MontePython/CLASS.

Let me know if you find anything useful!

Best wishes, Arsalan

brinckmann commented 1 year ago

Hi Adam and Arsalan,

For Metropolis-Hastings I wrote a workaround on issue #266, although I worry it doesn't work for a seg fault, which might crash the chain anyway. In that case you'd gradually run out of running chains, though it could still work as long as the number of crashing chains is small compared to the total, so you can try. Can you predict when this might happen and instead return an error? That would be easier to deal with, as you can then catch the error.
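The pre-check suggested here could look roughly like the sketch below. CosmoComputationError is the exception class classy actually raises, but the parameter name, bounds, and toy likelihood are invented for illustration; MontePython's real handling lives in sampler.py.

```python
# Sketch: validate parameters before the compiled code runs, and raise a
# catchable error instead of letting the C side segfault. The stand-in
# exception mirrors classy.CosmoComputationError; names and bounds are invented.
class CosmoComputationError(Exception):
    """Stand-in for classy.CosmoComputationError."""

def check_params(params, low=1e-4, high=10.0):
    """Reject values the modified CLASS is known to crash on."""
    value = params["my_new_param"]  # hypothetical modified-CLASS parameter
    if not (low <= value <= high):
        raise CosmoComputationError(
            f"my_new_param = {value} outside stable range [{low}, {high}]")

def loglkl(params, boundary_loglkl=-1e30):
    """MontePython-style handling: a caught error becomes a rejected point."""
    try:
        check_params(params)
    except CosmoComputationError:
        return boundary_loglkl  # point rejected, chain keeps running
    return -0.5 * params["my_new_param"] ** 2  # toy log-likelihood
```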

For MultiNest I don't think you have a choice other than finding a way to prevent the seg fault. As far as I know it cannot handle seg faults: the whole run crashes and you have to continue from the last save point, assuming things didn't get corrupted (which happens sometimes, speaking from experience - make backups every restart!). If you can instead produce an error, a similar workaround should be possible even for MultiNest. I worry a little about whether it might affect the validity of the results, but I'm also not sure it would be a problem at all.

Best, Thejs

arsalanadil commented 1 year ago

Hi Thejs,

Thanks for your helpful comment. As in issue https://github.com/brinckmann/montepython_public/issues/266, I'm also using non-rectangular priors. My strategy was to catch this in classy (during the dictionary setup, before doing any computation): if a certain physical condition is violated, it throws a CosmoComputationError, which, if I understand correctly, should cause MontePython to return the boundary_loglkl value. In theory the two approaches should therefore be equivalent(?). Using your solution instead (i.e. catching the condition in MontePython's sampler.py) worked for MH, though NS was still throwing a segfault. It seems the issue was with MPI: I was initializing MultiNest to use multiple processors (using --ntasks=2 in SLURM, and --cpus-per-task for dedicating OpenMP threads to CLASS), which it does not seem to like. Setting --ntasks=1 seems to be working so far.

When you use NS on a cluster, how do you typically distribute the job? Using all the CPUs on a node just for the CLASS computation doesn't seem to be the most efficient use of resources.
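For reference, the single-task layout described above as working might look roughly like the SLURM fragment below. The core count and file paths are placeholders, and the OMP_NUM_THREADS export is a common SLURM/OpenMP convention rather than anything MontePython requires.

```shell
#!/bin/bash
#SBATCH --ntasks=1           # one MPI task for MultiNest (the workaround above)
#SBATCH --cpus-per-task=8    # cores reserved for CLASS's OpenMP threads

# Make the reserved cores visible to CLASS's OpenMP runtime.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Illustrative run line; parameter file and output directory are placeholders.
python montepython/MontePython.py run -p input/example.param -o chains/example -m NS
```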

I am also not sure if setting a prior in this manner is consistent with MultiNest's algorithm; will try to post an update once results become available.

Thanks again! Arsalan