Closed catherinewatkinson closed 5 years ago
Running under the new parametrisation (see below) crashes after >~3000 iterations.
```python
chain = mcmc.run_mcmc(
    core, power_spec, datadir='data',
    model_name=MODEL_NAME,
    params=dict(
        F_STAR10=[-1.301029996, -3, 0, 0.1],
        ALPHA_STAR=[0.5, -0.5, 1, 0.05],
        F_ESC10=[-1, -3, 0, 0.1],
        ALPHA_ESC=[-0.5, -1, 0.5, 0.05],
        M_TURN=[8.698970004, 8, 10, 0.1],
        t_STAR=[0.5, 0, 1, 0.05],
        L_X=[40.5, 38, 42, 0.15],
        NU_X_THRESH=[500, 100, 1500, 50],
    ),
    walkersRatio=WALK_RATIO,
    burninIterations=BURN,
    sampleIterations=ITER,
    threadCount=THREADS,
    continue_sampling=CONT,
)
```
Hmm, never encountered this one before. Appears to be a python multiprocessing issue occurring within emcee.
I will say, I have never run 21CMMC for this many iterations (it should be well and truly converged by this point). What I tend to run instead is a much larger number of walkers (i.e. a much larger walkersRatio) with a lower iteration number. It's not really a solution, but it should avoid the issue.
But, its something we'll try and look into.
> Hmm, never encountered this one before. Appears to be a python multiprocessing issue occurring within emcee.
The exception raised here is something I added not long ago, using the process Pool from `concurrent.futures`. Basically, when some C code in the background exits with SIGSEGV (or some other un-catchable signal), instead of just hanging, the pool crashes with this message. Unfortunately, it doesn't give much information about why it crashed.
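The behaviour described above can be reproduced in isolation. A minimal sketch (not 21CMMC code): a worker in a `concurrent.futures.ProcessPoolExecutor` kills itself with SIGSEGV, standing in for a crash inside compiled C code, and the pool surfaces only a generic `BrokenProcessPool` error:

```python
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def _segfault():
    # Stand-in for C code that dies with an un-catchable signal.
    os.kill(os.getpid(), signal.SIGSEGV)

def run_in_pool():
    # Submitting a task whose worker dies makes the whole pool unusable;
    # .result() then raises BrokenProcessPool with a generic message,
    # which is why the traceback says so little about the cause.
    try:
        with ProcessPoolExecutor(max_workers=1) as pool:
            pool.submit(_segfault).result()
    except BrokenProcessPool:
        return "pool broke"
    return "no crash"

if __name__ == "__main__":
    print(run_in_pool())
```

Note that `BrokenProcessPool` carries no information about which task or which arguments were in flight, only that a worker terminated abruptly.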
I think our best bet is to inspect the parameters that are being run on every iteration, and check which ones give the crash, and then try to reproduce from there. I'll comment back with instructions for how to run while writing out current parameters ASAP.
OK, sorry for getting back so late on this. There is now an option in run_mcmc called "save_precomputed_parameters". If you set it to True, a file will be created (and kept updated on every iteration) alongside the other data files that contains just the current parameters. It is saved before those parameters are evaluated, so if the run crashes, you can easily identify the parameter set that was in use. Note that if the crash happens in Python itself, the traceback will report the failing parameters anyway; this file is only useful when the C code crashes.
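The idea of writing parameters to disk before each evaluation can be sketched as follows. This is a hypothetical illustration, not the actual 21CMMC implementation; the function and file names are invented:

```python
import json

def evaluate_with_snapshot(evaluate, params, path="current_params.json"):
    # Hypothetical sketch: persist the parameter set *before* evaluating it,
    # so that a hard crash in the C layer still leaves the offending
    # parameters on disk for post-mortem inspection.
    with open(path, "w") as fh:
        json.dump(params, fh)
    return evaluate(params)

# Example: the snapshot file survives even if evaluate() never returns.
result = evaluate_with_snapshot(
    lambda p: sum(p.values()), {"L_X": 40.5, "t_STAR": 0.5}
)
```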
Actually, as I thought about this, it became apparent that this is a sub-par solution (if the crash happens and you haven't had parameter saving turned on, you'd have to run again to reproduce it, which might not happen, since the process is random!). I've switched this behaviour out for more consistent crash reporting, which now reports exactly the parameters that were being evaluated when the code crashed (this is a list of parameter sets, rather than just the single set that failed, because as far as I can tell the pool can't detect which process failed).
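The reporting described above could look roughly like this. Again a hypothetical sketch (names invented, not the 21CMMC code): since a broken pool cannot say which worker died, the whole in-flight batch of parameter sets is attached to the error:

```python
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def maybe_crash(x):
    # Demo task: negative inputs simulate a segfault in compiled code.
    if x < 0:
        os.kill(os.getpid(), signal.SIGSEGV)
    return x * 2

def evaluate_batch(func, param_sets):
    # The pool only reports that *some* worker died, so re-raise with the
    # entire batch of parameter sets that was being evaluated.
    try:
        with ProcessPoolExecutor(max_workers=2) as pool:
            return list(pool.map(func, param_sets))
    except BrokenProcessPool:
        raise RuntimeError(
            f"A worker crashed while evaluating one of: {param_sets}"
        ) from None
```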
Long story short -- don't worry about my previous comment. Hopefully if this error pops up again, you'll have something more useful to report so we can investigate!
Closing this issue as having been addressed, but if the crash happens again and you have some more details on why (in terms of the C code) please re-open and we can try fixing it at the source.
During tests, the code is crashing (on Ubuntu 16.04.10, running an Anaconda installation of Python 3.6.6 with gcc v5.4.0). It occurs after over 10,000 iterations have been run (at an apparently random point) in a two-parameter model. The crash report is as follows: