BradGreig / Hybrid21CM


Crashing during runtime - relatively rarely, after >~10000 iterations in 2 parameter model #23

Closed catherinewatkinson closed 5 years ago

catherinewatkinson commented 5 years ago

During tests, the code crashes (on Ubuntu 16.04.10, running an Anaconda installation of Python 3.6.6 with gcc v5.4.0). The crash occurs, apparently at random, after more than 10,000 iterations of a two-parameter model. Crash report as follows:

chain = mcmc.run_mcmc(core, likelihood, datadir='data', model_name=model_name, params=dict( HII_EFF_FACTOR = [30.0, 10.0, 50.0, 3.0], ION_Tvir_MIN = [4.7, 4, 6, 0.1],), walkersRatio=8, burninIterations=0, sampleIterations=1000000, threadCount=16, c$
 File "/home/caw11/anaconda2/envs/py366/lib/python3.6/site-packages/py21cmmc/mcmc/mcmc.py", line 123, in run_mcmc
   sampler.startSampling()
 File "/home/caw11/anaconda2/envs/py366/lib/python3.6/site-packages/py21cmmc/mcmc/cosmoHammer/CosmoHammerSampler.py", line 92, in startSampling
   self.sample(pos, prob, rstate, datas)
 File "/home/caw11/anaconda2/envs/py366/lib/python3.6/site-packages/py21cmmc/mcmc/cosmoHammer/CosmoHammerSampler.py", line 203, in sample
   return self._sample(burninPos, burninProb, burninRstate, datas)
 File "/home/caw11/anaconda2/envs/py366/lib/python3.6/site-packages/py21cmmc/mcmc/cosmoHammer/CosmoHammerSampler.py", line 176, in _sample
   lnprob0=prob, rstate0=rstate, blobs0=datas
 File "/home/caw11/anaconda2/envs/py366/lib/python3.6/site-packages/emcee/ensemble.py", line 259, in sample
   lnprob[S0])
 File "/home/caw11/anaconda2/envs/py366/lib/python3.6/site-packages/py21cmmc/mcmc/ensemble.py", line 71, in _propose_stretch
   newlnprob, blob = self._get_lnprob(q)
 File "/home/caw11/anaconda2/envs/py366/lib/python3.6/site-packages/emcee/ensemble.py", line 382, in _get_lnprob
   results = list(M(self.lnprobfn, [p[i] for i in range(len(p))]))
 File "/home/caw11/anaconda2/envs/py366/lib/python3.6/concurrent/futures/process.py", line 366, in _chain_from_iterable_of_lists
   for element in iterable:
 File "/home/caw11/anaconda2/envs/py366/lib/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
   yield fs.pop().result()
 File "/home/caw11/anaconda2/envs/py366/lib/python3.6/concurrent/futures/_base.py", line 432, in result
   return self.__get_result()
 File "/home/caw11/anaconda2/envs/py366/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
   raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
catherinewatkinson commented 5 years ago

Running under the new parametrisation (see below), it crashes earlier, after >~3000 iterations.

chain = mcmc.run_mcmc(
    core, power_spec, datadir='data', 
    model_name=MODEL_NAME, 
    params=dict(
        F_STAR10=[-1.301029996, -3, 0, 0.1], 
        ALPHA_STAR=[0.5, -0.5, 1, 0.05], 
        F_ESC10=[-1, -3, 0, 0.1], 
        ALPHA_ESC=[-0.5, -1, 0.5, 0.05], 
        M_TURN=[8.698970004, 8, 10, 0.1], 
        t_STAR=[0.5, 0, 1, 0.05], 
        L_X=[40.5, 38, 42, 0.15], 
        NU_X_THRESH=[500, 100, 1500, 50] ), 
    walkersRatio=WALK_RATIO, 
    burninIterations=BURN, 
    sampleIterations=ITER, 
    threadCount=THREADS, 
    continue_sampling=CONT 
)
BradGreig commented 5 years ago

Hmm, never encountered this one before. Appears to be a python multiprocessing issue occurring within emcee.

I will say, I have never run 21CMMC for this many iterations though (it should be well and truly converged by this point). What I tend to run is a much larger number of walkers (i.e. a much larger walkersRatio) and a lower iteration number. It's not really a solution, but it should avoid the issue.

But it's something we'll try to look into.

steven-murray commented 5 years ago

> Hmm, never encountered this one before. Appears to be a python multiprocessing issue occurring within emcee.

The exception raised here comes from something I added not too long ago: using the process pool from concurrent.futures. Basically, when some C code in the background dies with SIGSEGV (or fails in some other way that Python cannot catch), the pool crashes with this message instead of just hanging. Unfortunately, it doesn't give much info about why it crashed.
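For anyone wanting to see this failure mode in isolation, here is a minimal sketch that is independent of py21cmmc and emcee. The function names `evaluate` and `run_batch` are made up for illustration, the segfault is simulated by having the worker send SIGSEGV to itself, and the "fork" start method is assumed (so a Unix-like system).

```python
import multiprocessing
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def evaluate(x):
    """Hypothetical stand-in for a likelihood call; x == 3 simulates a C crash."""
    if x == 3:
        # Kill this worker the way a segfault in compiled code would.
        os.kill(os.getpid(), signal.SIGSEGV)
    return x * x


def run_batch(values):
    """Map `evaluate` over `values`; return None if a worker died."""
    ctx = multiprocessing.get_context("fork")  # assumption: Unix-like OS
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as pool:
        try:
            return list(pool.map(evaluate, values))
        except BrokenProcessPool:
            # No Python traceback exists for the dead worker; this exception
            # only says that some process went away.
            return None
```

Calling `run_batch([1, 2])` returns the squares as usual, while `run_batch([1, 2, 3, 4])` returns `None` because the worker handling `3` is terminated and the pool is marked as broken.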

I think our best bet is to inspect the parameters that are being run on every iteration, and check which ones give the crash, and then try to reproduce from there. I'll comment back with instructions for how to run while writing out current parameters ASAP.

steven-murray commented 5 years ago

OK, sorry for getting back so late on this. There is now an option in run_mcmc called "save_precomputed_parameters". If you set that to True, a file will be created (and kept updated on every iteration) alongside the other data files, which literally just has the current parameters in it. It gets saved before those parameters are evaluated, so if the run crashes, you can easily identify the set of parameters that was being used. Note that if the crash happened in Python itself, the traceback will report the current parameters (exactly the ones that failed) anyway. This is only useful when the C code crashes.
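The idea of writing the parameters out before evaluating them can be sketched roughly as follows. This is not the py21cmmc implementation; `save_current_parameters`, `evaluate_with_breadcrumb`, the JSON format, and the file path are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_current_parameters(path, params):
    """Write `params` to `path` atomically, so a crash never leaves half a file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(params, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data is on disk before the rename
    os.replace(tmp, path)  # atomic rename over any previous file


def evaluate_with_breadcrumb(path, params, lnprob):
    """Record `params` first, then evaluate; the file survives a hard crash."""
    save_current_parameters(path, params)
    return lnprob(params)
```

Because the file is written before `lnprob` runs, it still holds the offending parameters even if the evaluation takes the whole process down.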

steven-murray commented 5 years ago

Actually, as I thought about this, it became apparent that this is a sub-par solution: if the crash happens and you didn't have parameter saving turned on, you'd have to run again to get the parameters, and the crash might not recur since it is random. I've switched this behavior out for more consistent crash reporting, which now reports exactly the parameters that were being evaluated when the code crashed (this is a list of parameter sets, rather than just the single set that failed, because as far as I can tell it can't detect which process failed).

Long story short -- don't worry about my previous comment. Hopefully if this error pops up again, you'll have something more useful to report so we can investigate!
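As a rough illustration of this kind of crash reporting (again a sketch, not the actual py21cmmc code), the pool call can be wrapped so that a broken pool is re-raised together with every parameter set in the batch. `PoolCrashError`, `lnprob`, and `evaluate_all` are hypothetical names, a negative first parameter stands in for whatever makes the C code crash, and the "fork" start method is assumed.

```python
import multiprocessing
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


class PoolCrashError(RuntimeError):
    """Raised when a worker dies; carries the whole batch of parameter sets."""

    def __init__(self, param_sets):
        super().__init__(
            "a worker died while evaluating one of: %r" % (param_sets,)
        )
        self.param_sets = param_sets


def lnprob(params):
    """Toy log-probability; a negative first parameter simulates a hard crash."""
    if params[0] < 0:
        os.kill(os.getpid(), signal.SIGKILL)
    return -0.5 * sum(p * p for p in params)


def evaluate_all(param_sets, max_workers=2):
    """Evaluate every parameter set in parallel, reporting the batch on a crash."""
    param_sets = [list(p) for p in param_sets]
    ctx = multiprocessing.get_context("fork")  # assumption: Unix-like OS
    with ProcessPoolExecutor(max_workers=max_workers, mp_context=ctx) as pool:
        try:
            return list(pool.map(lnprob, param_sets))
        except BrokenProcessPool as exc:
            # The pool cannot say which worker died, so report the full batch.
            raise PoolCrashError(param_sets) from exc
```

The error lists all parameter sets in flight rather than one, so narrowing down the culprit still means re-running each candidate individually.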

steven-murray commented 5 years ago

Closing this issue as having been addressed, but if the crash happens again and you have some more details on why (in terms of the C code) please re-open and we can try fixing it at the source.