BradGreig / Hybrid21CM


Error for certain range of parameters #30

Closed: BellaNasirudin closed this issue 5 years ago

BellaNasirudin commented 5 years ago

I tried a 4-parameter run with the following values:

params = dict(  # Parameter dict as described above.
    HII_EFF_FACTOR=[20.0, 10.0, 250.0, 3.0],
    ION_Tvir_MIN=[3.0, 1, 100, 0.1],
    L_X=[40.5, 38, 42, 0.1],
    NU_X_THRESH=[500, 100, 1500, 50],
)

But I am getting this error:

Traceback (most recent call last):
  File "/group/mwaeor/bnasirudin/PYENV/lib/python3.6/site-packages/py21cmmc/mcmc/ensemble.py", line 258, in _get_lnprob
    return super()._get_lnprob(pos)
  File "/group/mwaeor/bnasirudin/PYENV/lib/python3.6/site-packages/emcee/ensemble.py", line 382, in _get_lnprob
    results = list(M(self.lnprobfn, [p[i] for i in range(len(p))]))
  File "/pawsey/sles12sp3/apps/gcc/4.8.5/python/3.6.3/lib/python3.6/concurrent/futures/process.py", line 366, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/pawsey/sles12sp3/apps/gcc/4.8.5/python/3.6.3/lib/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "/pawsey/sles12sp3/apps/gcc/4.8.5/python/3.6.3/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/pawsey/sles12sp3/apps/gcc/4.8.5/python/3.6.3/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
Traceback (most recent call last):
  File "mwa_noise_nofg_full_run.py", line 44, in <module>
    model_name=model_name,             # Filename of main chain output
  File "/group/mwaeor/bnasirudin/updated_py21cmmc_fg/py21cmmc_fg/devel/test_series/base_definitions_full_run.py", line 159, in run_mcmc
    **kwargs
  File "/group/mwaeor/bnasirudin/PYENV/lib/python3.6/site-packages/py21cmmc/mcmc/mcmc.py", line 164, in run_mcmc
    sampler.startSampling()
  File "/group/mwaeor/bnasirudin/PYENV/lib/python3.6/site-packages/py21cmmc/mcmc/cosmoHammer/CosmoHammerSampler.py", line 94, in startSampling
    self.sample(pos, prob, rstate, datas)
  File "/group/mwaeor/bnasirudin/PYENV/lib/python3.6/site-packages/py21cmmc/mcmc/cosmoHammer/CosmoHammerSampler.py", line 209, in sample
    return self._sample(burninPos, burninProb, burninRstate, datas)
  File "/group/mwaeor/bnasirudin/PYENV/lib/python3.6/site-packages/py21cmmc/mcmc/cosmoHammer/CosmoHammerSampler.py", line 178, in _sample
    lnprob0=prob, rstate0=rstate, blobs0=datas
  File "/group/mwaeor/bnasirudin/PYENV/lib/python3.6/site-packages/py21cmmc/mcmc/ensemble.py", line 221, in sample
    lnprob[S0])
  File "/group/mwaeor/bnasirudin/PYENV/lib/python3.6/site-packages/py21cmmc/mcmc/ensemble.py", line 77, in _propose_stretch
    newlnprob, blob = self._get_lnprob(q)
  File "/group/mwaeor/bnasirudin/PYENV/lib/python3.6/site-packages/py21cmmc/mcmc/ensemble.py", line 258, in _get_lnprob
    return super()._get_lnprob(pos)
  File "/group/mwaeor/bnasirudin/PYENV/lib/python3.6/site-packages/emcee/ensemble.py", line 382, in _get_lnprob
    results = list(M(self.lnprobfn, [p[i] for i in range(len(p))]))
  File "/pawsey/sles12sp3/apps/gcc/4.8.5/python/3.6.3/lib/python3.6/concurrent/futures/process.py", line 366, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/pawsey/sles12sp3/apps/gcc/4.8.5/python/3.6.3/lib/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "/pawsey/sles12sp3/apps/gcc/4.8.5/python/3.6.3/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/pawsey/sles12sp3/apps/gcc/4.8.5/python/3.6.3/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

According to the log, the crash is caused by one of these parameter sets:

BrokenProcessPool exception (most likely an unrecoverable crash in C-code). 

  Due to the nature of this exception, it is impossible to know which of the following parameter 
  vectors were responsible for the crash. Running your likelihood function with each set
  of parameters in serial may help identify the problem.

  params: [[  10.06541604    3.11151112   41.78655541  180.41686159]
           [  60.17506568    2.50292984   38.2202358   242.78809371]
           [  18.97050506    2.68941325   39.74769184  738.3964061 ]
           [  24.53481405    3.75349005   39.96549189  733.16242217]
           [  11.36524806    5.44773919   39.58850157 1082.43101063]
           [  12.56871109    2.8827531    40.21327025  108.39579539]
           [  16.30566918    3.53050365   40.0227228   568.0812901 ]
           [  10.14339114    4.96885941   41.19516333  532.67160518]
           [  11.3916606     6.81043577   39.5305134  1288.64045945]
           [  14.75523148    3.87760746   40.58954876  777.45722608]
           [  10.36850974    3.13998029   40.25528709  664.38857844]
           [  14.94170538    4.33920275   39.9024568   706.35818372]
           [  38.79113987    1.37707039   41.08550366  605.67724173]
           [  14.9665405     5.8629486    40.47444626  573.45487626]
           [  10.80565254    3.90627723   40.31337903  612.16313433]
           [  79.02102658    1.60462401   39.40910121  274.87129869]
           **[  13.03970787   10.66930846   38.54805908  273.53996706]**
           [  15.43921183    3.48638858   40.65417193  172.75993724]
           [  34.22687204    3.27181617   40.92138859  384.56414316]
           [  36.55988518    3.16500598   39.79626781 1022.54467073]
           [  18.37422856    3.31974344   40.58156533  573.96534319]
           [  21.54037738    3.46860117   40.5520102   543.13510574]
           [  15.51890305    4.00252894   41.1012786   999.52970818]
           [  16.04387361    4.53125358   40.1283463   624.29636452]
           [  21.04337816    3.71668455   39.69963573  753.44445821]
           [  25.82284389    3.18143221   40.82237578  709.34218706]
           [  13.74868071    4.04906808   40.4264593   556.97174848]
           [  14.54751271    3.01973512   41.14025161  519.01736797]
           [  11.91793825    3.77234532   40.89449506  695.69609986]
           [  10.64423324    2.02155485   40.52775645  263.19636544]
           [  31.66027092    3.08588962   39.47131457  649.3805506 ]
           [  11.29139777    2.64784554   40.48431347  558.85751499]]
  args: []
  kwargs: {}
  exception:

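Following the suggestion in that message, each vector can be evaluated in its own single-worker pool, so that a hard C-level crash (which cannot be caught with a plain in-process try/except) only breaks that pool and the loop can move on. A sketch, where the lnprob import is a hypothetical stand-in for the actual likelihood callable from the run script:

import numpy as np
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

from mwa_noise_nofg_full_run import lnprob  # hypothetical: the real likelihood callable

vectors = np.array([
    [10.06541604, 3.11151112, 41.78655541, 180.41686159],
    [13.03970787, 10.66930846, 38.54805908, 273.53996706],
    # ... the remaining rows from the message above ...
])

for i, theta in enumerate(vectors):
    # A fresh single-worker pool per vector: if the C code aborts, only this
    # pool is broken, and the loop continues with the next vector.
    with ProcessPoolExecutor(max_workers=1) as pool:
        try:
            print(i, theta, pool.submit(lnprob, theta).result())
        except BrokenProcessPool:
            print(i, theta, "-> worker crashed; likely an offending vector")
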
I then ran Hybrid21CM again with these parameters and got the following error for the values in bold above:

2019-04-08 11:24:12 | INFO    | UsefulFunctions.c | writeAstroParams:553 [pid=31775] | AstroParams: [HII_EFF_FACTOR=13.039708, ION_Tvir_MIN=46699094016.000000, X_RAY_Tvir_MIN=46699094016.000000, R_BUBBLE_MAX=15.000000, L_X=3.532312e+38, NU_X_THRESH=273.539978, X_RAY_SPEC_INDEX=1.000000, F_STAR10=0.050119, F_STAR=0.500000, N_RSD_STEPS=0.000000]
2019-04-08 11:24:12 | INFO    | UsefulFunctions.c | writeFlagOptions:559 [pid=31775] | AstroParams: [USE_MASS_DEPENDENT_ZETA=0, SUBCELL_RSD=0, INHOMO_RECO=0, USE_TS_FLUCT=0]
2019-04-08 11:24:12 | SUPER-DEBUG | IonisationBox.c | ComputeIonizedBox:69 [pid=31775] | defined parameters
2019-04-08 11:24:12 | SUPER-DEBUG | IonisationBox.c | ComputeIonizedBox:136 [pid=31775] | erfc interpolation done
2019-04-08 11:24:12 | SUPER-DEBUG | IonisationBox.c | ComputeIonizedBox:179 [pid=31775] | density field calculated
2019-04-08 11:24:12 | SUPER-DEBUG | IonisationBox.c | ComputeIonizedBox:210 [pid=31775] | minimum source mass has been set: 1086022819126444032.000000
2019-04-08 11:24:12 | SUPER-DEBUG | IonisationBox.c | ComputeIonizedBox:216 [pid=31775] | sigma table has been initialised
2019-04-08 11:24:12 | ULTRA-DEBUG | ps.c            | FgtrM_General:862 [pid=31775] | integration range: 41.529054 to 46.134224
gsl: qag.c:261: ERROR: could not integrate function
Default GSL error handler invoked.
/var/spool/slurm/job2979243/slurm_script: line 21: 31775 Aborted                 (core dumped) python3 test_chain.py
BradGreig commented 5 years ago

Hi Bella,

This reminds me that I still need to add parameter ranges to the documentation!

For ION_Tvir_MIN you should set the range to [4.0, 6.0] (the parameter is sampled in log10 of the virial temperature in K). It can go below 4.0, but that generates a discontinuity in Tvir, as the conversion to halo mass switches gas types there. Using this range would also be consistent with all of our other works that use ION_Tvir_MIN.
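Applied to the dict from the original post, that might look like the following (a sketch; the fiducial value is moved inside the recommended range, the other entries are kept as posted):

params = dict(
    HII_EFF_FACTOR=[20.0, 10.0, 250.0, 3.0],
    ION_Tvir_MIN=[4.7, 4.0, 6.0, 0.1],  # log10(Tvir/K), restricted to the recommended range
    L_X=[40.5, 38, 42, 0.1],
    NU_X_THRESH=[500, 100, 1500, 50],
)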

steven-murray commented 5 years ago

@BradGreig along with adding parameter ranges to the documentation, we should also investigate this specific case and raise a meaningful exception from the C code when we know a parameter set will fail. The MCMC sampler could then catch it, return -inf, and continue rather than crashing (see the sketch below).

i.e. this is related to #19
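
A minimal sketch of what that could look like on the Python side, assuming the C wrapper is changed to raise a dedicated exception for known-bad parameter sets (ParameterError is a hypothetical name here):

import numpy as np

class ParameterError(Exception):
    """Hypothetical: raised by the C wrapper for parameter sets known to fail."""

def safe_lnprob(lnprob):
    # Wrap the likelihood so recoverable parameter errors reject the proposal
    # instead of killing the worker process and breaking the pool.
    def wrapped(theta, *args, **kwargs):
        try:
            return lnprob(theta, *args, **kwargs)
        except ParameterError:
            return -np.inf  # adjust to (lnprob, blobs) if the sampler expects blobs
    return wrapped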

ghost commented 5 years ago

Just to add to this, in case our prior ranges are equally flawed.

I have some students running the higher parameterisation (i.e. that of Park+ 2018, but without spin temperature) who are experiencing a similar problem. Our params dictionary looks like this:

FitParams = dict(
    F_STAR10=[-1.3, -3.0, 0.0, 0.1],
    ALPHA_STAR=[0.5, -0.5, 1.0, 0.05],
    F_ESC10=[-1.0, -3.0, 0.0, 0.1],
    ALPHA_ESC=[-0.5, -1.0, 0.5, 0.05],
    M_TURN=[8.7, 8.0, 10.0, 0.1],
    t_STAR=[0.5, 0.0, 1.0, 0.05],
)

With 8 threads the code consistently crashes after ~40 iterations, sometimes returning the same exception as described above. However, they have observed that if they run in batches of 30 iterations, continuing sampling from the previous batch (roughly as in the sketch below), it stalls much less often. I wonder if there might be a memory-leak issue of some sort on top of the discontinuity issues in some regions of parameter space? Apologies if such a memory-leak issue has already been raised elsewhere.
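
A sketch of that batching workaround, assuming run_mcmc accepts continue_sampling and sampleIterations (as in later 21CMMC releases), and with core, likelihood, and the model name standing in for the configured modules and output name:

from py21cmmc.mcmc import run_mcmc  # import path assumed from the tracebacks above

for batch in range(10):
    chain = run_mcmc(
        core, likelihood,               # the configured core and likelihood modules
        params=FitParams,
        model_name="higher_param_run",  # hypothetical output name
        sampleIterations=30,            # short batches rather than one long run
        continue_sampling=(batch > 0),  # resume from the stored chain after the first batch
    )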

steven-murray commented 5 years ago

@caw11 I did find a memory-leak issue a while back, and I think I fixed it (see https://github.com/BradGreig/Hybrid21CM/issues/29#issuecomment-476333842). However, with these kinds of issues it is hard to be certain, and what your students are seeing certainly smells like a memory leak. There is a script in the devel/ directory called memory_leak_test.py which you could run, with a bit of modification to suit your purposes, to check whether there is a leak. If you get the time to do that, definitely let us know the outcome!
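
Even a crude probe along these lines can reveal a leak, by evaluating the likelihood repeatedly with a fixed parameter vector and watching peak memory (lnprob and theta0 are hypothetical stand-ins for the configured likelihood callable and any known-good parameter set):

import resource

from my_run_script import lnprob, theta0  # hypothetical stand-ins

def peak_rss_mb():
    # Peak resident set size of this process; Linux reports ru_maxrss in kB.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

for i in range(20):
    lnprob(theta0)  # repeated identical evaluations should not grow memory
    print(f"iteration {i}: peak RSS = {peak_rss_mb():.1f} MB")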

BellaNasirudin commented 5 years ago

I have been (and still am) experiencing a memory-leak issue as well. I am using commit 450d0966cd8b92beae0052f41466dd605da313ae.

In my case it seems to come from the storage module in 21cmmc: when I opted out of storing, the code ran through perfectly fine and used only 100 GB. The downside, of course, is that I cannot analyse the walkers if something goes wrong with the MCMC, and would have to run everything again.

With the storage module included, I am using at least 915 GB, so it is very computationally expensive.

steven-murray commented 5 years ago

@BellaNasirudin thanks for the report. This seems to be a different issue than the one you originally posted here. Can you file a separate issue for it, and create a minimum working example? When you say "storage" module, do you mean storing the arbitrary derived data?

steven-murray commented 5 years ago

@caw11 your original issue has moved to https://github.com/21cmfast/21cmFAST/issues/16.

@caw11 and @BellaNasirudin the memory leak issue has been moved to https://github.com/21cmfast/21CMMC/issues/4