brinckmann / montepython_public

Public repository for the Monte Python Code
MIT License

Segmentation fault #261

Closed by ritase7e 2 years ago

ritase7e commented 2 years ago

Hello,

I am trying to run chains with the following 'cosmo' parameters in my .param file:

data.parameters['omega_b']      = [  2.2377,   None, None,      0.015, 0.01, 'cosmo']
data.parameters['omega_cdm']    = [ 0.12010,   None, None,     0.0013,    1, 'cosmo']
data.parameters['100*theta_s']  = [ 1.04110,   None, None,    0.00030,    1, 'cosmo']
data.parameters['ln10^{10}A_s'] = [  3.0447,   None, None,      0.015,    1, 'cosmo']
data.parameters['n_s']          = [  0.9659,   None, None,     0.0042,    1, 'cosmo']
data.parameters['tau_reio']     = [  0.0543,  0.004, None,      0.008,    1, 'cosmo']
data.parameters['A_L']          = [    1,   None, None,    0.5,    1, 'cosmo']

These are just the 6 LCDM parameters plus A_L, a lensing parameter that is already defined in CLASS under that name. I have not modified CLASS.

In runs with only 6 of these 7 parameters, in any combination, I have no issue: it does not complain about any of the parameters or anything else. However, runs with all 7 parameters last at most an hour (sometimes far less) and then produce a segmentation fault:

/var/spool/slurm/job07053/slurm_script: line 15: 21290 Segmentation fault (core dumped)

It seems to run fine until that point, as the chain .txt file is being filled while it runs. That is the only error it produces, so I don't have any more clues as to what might be causing it.

If I run with mpirun the error is:

mpirun noticed that process rank 3 with PID 19178 on node ftlab21 exited on signal 11 (Segmentation fault).

(maybe there is some extra information here that could be a clue for someone).

Finally, if I try to restart the chain, it only computes for a few minutes before giving the same error again. Running a new chain from the same folder (so that it starts from the log.param) does the same most of the time. However, if I run again from the original .param file into a new output folder, it is able to run again for about an hour (or less, but still significantly longer than when I try in the same folder).

I am running on a cluster, in case that is relevant. I have asked the IT person for help, and he has checked that it is not a problem with the machine (by running on different machines) or with memory or processing power (by monitoring these as the process runs).

Any help would be appreciated.

I'll leave below the full .param file in case there is something there or anyone is willing to run this and see if it produces the same error (it is just the base2018TTTEEE_lensing.param with the addition of 'A_L').

Thanks in advance, Rita Neves

Edit to add that I tried running again today and the error now gives a little more information (don't know why), which might mean something:

Fatal Python error: Segmentation fault

Current thread 0x00007f3781c11540 (most recent call first):
  File "/home/rneves/software/montepython_public/montepython/likelihood_class.py", line 1050 in loglkl
  File "/home/rneves/software/montepython_public/montepython/sampler.py", line 776 in compute_lkl
  File "/home/rneves/software/montepython_public/montepython/mcmc.py", line 787 in chain
  File "/home/rneves/software/montepython_public/montepython/sampler.py", line 46 in run
  File "/home/rneves/software/montepython_public/montepython/run.py", line 45 in run
  File "montepython/MontePython.py", line 42 in <module>
/var/spool/slurm/job07310/slurm_script: line 15: 44526 Segmentation fault      (core dumped) python3.6 montepython/MontePython.py run -p input/base2018TTTEEE_lensing.param -o chains/planck/lasttest -N 10000
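
A traceback in this format ("Fatal Python error: Segmentation fault" followed by "most recent call first") is what Python's standard `faulthandler` module prints when it is enabled. How it came to be enabled in this run is not stated in the thread. A minimal sketch of turning it on explicitly, assuming you can add two lines near the top of the entry script (where exactly to put them in Monte Python is an assumption, not something the source confirms):

```python
import faulthandler
import sys

# Dump the Python-level traceback of all threads if the process receives
# a fatal signal such as SIGSEGV. sys.__stderr__ is the interpreter's
# original stderr, which has a real file descriptor.
faulthandler.enable(file=sys.__stderr__, all_threads=True)
```

The same effect should be available without editing any code by starting the interpreter with `python3 -X faulthandler ...` or by setting the environment variable `PYTHONFAULTHANDLER=1` in the job script.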

------------------------ .param file I am using ------------------------

#------Experiments to test (separated with commas)-----

data.experiments=['Planck_highl_TTTEEE', 'Planck_lowl_EE', 'Planck_lowl_TT', 'Planck_lensing']

#------ Settings for the over-sampling.
# The first element will always be set to 1, for it is the sampling of the
# cosmological parameters. The other numbers describe the over sampling of the
# nuisance parameter space. This array must have the same dimension as the
# number of blocks in your run (so, 1 for cosmological parameters, and then 1
# for each experiment with varying nuisance parameters).
# Note that when using Planck likelihoods, you definitely want to use [1, 4],
# to oversample as much as possible the 14 nuisance parameters.
# Remember to order manually the experiments from slowest to fastest (putting
# Planck as the first set of experiments should be a safe bet, except if you
# also have LSS experiments).
# If you have experiments without nuisance, you do not need to specify an
# additional entry in the over_sampling list (notice for instance that, out of
# the three Planck likelihoods used, only Planck_highl requires nuisance
# parameters, therefore over_sampling has a length of two (cosmology, plus one
# set of nuisance).
data.over_sampling=[1, 5]

#------ Parameter list -------

# data.parameters[class name] = [mean, min, max, 1-sigma, scale, role]
# - if min max irrelevant, put to None
# - if fixed, put 1-sigma to 0
# - if scale irrelevant, put to 1, otherwise to the appropriate factor
# - role is either 'cosmo', 'nuisance' or 'derived'. You should put the derived
# parameters at the end, and in case you are using the `-j fast` Cholesky
# decomposition, you should order your nuisance parameters from slowest to
# fastest.

# Cosmological parameters list

data.parameters['omega_b']      = [  2.2377,   None, None,      0.015, 0.01, 'cosmo']
data.parameters['omega_cdm']    = [ 0.12010,   None, None,     0.0013,    1, 'cosmo']
data.parameters['100*theta_s']  = [ 1.04110,   None, None,    0.00030,    1, 'cosmo']
data.parameters['ln10^{10}A_s'] = [  3.0447,   None, None,      0.015,    1, 'cosmo']
data.parameters['n_s']          = [  0.9659,   None, None,     0.0042,    1, 'cosmo']
data.parameters['tau_reio']     = [  0.0543,  0.004, None,      0.008,    1, 'cosmo']
data.parameters['A_L']          = [    1,   None, None,    0.5,    1, 'cosmo']

# Nuisance parameter list, same call, except the name does not have to be a class name

data.parameters['A_cib_217']         = [    47.2,     0,   200,     6.2593,     1, 'nuisance']
data.parameters['cib_index']         = [    -1.3,  -1.3,  -1.3,          0,     1, 'nuisance']
data.parameters['xi_sz_cib']         = [    0.42,     0,     1,       0.33,     1, 'nuisance']
data.parameters['A_sz']              = [    7.23,     0,    10,     1.4689,     1, 'nuisance']
data.parameters['ps_A_100_100']      = [   251.0,     0,   400,     29.438,     1, 'nuisance']
data.parameters['ps_A_143_143']      = [    47.4,     0,   400,     9.9484,     1, 'nuisance']
data.parameters['ps_A_143_217']      = [    47.3,     0,   400,     11.356,     1, 'nuisance']
data.parameters['ps_A_217_217']      = [   119.8,     0,   400,     10.256,     1, 'nuisance']
data.parameters['ksz_norm']          = [    0.01,     0,    10,     2.7468,     1, 'nuisance']
data.parameters['gal545_A_100']      = [    8.86,     0,    50,     1.8928,     1, 'nuisance']
data.parameters['gal545_A_143']      = [   11.10,     0,    50,     1.8663,     1, 'nuisance']
data.parameters['gal545_A_143_217']  = [    19.8,     0,   100,     3.8796,     1, 'nuisance']
data.parameters['gal545_A_217']      = [    95.1,     0,   400,     6.9759,     1, 'nuisance']
data.parameters['galf_EE_A_100']     = [   0.055, 0.055, 0.055,          0,     1, 'nuisance']
data.parameters['galf_EE_A_100_143'] = [   0.040, 0.040, 0.040,          0,     1, 'nuisance']
data.parameters['galf_EE_A_100_217'] = [   0.094, 0.094, 0.094,          0,     1, 'nuisance']
data.parameters['galf_EE_A_143']     = [   0.086, 0.086, 0.086,          0,     1, 'nuisance']
data.parameters['galf_EE_A_143_217'] = [    0.21,  0.21,  0.21,          0,     1, 'nuisance']
data.parameters['galf_EE_A_217']     = [    0.70,  0.70,  0.70,          0,     1, 'nuisance']
data.parameters['galf_EE_index']     = [    -2.4,  -2.4,  -2.4,          0,     1, 'nuisance']
data.parameters['galf_TE_A_100']     = [   0.114,     0,    10,   0.038762,     1, 'nuisance']
data.parameters['galf_TE_A_100_143'] = [   0.134,     0,    10,   0.030096,     1, 'nuisance']
data.parameters['galf_TE_A_100_217'] = [   0.482,     0,    10,   0.086185,     1, 'nuisance']
data.parameters['galf_TE_A_143']     = [   0.224,     0,    10,   0.055126,     1, 'nuisance']
data.parameters['galf_TE_A_143_217'] = [   0.664,     0,    10,   0.082349,     1, 'nuisance']
data.parameters['galf_TE_A_217']     = [    2.08,     0,    10,    0.27175,     1, 'nuisance']
data.parameters['galf_TE_index']     = [    -2.4,  -2.4,  -2.4,          0,     1, 'nuisance']
data.parameters['calib_100T']        = [  999.69,     0,  3000,    0.61251, 0.001, 'nuisance']
data.parameters['calib_217T']        = [  998.16,     0,  3000,    0.63584, 0.001, 'nuisance']
data.parameters['calib_100P']        = [   1.021, 1.021, 1.021,          0,     1, 'nuisance']
data.parameters['calib_143P']        = [   0.966, 0.966, 0.966,          0,     1, 'nuisance']
data.parameters['calib_217P']        = [   1.040, 1.040, 1.040,          0,     1, 'nuisance']
data.parameters['A_cnoise_e2e_100_100_EE'] = [ 1,     1,     1,          0,     1, 'nuisance']
data.parameters['A_cnoise_e2e_143_143_EE'] = [ 1,     1,     1,          0,     1, 'nuisance']
data.parameters['A_cnoise_e2e_217_217_EE'] = [ 1,     1,     1,          0,     1, 'nuisance']
data.parameters['A_sbpx_100_100_TT'] = [       1,     1,     1,          0,     1, 'nuisance']
data.parameters['A_sbpx_143_143_TT'] = [       1,     1,     1,          0,     1, 'nuisance']
data.parameters['A_sbpx_143_217_TT'] = [       1,     1,     1,          0,     1, 'nuisance']
data.parameters['A_sbpx_217_217_TT'] = [       1,     1,     1,          0,     1, 'nuisance']
data.parameters['A_sbpx_100_100_EE'] = [       1,     1,     1,          0,     1, 'nuisance']
data.parameters['A_sbpx_100_143_EE'] = [       1,     1,     1,          0,     1, 'nuisance']
data.parameters['A_sbpx_100_217_EE'] = [       1,     1,     1,          0,     1, 'nuisance']
data.parameters['A_sbpx_143_143_EE'] = [       1,     1,     1,          0,     1, 'nuisance']
data.parameters['A_sbpx_143_217_EE'] = [       1,     1,     1,          0,     1, 'nuisance']
data.parameters['A_sbpx_217_217_EE'] = [       1,     1,     1,          0,     1, 'nuisance']
data.parameters['A_planck']          = [ 1.00061,   0.9,   1.1,     0.0025,     1, 'nuisance']
data.parameters['A_pol']             = [       1,     1,     1,          0,     1, 'nuisance']

# Derived parameters

data.parameters['z_reio']          = [1, None, None, 0,     1,   'derived']
data.parameters['Omega_Lambda']    = [1, None, None, 0,     1,   'derived']
data.parameters['YHe']             = [1, None, None, 0,     1,   'derived']
data.parameters['H0']              = [0, None, None, 0,     1,   'derived']
data.parameters['A_s']             = [0, None, None, 0,  1e-9,   'derived']
data.parameters['sigma8']          = [0, None, None, 0,     1,   'derived']

# Other cosmo parameters (fixed parameters, precision parameters, etc.)

data.cosmo_arguments['sBBN file'] = data.path['cosmo']+'/external/bbn/sBBN.dat'
data.cosmo_arguments['k_pivot'] = 0.05

# The base model features two massless
# and one massive neutrino with m=0.06eV.
# The settings below ensures that Neff=3.046
# and m/omega = 93.14 eV
data.cosmo_arguments['N_ur'] = 2.0328
data.cosmo_arguments['N_ncdm'] = 1
data.cosmo_arguments['m_ncdm'] = 0.06
data.cosmo_arguments['T_ncdm'] = 0.71611

# These two are required to get sigma8 as a derived parameter
# (class must compute the P(k) until sufficient k)
data.cosmo_arguments['output'] = 'mPk'
data.cosmo_arguments['P_k_max_h/Mpc'] = 1.

# The Planck Lensing likelihood is more precise when the non-linear effects are taken
# into consideration. For this you can use halofit (default) or hmcode.
# If you are running an exotic model for which the non-linearities cannot be
# computed with either of these codes, you are advised to comment out the following line.
data.cosmo_arguments['non linear'] = 'halofit'

#------ Mcmc parameters ----

data.N=10
data.write_step=5

brinckmann commented 2 years ago

Hi Rita,

You might want to check if your parameters are leaving the valid range of CLASS since they're not restricted by prior limits. Normally that's not a problem for runs including Planck, but since you're adding A_L I'm not sure if that could happen. At the very least I would expect to need to add a lower limit on A_L > 0, but that might be enough.

Best, Thejs
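
For readers who want to check their own .param file along these lines, here is a minimal sketch that lists varying parameters with a missing prior limit. The helper name `missing_limits` is hypothetical and not part of Monte Python; the entries are a subset copied from the .param file above and follow its documented layout `[mean, min, max, 1-sigma, scale, role]`.

```python
def missing_limits(parameters, role="cosmo"):
    """Map each varying parameter of the given role to the limits it lacks.

    Hypothetical helper for illustration; not part of Monte Python.
    """
    flagged = {}
    for name, (mean, lower, upper, sigma, scale, prole) in parameters.items():
        if prole != role or sigma == 0:
            continue  # skip other roles and fixed parameters
        missing = []
        if lower is None:
            missing.append("min")
        if upper is None:
            missing.append("max")
        if missing:
            flagged[name] = missing
    return flagged


# Subset of entries copied from the .param file in this thread
parameters = {
    "tau_reio": [0.0543, 0.004, None, 0.008, 1, "cosmo"],
    "A_L": [1, None, None, 0.5, 1, "cosmo"],
    "cib_index": [-1.3, -1.3, -1.3, 0, 1, "nuisance"],
}

result = missing_limits(parameters)
print(result)
```

A missing limit is not an error in itself; the point is only to see which sampled parameters could wander into a range that the cosmology code may not handle.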

ritase7e commented 2 years ago

Hi Thejs,

Thank you so much for your quick reply. It seems like that was indeed the issue! I have added a lower limit on A_L > 0 and it has been running for 1h40m now, which is more than it ever did before.
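
For reference, the change described here amounts to filling in the `min` slot of the A_L entry. The sketch below assumes a lower limit of exactly 0; the thread only says "A_L > 0" and does not state the value actually used. The `SimpleNamespace` object is a stand-in so the snippet runs on its own; in a real .param file `data` is provided by Monte Python and only the last line is needed.

```python
from types import SimpleNamespace

# Stand-in for the `data` object that Monte Python supplies to .param files
data = SimpleNamespace(parameters={})

# Layout: [mean, min, max, 1-sigma, scale, role]
# Original entry had min=None; the lower limit 0 here is an assumption.
data.parameters['A_L'] = [1, 0, None, 0.5, 1, 'cosmo']
```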

I wonder why this was never an issue with fewer parameters, even when A_L was included, but it definitely seems to have fixed it!

I'll close this issue now; if it turns out it is not fixed, I'll reopen it.

Thanks again! Rita