brinckmann / montepython_public

Public repository for the Monte Python Code

segmentation fault while running LCDM #339

Closed ClaudioNahmad closed 11 months ago

ClaudioNahmad commented 11 months ago

Hello again,

I've come across a segmentation fault while running standard CLASS (3.2.0) with MontePython (3.6), using the following parameter file:

#BAO
data.experiments=['bao_smallz_2014','bao_eBOSS_DR16_ELG','bao_eBOSS_DR16_gal_QSO','bao_eBOSS_DR16_Lya_auto','bao_eBOSS_DR16_Lya_cross_QSO']

#Table 1, Plik[1], arxiv:1807.06209 (Small 1-sigma)

data.parameters['omega_b']      = [  2.2377,    0.5, None,      0.015, 0.01, 'cosmo']
data.parameters['omega_cdm']    = [ 0.12000,   0.00, None,     0.0012,    1, 'cosmo']
data.parameters['100*theta_s']  = [ 1.04092,   None, None,    0.00031,    1, 'cosmo']
data.parameters['ln10^{10}A_s'] = [   3.044,   None, None,      0.014,    1, 'cosmo']
data.parameters['n_s']          = [  0.9649,   None, None,     0.0042,    1, 'cosmo']
data.parameters['tau_reio']     = [  0.0544,  0.004, None,     0.0073,    1, 'cosmo']

#--------------------------------------------------------------------------------------
# DERIVED PARAMETERS ------------------------------------------------------------------
#--------------------------------------------------------------------------------------
#All datasets
data.parameters['z_reio']           = [1     ,      -1,       -1,  0,     1, 'derived']
#data.parameters['YHe']              = [1     ,    None,     None,  0,     1, 'derived']
data.parameters['H0']               = [0     ,       0,     None,  0,     1, 'derived']
data.parameters['Omega_m']          = [1     ,       0,        1,  0,     1, 'derived']
data.parameters['Omega_Lambda']     = [1     ,       0,        1,  0,     1, 'derived']

#--------------------------------------------------------------------------------------
# CLASS ARGUMENTS (SPECIFIC) ----------------------------------------------------------
#--------------------------------------------------------------------------------------
data.cosmo_arguments['N_ur'] = 2.0328
data.cosmo_arguments['N_ncdm'] = 1
data.cosmo_arguments['m_ncdm'] = 0.06
data.cosmo_arguments['T_ncdm'] = 0.71611

(I tried adding sigma8 as a derived parameter and the code would not run, but that's a separate problem.)
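For reference, each `data.parameters` entry above uses MontePython's six-slot layout: starting value, lower bound, upper bound, proposal 1-sigma, scale, and role. The first four numbers are written in units of the scale, so the value handed to CLASS is the number in the file multiplied by the scale (for `omega_b`, 2.2377 × 0.01). The sketch below only illustrates that reading; `ParamEntry` and `physical` are hypothetical helpers written for this example, not part of MontePython.

```python
from typing import NamedTuple, Optional


class ParamEntry(NamedTuple):
    """Hypothetical container mirroring one data.parameters[...] list."""
    start: float            # starting value, in units of `scale`
    lower: Optional[float]  # lower bound (None = unbounded), same units
    upper: Optional[float]  # upper bound (None = unbounded), same units
    sigma: float            # proposal 1-sigma, same units
    scale: float            # factor applied before the value reaches CLASS
    role: str               # e.g. 'cosmo', 'nuisance' or 'derived'


def physical(value, entry):
    """Convert a number written in the file's units to the value CLASS sees."""
    return None if value is None else value * entry.scale


# The omega_b line from the parameter file above.
omega_b = ParamEntry(2.2377, 0.5, None, 0.015, 0.01, 'cosmo')

print("omega_b start:", physical(omega_b.start, omega_b))
print("omega_b lower:", physical(omega_b.lower, omega_b))
print("omega_b upper:", physical(omega_b.upper, omega_b))
print("omega_b sigma:", physical(omega_b.sigma, omega_b))
```

Read this way, the `omega_b` column in the chain output (values around 1.7) is in the same scaled units as the entry in the parameter file.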

When I run this in parallel on the NERSC cluster, following the advice given in https://github.com/brinckmann/montepython_public/issues/326, MontePython starts running fine, but a segmentation fault occurs after about 20% of the total chain length. Here is the output of one such run (note that srun is the NERSC equivalent of mpirun):

    srun: error: nid004894: task 7: Segmentation fault
srun: Terminating StepId=13283254.0
1  7.10454  1.747858e+00    1.153645e-01    1.039674e+00    2.960997e+00    1.022887e+00    9.437531e-02    1.320914e+01    6.419846e+01    3.238847e-01    6.760275e-01    
2  7.01907  1.731033e+00    1.141989e-01    1.039571e+00    2.943492e+00    1.018156e+00    9.234660e-02    1.305472e+01    6.447217e+01    3.179316e-01    6.819813e-01    
1  6.98494  1.746123e+00    1.149170e-01    1.039513e+00    2.939066e+00    1.013361e+00    8.179241e-02    1.193729e+01    6.430056e+01    3.217326e-01    6.781799e-01    
1  6.95629  1.747785e+00    1.145373e-01    1.039742e+00    2.921275e+00    1.012178e+00    8.179457e-02    1.192044e+01    6.452941e+01    3.185826e-01    6.813304e-01    
1  7.13392  1.774128e+00    1.158158e-01    1.039892e+00    2.927113e+00    1.011146e+00    7.493435e-02    1.113010e+01    6.430662e+01    3.245229e-01    6.753896e-01    
1  6.89157  1.774184e+00    1.150432e-01    1.039548e+00    2.906882e+00    1.014252e+00    7.843363e-02    1.147403e+01    6.448424e+01    3.208807e-01    6.790322e-01    
3  6.92996  1.780248e+00    1.152023e-01    1.039399e+00    2.910193e+00    1.011902e+00    7.275496e-02    1.085916e+01    6.442461e+01    3.220044e-01    6.779083e-01    
1  6.83885  1.759897e+00    1.138318e-01    1.038970e+00    2.924029e+00    1.013108e+00    7.051330e-02    1.066544e+01    6.464202e+01    3.160752e-01    6.838381e-01    
2  6.83289  1.758407e+00    1.142359e-01    1.039063e+00    2.931890e+00    1.012816e+00    7.239067e-02    1.088303e+01    6.450773e+01    3.183279e-01    6.815850e-01    
3  6.88606  1.743626e+00    1.138586e-01    1.038961e+00    2.934195e+00    1.019127e+00    7.180616e-02    1.087039e+01    6.450257e+01    3.171167e-01    6.827963e-01    
1  7.10314  1.740167e+00    1.151530e-01    1.039391e+00    2.918128e+00    1.022178e+00    7.288071e-02    1.103323e+01    6.412672e+01    3.239084e-01    6.760035e-01    
1  6.9962   1.742715e+00    1.149149e-01    1.039816e+00    2.919075e+00    1.017429e+00    7.699835e-02    1.145438e+01    6.437209e+01    3.209306e-01    6.789820e-01    
1  7.14198  1.755199e+00    1.156713e-01    1.039977e+00    2.918109e+00    1.017191e+00    7.726308e-02    1.145016e+01    6.423904e+01    3.243969e-01    6.755153e-01    
1  7.32903  1.757498e+00    1.158646e-01    1.039648e+00    2.940326e+00    1.021306e+00    7.648256e-02    1.136313e+01    6.407946e+01    3.265416e-01    6.733702e-01    
1  6.95643  1.754765e+00    1.149799e-01    1.039858e+00    2.938250e+00    1.016768e+00    8.159590e-02    1.188254e+01    6.445516e+01    3.205505e-01    6.793623e-01    
1  7.00691  1.754636e+00    1.151537e-01    1.040386e+00    2.937438e+00    1.018837e+00    7.878847e-02    1.159756e+01    6.455905e+01    3.199335e-01    6.799796e-01    
2  7.04859  1.751690e+00    1.153089e-01    1.039797e+00    2.937344e+00    1.020711e+00    9.585015e-02    1.333408e+01    6.428862e+01    3.229351e-01    6.769772e-01    
1  7.22586  1.781518e+00    1.159733e-01    1.039758e+00    2.925603e+00    1.022062e+00    9.377021e-02    1.300699e+01    6.426379e+01    3.255157e-01    6.743966e-01    
1  6.73718  1.796211e+00    1.138896e-01    1.039013e+00    2.927152e+00    1.015181e+00    8.268592e-02    1.178146e+01    6.491963e+01    3.143767e-01    6.855374e-01    
1  6.7015   1.786255e+00    1.141768e-01    1.038511e+00    2.944150e+00    1.015621e+00    8.214196e-02    1.177585e+01    6.457088e+01    3.182316e-01    6.816816e-01    
4  8.54374  1.758716e+00    1.168367e-01    1.039225e+00    2.942413e+00    1.022830e+00    9.724635e-02    1.348288e+01    6.359562e+01    3.339628e-01    6.659477e-01    
1  8.04421  1.764236e+00    1.167857e-01    1.039747e+00    2.928475e+00    1.019679e+00    9.757025e-02    1.348569e+01    6.382464e+01    3.315807e-01    6.683304e-01    
1  7.4002   1.734767e+00    1.159847e-01    1.040046e+00    2.910657e+00    1.020085e+00    1.056920e-01    1.439395e+01    6.398535e+01    3.272409e-01    6.726707e-01    
1  7.27499  1.685801e+00    1.147167e-01    1.040199e+00    2.922124e+00    1.018438e+00    1.116396e-01    1.518851e+01    6.413122e+01    3.214803e-01    6.784317e-01    
1  7.21167  1.716638e+00    1.143696e-01    1.040191e+00    2.924989e+00    1.022223e+00    1.149799e-01    1.531064e+01    6.449572e+01    3.177636e-01    6.821493e-01    
1  7.25608  1.710878e+00    1.155303e-01    1.040480e+00    2.924546e+00    1.024899e+00    1.109340e-01    1.500742e+01    6.410818e+01    3.243013e-01    6.756106e-01    
2  7.34742  1.699071e+00    1.155265e-01    1.040130e+00    2.939598e+00    1.026484e+00    1.209598e-01    1.601254e+01    6.390724e+01    3.260454e-01    6.738659e-01    
--> Scanning file ../chains/lcdm_BAO_23_8_4_III/2023-08-04_50000__4.txt : Removed 0 points of burn-in, and first 50 percent, keep 173 steps
                                                2023-08-04_50000__6.txt : Removed 0 points of burn-in, and first 50 percent, keep 208 steps
                                                2023-08-04_50000__8.txt : Removed 3 points of burn-in, and first 50 percent, keep 194 steps
                                                2023-08-04_50000__9.txt : Removed 2 points of burn-in, and first 50 percent, keep 194 steps
                                                2023-08-04_50000__16.txt: Removed 2 points of burn-in, and first 50 percent, keep 209 steps
                                                2023-08-04_50000__11.txt: Removed 1 points of burn-in, and first 50 percent, keep 175 steps
                                                2023-08-04_50000__13.txt: Removed 0 points of burn-in, and first 50 percent, keep 208 steps
                                                2023-08-04_50000__15.txt: Removed 0 points of burn-in, and first 50 percent, keep 208 steps
                                                2023-08-04_50000__10.txt: Removed 2 points of burn-in, and first 50 percent, keep 197 steps
                                                2023-08-04_50000__7.txt : Removed 0 points of burn-in, and first 50 percent, keep 198 steps
                                                2023-08-04_1__1.txt     : Removed 0 points of burn-in, and first 50 percent, keep 1 steps
                                                2023-08-04_50000__2.txt : Removed 0 points of burn-in, and first 50 percent, keep 190 steps
                                                2023-08-04_50000__5.txt : Removed 0 points of burn-in, and first 50 percent, keep 203 steps
                                                2023-08-04_50000__3.txt : Removed 0 points of burn-in, and first 50 percent, keep 220 steps
                                                2023-08-04_50000__1.txt : Removed 0 points of burn-in, and first 50 percent, keep 205 steps
                                                2023-08-04_50000__14.txt: Removed 5 points of burn-in, and first 50 percent, keep 200 steps
                                                2023-08-04_50000__12.txt: Removed 1 points of burn-in, and first 50 percent, keep 212 steps
--> Computing mean values
--> Computing variance
--> Computing convergence criterium (Gelman-Rubin)
 -> R-1 is 5.064547     for  omega_b
           4.493510     for  omega_cdm
           6.979938     for  100*theta_s
           5.121398     for  ln10^{10}A_s
           4.782014     for  n_s
           4.398298     for  tau_reio
           3.711180     for  z_reio
           2.974130     for  H0
           2.294413     for  Omega_m
           0.845521     for  Omega_Lambda
--> Not computing covariance matrix
1  9.05033  1.684320e+00    1.172912e-01    1.040015e+00    2.915725e+00    1.030951e+00    1.360838e-01    1.755533e+01    6.310688e+01    3.384289e-01    6.614802e-01    
3  8.327    1.698683e+00    1.167322e-01    1.039922e+00    2.897397e+00    1.034988e+00    1.335687e-01    1.721140e+01    6.339162e+01    3.343618e-01    6.655481e-01    
1  7.13282  1.710762e+00    1.147365e-01    1.039935e+00    2.900338e+00    1.034343e+00    1.232671e-01    1.612204e+01    6.423011e+01    3.211442e-01    6.787681e-01    
1  7.145    1.717040e+00    1.150437e-01    1.039713e+00    2.910144e+00    1.031740e+00    1.361234e-01    1.724789e+01    6.409186e+01    3.234317e-01    6.764801e-01    
1  7.54727  1.705781e+00    1.137792e-01    1.040263e+00    2.924273e+00    1.026623e+00    1.341791e-01    1.709628e+01    6.465882e+01    3.144907e-01    6.854226e-01    
1  7.23403  1.718502e+00    1.155102e-01    1.040232e+00    2.896776e+00    1.031926e+00    1.465723e-01    1.816879e+01    6.409518e+01    3.245694e-01    6.753425e-01    
1  7.25335  1.714507e+00    1.139448e-01    1.039875e+00    2.867276e+00    1.035308e+00    1.408341e-01    1.762973e+01    6.453866e+01    3.162699e-01    6.836431e-01    
1  7.45451  1.722356e+00    1.134911e-01    1.039857e+00    2.866681e+00    1.040100e+00    1.401025e-01    1.749496e+01    6.476573e+01    3.131616e-01    6.867520e-01    
1  7.49238  1.720438e+00    1.132892e-01    1.039692e+00    2.852360e+00    1.039814e+00    1.338219e-01    1.694888e+01    6.477484e+01    3.125466e-01    6.873671e-01    
2  7.05399  1.701541e+00    1.141879e-01    1.038902e+00    2.834800e+00    1.040121e+00    1.230401e-01    1.613467e+01    6.403571e+01    3.215342e-01    6.783775e-01    
slurmstepd: error: *** STEP 13283254.0 ON nid004894 CANCELLED AT 2023-08-04T21:32:02 ***
1  6.07917  2.230879e+00    1.180627e-01    3  6.1487   1.958242e+00    1.133906e-01    1.037226e+00    3.345435e+00    1  5.88539  1  6.31449  2.234732e+00    1.119270e-01    1  7.00467  1.705850e+00    1.137555e-01    1.038759e+00    2.831574e+00    1.044565e+00    1.233598e-01    1  5.79178  2.057129e+00    1  6.772    2.106157e+00    1.089912e-01    1.032566e+00    2.841386e+00    9.621017e-01    3  6.27312  2.378358e+00    1.194595e-01    1.043443e+00    1  7.05429  2.012685e+00    1.215514e-01    1  5.75535  2.313848e+00    2  5.94149  2.464677e+00    1.227907e-01    1.044645e+00    3.000874e+00    1.046368e+00    9.024250e-03                                                    2023-08-04_50000__16.txt: Removed 2 points of burn-in, and first 50 percent, 1  5.73873 2.686884e+00    1.221447e-01    1.042043e+00    3.255561e+00    1  6.6395   srun: error: nid004894: tasks 0-6,8-15: Terminated
srun: Force Terminated StepId=13283254.0

The block shows several srun errors, most notably "srun: error: nid004894: task 7: Segmentation fault" at the very beginning, plus one slurmstepd error.
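To make the chain lines in that output easier to inspect, here is a minimal sketch that parses them into named columns and reports the range each parameter covers. It assumes the columns are multiplicity, -ln(L), and then the parameters in the order of the R-1 listing in the log; `COLUMNS`, `parse_chain_lines` and `parameter_ranges` are hypothetical names for this example. The sample rows are copied from the output above, together with one truncated fragment from the interleaved line to show that incomplete rows are skipped.

```python
# Assumed column order: multiplicity, -ln(L), then the parameters in the
# same order as the R-1 listing printed in the log.
COLUMNS = [
    "multiplicity", "neg_loglike",
    "omega_b", "omega_cdm", "100*theta_s", "ln10^{10}A_s", "n_s", "tau_reio",
    "z_reio", "H0", "Omega_m", "Omega_Lambda",
]

# Three complete chain lines and one truncated fragment from the log.
SAMPLE = """\
1  7.10454  1.747858e+00    1.153645e-01    1.039674e+00    2.960997e+00    1.022887e+00    9.437531e-02    1.320914e+01    6.419846e+01    3.238847e-01    6.760275e-01
1  7.23403  1.718502e+00    1.155102e-01    1.040232e+00    2.896776e+00    1.031926e+00    1.465723e-01    1.816879e+01    6.409518e+01    3.245694e-01    6.753425e-01
1  7.45451  1.722356e+00    1.134911e-01    1.039857e+00    2.866681e+00    1.040100e+00    1.401025e-01    1.749496e+01    6.476573e+01    3.131616e-01    6.867520e-01
1  6.07917  2.230879e+00    1.180627e-01
"""


def parse_chain_lines(text):
    """Return one dict per complete chain line; skip anything else."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # truncated, interleaved, or non-chain line
        try:
            values = [float(field) for field in fields]
        except ValueError:
            continue  # a log message that happens to have 12 tokens
        rows.append(dict(zip(COLUMNS, values)))
    return rows


def parameter_ranges(rows):
    """Map each parameter name to its (min, max) over the parsed rows."""
    if not rows:
        return {}
    return {
        name: (min(row[name] for row in rows), max(row[name] for row in rows))
        for name in COLUMNS[2:]
    }


rows = parse_chain_lines(SAMPLE)
for name, (low, high) in parameter_ranges(rows).items():
    print(f"{name:>14s}: {low:.6g} .. {high:.6g}")
```

Run over the full chain files, a summary like this shows how far the points being sampled have moved from the starting values in the parameter file. In the lines printed above, for example, tau_reio goes from about 0.09 to about 0.15.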

Here is the segmentation fault from another run:

2023-08-04_50000__14.txt: Removed 1 points of burn-in, and first 50 percent, keep 217 steps
                                               2023-08-04_50000__12.txt: Removed 2 points of burn-in, and first 50 percent, keep 179 steps
--> Computing mean values
--> Computing variance
                                               2023-08-04_50000__5.txt : Removed 1 points of burn-in, srun: error: nid004710: task 4: Segmentation fault
srun: Terminating StepId=13281915.0
slurmstepd: error: *** STEP 13281915.0 ON nid004710 CANCELLED AT 2023-08-04T21:04:41 ***
3  6.53696  2.225775e+00    3  7.30564  2.350597e+00    1.323462e-01    1  5.84873  2.729844e+00    1.245072e-01    1.045358e+00    1.194167e+00    1.117885e+00    1.466900e-01    1.384225e+01    2.472673e-01    7.160125e+01    2.973621e-01    1  8.53398  2.189230e+00    1.208165e-01    2  6.08287  3.485446e+00    1.326106e-01    1.046217e+00    1.793728e+00    1.074078e+00    4.429926e-02    5.109512e+00    2.494985e-01    7.677745e+01    2.851832e-01    4  6.12312  2.360060e+00    1.216769e-01    1.045504e+00    3.856514e+00    5.826946e-01    1.158842e-01    1.278098e+01    2.459076e-01    6.912432e+01    3.053922e-01    8  6.83849  1  7.02402  2.548078e+00    1.158565e-01    1.038777e+00    3.759350e+00    7.579524e-01    1  7.07087  2.214242e+00    1.205178e-01    1.046887e+00    4.839758e+00    13  6.53993 2.068536e+00    1.012779e-01    1.017050e+00    3  8.55225  2.864141e+00    1.041253e-01    1.016006e+00    srun: error: nid004710: tasks 0-3,5-15: Terminated
srun: Force Terminated StepId=13281915.0

I have tried changing the prior 1-sigma values and the lower limits, but nothing changes. Any hint on what might be happening?

Thanks!