CobayaSampler / cobaya

Code for Bayesian Analysis
http://cobaya.readthedocs.io/en/latest/

Segmentation fault (11) with PolyChord+Class+lowEE+Omega_k #102

Open lukashergt opened 4 years ago

lukashergt commented 4 years ago

Hi @JesusTorrado,

sorry to bring this up again, but I never quite managed to fix the remaining problem in #34. The original issue described there was related to a memory leak, which got mostly fixed. However, the segfault described later in that thread persists. I have a new computing setup and can now test these things locally with a GNU build, so I can say that this is not an Intel-specific issue: I get the problem with both the GNU and the Intel compilers. With the new setup I was also able to find out which rank/process causes the issue.

This error comes up when running Cobaya with PolyChord, Class, lowEE likelihood and Omega_k varied. I never got it for flat LCDM and I never got it when I excluded the lowEE likelihood.

Below I'm posting the .yaml file I used, the error message and an overview of my computing setup. I am also attaching the complete output file including debug output.

I have tested the parameter set that seems to have caused the error (from rank 15). Running that parameter set directly with Class causes no errors. I've also tried to fix all those parameters in the .yaml file (while running over a dummy variable) and again did not get any errors. This has me at a loss as to what might be happening.


Full output with debug

test_EE_omegak_3d_debug_witherr.log


Example .yaml file (reduced to 3 varying parameters):

debug: true
likelihood:
  planck_2018_lowl.EE:
params:
  A_s: 2.101e-9
  n_s: 0.9649
  Omega_k:
    prior:
      min: -0.15
      max: 0.15
    latex: \Omega_k
  100*theta_s: 1.04090
  omega_b:
    prior:
      min: 0.019
      max: 0.025
    latex: \Omega_\mathrm{b} h^2
  omega_cdm:
    prior:
      min: 0.025
      max: 0.471
    latex: \Omega_\mathrm{c} h^2
  m_ncdm: 0.06
  tau_reio: 0.0544
sampler:
  polychord:
    blocking: 
      - [1, [omega_b, omega_cdm, Omega_k]]
      - [10, [A_planck]]
    nlive: 50
theory:
  classy:
    extra_args:
      N_ncdm: 1
      N_ur: 2.0328

Error message:

[lukas-amd3950x:159002] *** Process received signal ***
[lukas-amd3950x:159002] Signal: Segmentation fault (11)
[lukas-amd3950x:159002] Signal code: Address not mapped (1)
[lukas-amd3950x:159002] Failing at address: 0x5631b9095900
[lukas-amd3950x:159002] [ 0] /usr/lib/libc.so.6(+0x3bd70)[0x7f01f2711d70]
[lukas-amd3950x:159002] [ 1] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/libclik.so(simall_lkl+0x17a)[0x7f01c7dd31ff]
[lukas-amd3950x:159002] [ 2] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/libclik.so(lklbs_lkl+0x20e)[0x7f01c7d808db]
[lukas-amd3950x:159002] [ 3] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/libclik.so(distribution_lkl+0x146)[0x7f01c7d93ab3]
[lukas-amd3950x:159002] [ 4] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/libclik.so(clik_compute+0x5b)[0x7f01c7d7f427]
[lukas-amd3950x:159002] [ 5] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/python/site-packages/clik/lkl.cpython-38-x86_64-linux-gnu.so(+0x9717)[0x7f01d800a717]
[lukas-amd3950x:159002] [ 6] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/python/site-packages/clik/lkl.cpython-38-x86_64-linux-gnu.so(+0x8332)[0x7f01d8009332]
[lukas-amd3950x:159002] [ 7] /usr/lib/libpython3.8.so.1.0(_PyObject_MakeTpCall+0x442)[0x7f01f244c3d2]
[lukas-amd3950x:159002] [ 8] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x4ea1)[0x7f01f2509c51]
[lukas-amd3950x:159002] [ 9] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0xc34)[0x7f01f24f6154]
[lukas-amd3950x:159002] [10] /usr/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x39b)[0x7f01f24f6c7b]
[lukas-amd3950x:159002] [11] /usr/lib/libpython3.8.so.1.0(+0x1e1694)[0x7f01f24f7694]
[lukas-amd3950x:159002] [12] /usr/lib/libpython3.8.so.1.0(PyObject_Call+0x2f8)[0x7f01f2452508]
[lukas-amd3950x:159002] [13] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x2314)[0x7f01f25070c4]
[lukas-amd3950x:159002] [14] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x3d4)[0x7f01f24f58f4]
[lukas-amd3950x:159002] [15] /usr/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x39b)[0x7f01f24f6c7b]
[lukas-amd3950x:159002] [16] /usr/lib/libpython3.8.so.1.0(+0x1e1694)[0x7f01f24f7694]
[lukas-amd3950x:159002] [17] /usr/lib/libpython3.8.so.1.0(PyObject_Call+0x2f8)[0x7f01f2452508]
[lukas-amd3950x:159002] [18] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x2314)[0x7f01f25070c4]
[lukas-amd3950x:159002] [19] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x3d4)[0x7f01f24f58f4]
[lukas-amd3950x:159002] [20] /usr/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x39b)[0x7f01f24f6c7b]
[lukas-amd3950x:159002] [21] /usr/lib/libpython3.8.so.1.0(+0x1e1694)[0x7f01f24f7694]
[lukas-amd3950x:159002] [22] /usr/lib/libpython3.8.so.1.0(PyObject_Call+0x2f8)[0x7f01f2452508]
[lukas-amd3950x:159002] [23] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x2314)[0x7f01f25070c4]
[lukas-amd3950x:159002] [24] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x3d4)[0x7f01f24f58f4]
[lukas-amd3950x:159002] [25] /usr/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x39b)[0x7f01f24f6c7b]
[lukas-amd3950x:159002] [26] /usr/lib/libpython3.8.so.1.0(+0x1e1694)[0x7f01f24f7694]
[lukas-amd3950x:159002] [27] /usr/lib/libpython3.8.so.1.0(PyObject_Call+0x2f8)[0x7f01f2452508]
[lukas-amd3950x:159002] [28] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x2314)[0x7f01f25070c4]
[lukas-amd3950x:159002] [29] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0xc34)[0x7f01f24f6154]
[lukas-amd3950x:159002] *** End of error message ***
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 15 with PID 0 on node lukas-amd3950x exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

My setup:

software            version
gcc & gfortran      9.3.0
mpirun (Open MPI)   4.0.3
python              3.8.2
cobaya              3.0
pypolychord         1.17.1
classy              2.9.3

JesusTorrado commented 4 years ago

Noted, thanks! Will test very soon.

Just to confirm, the point that you are testing is

right?

lukashergt commented 4 years ago

Yes, that is how I have interpreted the process rank 15 output by the error... That is the parameter set that I've tested directly with Class and with an additional Cobaya run (all parameters fixed and a dummy parameter sampled). In both cases I was not able to reproduce the error.

JesusTorrado commented 4 years ago

Thanks, I'll give it a try.

If you are unable to reproduce it, it does sound ugly: it may be due to persistence of some memory allocation between clik calls. We'll see...

lukashergt commented 4 years ago

output from cobaya showing nan

I sprinkled some additional self.log.debug statements (in _planck_clik_prototype.py's log_likelihood function just before the call to clik) and got some more helpful output: test_EE_omegak_3d_debug_with_nans.log

Search for "[21 :" from the bottom and you'll find the last debug statements from rank 21, which show that the error seems to be related to nans being passed to clik. Here is the reduced output:

 2020-06-09 11:24:21,866 [21 : model] Posterior to be computed for parameters {'Omega_k': -0.14091940813704676, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'A_planck': 0.9982462593219659}
 2020-06-09 11:24:21,866 [21 : prior] Evaluating prior at array([-0.14091941,  0.02335145,  0.02863166,  0.99824626])
 2020-06-09 11:24:21,867 [21 : prior] Got logpriors = [11.95388244568207]
 2020-06-09 11:24:21,867 [21 : model] Got input parameters: {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544, 'A_planck': 0.9982462593219659}
 2020-06-09 11:24:21,867 [21 : classy] Got parameters {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544}
 2020-06-09 11:24:21,867 [21 : classy] Computing new state
 2020-06-09 11:24:21,867 [21 : classy] Setting parameters: {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544, 'N_ncdm': 1, 'N_ur': 2.0328, 'output': 'lCl pCl', 'lensing': 'yes', 'l_max_scalars': 29}
 2020-06-09 11:24:48,001 [21 : planck_2018_lowl.ee] Got parameters {'A_planck': 0.9982462593219659}
 2020-06-09 11:24:48,001 [21 : planck_2018_lowl.ee] Computing new state
 2020-06-09 11:24:48,002 [21 : planck_2018_lowl.ee] Calling logp now
 2020-06-09 11:24:48,002 [21 : planck_2018_lowl.ee] Got cl = {'ee': array([ 0.,  0., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan]), 'bb': array([ 0.,  0., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan]), 'pp': array([0.00000000e+00, 0.00000000e+00, 1.17011902e-08, 3.56360975e-09,
       1.47036184e-09, 7.22453571e-10, 3.98247801e-10, 2.38223791e-10,
       1.51464579e-10, 1.00957867e-10, 6.98689585e-11, 4.91361840e-11,
       3.60046913e-11, 2.69636667e-11, 2.05746303e-11, 1.59570984e-11,
       1.25538757e-11, 1.00018945e-11, 8.05876718e-12, 6.55896153e-12,
       5.38708729e-12, 4.46126736e-12, 3.72245994e-12, 3.12746369e-12,
       2.64424851e-12, 2.24853374e-12, 1.92279547e-12, 1.65213682e-12,
       1.42652861e-12, 1.23737431e-12]), 'ell': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])}
 2020-06-09 11:24:48,008 [21 : planck_2018_lowl.ee] Call clik now with vector = [0.         0.                nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
 0.99824626]
--------------------------------------------------------------------------
mpirun noticed that process rank 21 with PID 0 on node lukas-amd3950x exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
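
The log above suggests that the crash happens once a vector containing nans reaches clik. A minimal sketch of a guard that could sit in front of such a call is given below. It is not the Cobaya or clik implementation: `guarded_loglike` and `dummy_loglike` are hypothetical names introduced for illustration, and returning minus infinity for a non-finite input is an assumption about how one might want such points to be treated.

```python
import math


def guarded_loglike(loglike, vector):
    """Call `loglike` only if every entry of `vector` is finite.

    `loglike` stands in for a compiled likelihood call; it is a
    hypothetical callable, not part of any real library API.
    """
    values = [float(x) for x in vector]
    if not all(math.isfinite(x) for x in values):
        # Assumption: a point with nan/inf Cls is treated as having zero
        # likelihood instead of being handed to the compiled code.
        return -math.inf
    return float(loglike(values))


def dummy_loglike(values):
    """Toy likelihood used only to exercise the guard."""
    return -0.5 * sum(x * x for x in values)


# Shortened vectors modelled on the debug output: Cls followed by A_planck.
good = [0.0, 0.0, 8.88570637e-02, 2.87575008e-02, 0.99824626]
bad = [0.0, 0.0, math.nan, math.nan, 0.99824626]

print(guarded_loglike(dummy_loglike, good))
print(guarded_loglike(dummy_loglike, bad))
```

A guard like this would only hide the symptom (the segfault); it says nothing about why the Cls come back as nans in the first place.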

test parameter set

I've tested the parameter set from rank 21 with the following script, where the .yaml file corresponds to the input from my first post.

from cobaya.model import get_model
from cobaya.input import load_input

# Load the same .yaml input as in the first post and build the model from it
yaml = load_input('test_omegak_EE_3d.yaml')
model = get_model(yaml)
# Parameter set copied from the rank 21 debug output
params = {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544, 'N_ncdm': 1, 'N_ur': 2.0328, 'output': 'lCl pCl', 'lensing': 'yes', 'l_max_scalars': 29, 'A_planck': 0.9982462593219659}
# Evaluate the model at this point; the debug log records the Cls passed to clik
model.logps(params)

This is the corresponding output file: lukas_bug2.log. Same input parameters, but this time no nans in the Cls!!! Is the above script a bad representation of what happens in an actual cobaya run? Is the debug output not accurate enough? Here is the reduced output:

 2020-06-09 12:13:19,745 [model] Got input parameters: {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544, 'N_ncdm': 1, 'N_ur': 2.0328, 'output': 'lCl pCl', 'lensing': 'yes', 'l_max_scalars': 29, 'A_planck': 0.9982462593219659}
 2020-06-09 12:13:19,745 [classy] Got parameters {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544}
 2020-06-09 12:13:19,746 [classy] Computing new state
 2020-06-09 12:13:19,746 [classy] Setting parameters: {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544, 'N_ncdm': 1, 'N_ur': 2.0328, 'output': 'lCl pCl', 'lensing': 'yes', 'l_max_scalars': 29}
 2020-06-09 12:13:27,545 [planck_2018_lowl.ee] Got parameters {'A_planck': 0.9982462593219659}
 2020-06-09 12:13:27,545 [planck_2018_lowl.ee] Computing new state
 2020-06-09 12:13:27,545 [planck_2018_lowl.ee] Calling logp now
 2020-06-09 12:13:27,545 [planck_2018_lowl.ee] Got cl = {'ee': array([0.00000000e+00, 0.00000000e+00, 8.88570637e-02, 2.87575008e-02,
       6.93365805e-03, 1.35500346e-03, 3.89543572e-04, 2.55252386e-04,
       2.12223053e-04, 1.75246806e-04, 1.32013035e-04, 9.22617744e-05,
       6.92368332e-05, 6.35618262e-05, 6.78107612e-05, 7.52760320e-05,
       8.25446245e-05, 8.86942832e-05, 9.42929687e-05, 1.00311519e-04,
       1.07226828e-04, 1.14990994e-04, 1.23330598e-04, 1.31928975e-04,
       1.40627917e-04, 1.49413901e-04, 1.58332865e-04, 1.67422170e-04,
       1.76650264e-04, 1.85974675e-04]), 'bb': array([ 0.00000000e+00,  0.00000000e+00, -5.03177912e-09,  2.78055951e-08,
        7.03502256e-08,  1.21584009e-07,  1.80287189e-07,  2.45071316e-07,
        3.14416615e-07,  3.86712661e-07,  4.60301200e-07,  5.33519882e-07,
        6.04745654e-07,  6.72436578e-07,  7.35170910e-07,  7.91682347e-07,
        8.40890486e-07,  8.81925676e-07,  9.14147638e-07,  9.37157391e-07,
        9.50802262e-07,  9.55173923e-07,  9.50599641e-07,  9.37627104e-07,
        9.17003388e-07,  8.89632701e-07,  8.56626278e-07,  8.19142361e-07,
        7.78338632e-07,  7.35468543e-07]), 'pp': array([0.00000000e+00, 0.00000000e+00, 1.17011902e-08, 3.56360975e-09,
       1.47036184e-09, 7.22453571e-10, 3.98247801e-10, 2.38223791e-10,
       1.51464579e-10, 1.00957867e-10, 6.98689585e-11, 4.91361840e-11,
       3.60046913e-11, 2.69636667e-11, 2.05746303e-11, 1.59570984e-11,
       1.25538757e-11, 1.00018945e-11, 8.05876718e-12, 6.55896153e-12,
       5.38708729e-12, 4.46126736e-12, 3.72245994e-12, 3.12746369e-12,
       2.64424851e-12, 2.24853374e-12, 1.92279547e-12, 1.65213682e-12,
       1.42652861e-12, 1.23737431e-12]), 'ell': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])}
 2020-06-09 12:13:27,547 [planck_2018_lowl.ee] Call clik now with vector = [0.00000000e+00 0.00000000e+00 8.88570637e-02 2.87575008e-02
 6.93365805e-03 1.35500346e-03 3.89543572e-04 2.55252386e-04
 2.12223053e-04 1.75246806e-04 1.32013035e-04 9.22617744e-05
 6.92368332e-05 6.35618262e-05 6.78107612e-05 7.52760320e-05
 8.25446245e-05 8.86942832e-05 9.42929687e-05 1.00311519e-04
 1.07226828e-04 1.14990994e-04 1.23330598e-04 1.31928975e-04
 1.40627917e-04 1.49413901e-04 1.58332865e-04 1.67422170e-04
 1.76650264e-04 1.85974675e-04 9.98246259e-01]
 2020-06-09 12:13:27,548 [planck_2018_lowl.ee] Computed log-likelihood = -199.394

lukashergt commented 4 years ago

Wow! I ran the mini-script again and this time I got nans! For the same script I got different output. Here is the output: lukas_bug2_withnan.log. Running vimdiff on this output and the output from my previous post shows that indeed the only things that changed are the time stamps and the Cls.

When I (obviously) ran it yet again, I didn't get the nans anymore. I ran it another few times. Mostly I get no nans, but once in a while I am getting nans. No idea why.

To recapitulate:

1) The low-l EE likelihood seems to fail when there are nans in the Cls.
2) For the same input parameters cobaya sometimes gets nans in the Cls and sometimes doesn't.
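
One way to put a number on point 2 is to repeat the same evaluation many times and count how often nans appear. The sketch below shows only the counting logic. `nan_rate` and `FlakyStub` are hypothetical helpers, and the stub fakes an intermittent failure deterministically rather than calling CLASS or Cobaya.

```python
import math


def nan_rate(compute, n_runs):
    """Call `compute()` `n_runs` times and return the fraction of
    calls whose result contains at least one nan."""
    n_bad = 0
    for _ in range(n_runs):
        result = compute()
        if any(math.isnan(x) for x in result):
            n_bad += 1
    return n_bad / n_runs


class FlakyStub:
    """Hypothetical stand-in for a Cl computation: every `period`-th
    call returns nans, all other calls return finite values."""

    def __init__(self, period):
        self.period = period
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls % self.period == 0:
            return [0.0, 0.0, math.nan, math.nan]
        return [0.0, 0.0, 8.88570637e-02, 2.87575008e-02]


rate = nan_rate(FlakyStub(period=10), n_runs=50)
print(f"fraction of runs with nans: {rate}")
```

In the real setting a segfault would take the Python interpreter down with it, so each trial would probably have to run in its own process instead of inside a single loop like this one.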


Reduced output with nans:

 2020-06-09 12:46:57,809 [model] Got input parameters: {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544, 'N_ncdm': 1, 'N_ur': 2.0328, 'output': 'lCl pCl', 'lensing': 'yes', 'l_max_scalars': 29, 'A_planck': 0.9982462593219659}
 2020-06-09 12:46:57,809 [classy] Got parameters {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544}
 2020-06-09 12:46:57,809 [classy] Computing new state
 2020-06-09 12:46:57,809 [classy] Setting parameters: {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544, 'N_ncdm': 1, 'N_ur': 2.0328, 'output': 'lCl pCl', 'lensing': 'yes', 'l_max_scalars': 29}
 2020-06-09 12:47:05,386 [planck_2018_lowl.ee] Got parameters {'A_planck': 0.9982462593219659}
 2020-06-09 12:47:05,391 [planck_2018_lowl.ee] Computing new state
 2020-06-09 12:47:05,391 [planck_2018_lowl.ee] Calling logp now
 2020-06-09 12:47:05,391 [planck_2018_lowl.ee] Got cl = {'ee': array([ 0.,  0., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan]), 'bb': array([ 0.,  0., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan]), 'pp': array([0.00000000e+00, 0.00000000e+00, 1.17011902e-08, 3.56360975e-09,
       1.47036184e-09, 7.22453571e-10, 3.98247801e-10, 2.38223791e-10,
       1.51464579e-10, 1.00957867e-10, 6.98689585e-11, 4.91361840e-11,
       3.60046913e-11, 2.69636667e-11, 2.05746303e-11, 1.59570984e-11,
       1.25538757e-11, 1.00018945e-11, 8.05876718e-12, 6.55896153e-12,
       5.38708729e-12, 4.46126736e-12, 3.72245994e-12, 3.12746369e-12,
       2.64424851e-12, 2.24853374e-12, 1.92279547e-12, 1.65213682e-12,
       1.42652861e-12, 1.23737431e-12]), 'ell': array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])}
 2020-06-09 12:47:05,393 [planck_2018_lowl.ee] Call clik now with vector = [0.         0.                nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
 0.99824626]
[lukas-amd3950x:317622] *** Process received signal ***
[lukas-amd3950x:317622] Signal: Segmentation fault (11)
[lukas-amd3950x:317622] Signal code: Address not mapped (1)
[lukas-amd3950x:317622] Failing at address: 0x555aeb4a5ad0
[lukas-amd3950x:317622] [ 0] /usr/lib/libc.so.6(+0x3c3e0)[0x7f03e545a3e0]
[lukas-amd3950x:317622] [ 1] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/libclik.so(simall_lkl+0x18a)[0x7f038213d51e]
[lukas-amd3950x:317622] [ 2] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/libclik.so(lklbs_lkl+0x20c)[0x7f03820ea8c4]
[lukas-amd3950x:317622] [ 3] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/libclik.so(distribution_lkl+0x146)[0x7f03820fdac9]
[lukas-amd3950x:317622] [ 4] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/libclik.so(clik_compute+0x5b)[0x7f03820e941b]
[lukas-amd3950x:317622] [ 5] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/python/site-packages/clik/lkl.cpython-38-x86_64-linux-gnu.so(+0x9737)[0x7f03dcfdd737]
[lukas-amd3950x:317622] [ 6] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/python/site-packages/clik/lkl.cpython-38-x86_64-linux-gnu.so(+0x82fa)[0x7f03dcfdc2fa]
[lukas-amd3950x:317622] [ 7] /usr/lib/libpython3.8.so.1.0(_PyObject_MakeTpCall+0x45c)[0x7f03e570c18c]
[lukas-amd3950x:317622] [ 8] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x5108)[0x7f03e5707a78]
[lukas-amd3950x:317622] [ 9] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0xa22)[0x7f03e5701d72]
[lukas-amd3950x:317622] [10] /usr/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x19d)[0x7f03e571387d]
[lukas-amd3950x:317622] [11] /usr/lib/libpython3.8.so.1.0(+0x13e867)[0x7f03e5723867]
[lukas-amd3950x:317622] [12] /usr/lib/libpython3.8.so.1.0(PyObject_Call+0x324)[0x7f03e5726ee4]
[lukas-amd3950x:317622] [13] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x2435)[0x7f03e5704da5]
[lukas-amd3950x:317622] [14] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x304)[0x7f03e5701654]
[lukas-amd3950x:317622] [15] /usr/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x19d)[0x7f03e571387d]
[lukas-amd3950x:317622] [16] /usr/lib/libpython3.8.so.1.0(+0x13e867)[0x7f03e5723867]
[lukas-amd3950x:317622] [17] /usr/lib/libpython3.8.so.1.0(PyObject_Call+0x324)[0x7f03e5726ee4]
[lukas-amd3950x:317622] [18] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x2435)[0x7f03e5704da5]
[lukas-amd3950x:317622] [19] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x304)[0x7f03e5701654]
[lukas-amd3950x:317622] [20] /usr/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x19d)[0x7f03e571387d]
[lukas-amd3950x:317622] [21] /usr/lib/libpython3.8.so.1.0(+0x13e867)[0x7f03e5723867]
[lukas-amd3950x:317622] [22] /usr/lib/libpython3.8.so.1.0(PyObject_Call+0x324)[0x7f03e5726ee4]
[lukas-amd3950x:317622] [23] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x2435)[0x7f03e5704da5]
[lukas-amd3950x:317622] [24] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x304)[0x7f03e5701654]
[lukas-amd3950x:317622] [25] /usr/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x19d)[0x7f03e571387d]
[lukas-amd3950x:317622] [26] /usr/lib/libpython3.8.so.1.0(+0x13e867)[0x7f03e5723867]
[lukas-amd3950x:317622] [27] /usr/lib/libpython3.8.so.1.0(PyObject_Call+0x324)[0x7f03e5726ee4]
[lukas-amd3950x:317622] [28] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x2435)[0x7f03e5704da5]
[lukas-amd3950x:317622] [29] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0xa22)[0x7f03e5701d72]
[lukas-amd3950x:317622] *** End of error message ***
Segmentation fault (core dumped)

lukashergt commented 4 years ago

I've found that I can prevent (or at least heavily reduce) this non-deterministic behaviour (sometimes producing nans, sometimes not) by suppressing all sorts of multi-threading, i.e. compiling CLASS without OpenMP and setting both MKL_NUM_THREADS=1 and OPENBLAS_NUM_THREADS=1 (in addition to the PolyChord standard OMP_NUM_THREADS=1).

Before that I got nans in about 10% of the cases when looping the script posted above.

After compiling without OpenMP and setting those variables to one I did not manage to get nans at all anymore.
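
The environment variables mentioned above can also be set from inside a Python script, as sketched below. This is an illustration only: `limit_threads` is a hypothetical helper, and the sketch assumes the variables are set before any numerical library is imported, since such libraries typically read them when they are loaded. It does not replace compiling CLASS without OpenMP.

```python
import os

# Thread-count variables named in the comment above.
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_threads(env=os.environ, n_threads=1):
    """Set each thread-count variable in `env` to `n_threads`.

    Returns the values that were written, for logging.
    """
    for name in THREAD_VARS:
        env[name] = str(n_threads)
    return {name: env[name] for name in THREAD_VARS}


settings = limit_threads()
print(settings)

# Assumption: numpy, classy, cobaya etc. are imported only after this point.
```

Because every MPI rank runs the script itself, setting the variables at the top of the script applies them in each process without relying on how mpirun forwards the environment.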