Open lukashergt opened 4 years ago
Noted, thanks! Will test very soon.
Just to confirm, the point that you are testing is
{'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.09356700887015484, '100*theta_s': 1.0409, 'omega_b': 0.019917603023102904, 'omega_cdm': 0.04158160052940407, 'm_ncdm': 0.06, 'tau_reio': 0.0544, 'N_ncdm': 1, 'N_ur': 2.0328, 'output': 'pCl lCl', 'lensing': 'yes', 'l_max_scalars': 29}
{'A_planck': 0.9964039923851173}
right?
Yes, that is how I have interpreted the process rank 15 output by the error... That is the parameter set that I've tested directly with Class and with an additional Cobaya run (all parameters fixed and a dummy parameter sampled). In both cases I was not able to reproduce the error.
Thanks, I'll give it a try.
If you are unable to reproduce it, it does sound ugly: it may be due to persistence of some memory allocation between clik calls. We'll see...
I sprinkled some additional self.log.debug statements (in _planck_clik_prototype.py's log_likelihood function, just before the call to clik) and got some more helpful output:
test_EE_omegak_3d_debug_with_nans.log
Search for "[21 :" from the bottom, then you'll find the last debug statements from rank 21, which show that the error seems to be related to nans being passed to clik. Here is the reduced output:
2020-06-09 11:24:21,866 [21 : model] Posterior to be computed for parameters {'Omega_k': -0.14091940813704676, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'A_planck': 0.9982462593219659}
2020-06-09 11:24:21,866 [21 : prior] Evaluating prior at array([-0.14091941, 0.02335145, 0.02863166, 0.99824626])
2020-06-09 11:24:21,867 [21 : prior] Got logpriors = [11.95388244568207]
2020-06-09 11:24:21,867 [21 : model] Got input parameters: {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544, 'A_planck': 0.9982462593219659}
2020-06-09 11:24:21,867 [21 : classy] Got parameters {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544}
2020-06-09 11:24:21,867 [21 : classy] Computing new state
2020-06-09 11:24:21,867 [21 : classy] Setting parameters: {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544, 'N_ncdm': 1, 'N_ur': 2.0328, 'output': 'lCl pCl', 'lensing': 'yes', 'l_max_scalars': 29}
2020-06-09 11:24:48,001 [21 : planck_2018_lowl.ee] Got parameters {'A_planck': 0.9982462593219659}
2020-06-09 11:24:48,001 [21 : planck_2018_lowl.ee] Computing new state
2020-06-09 11:24:48,002 [21 : planck_2018_lowl.ee] Calling logp now
2020-06-09 11:24:48,002 [21 : planck_2018_lowl.ee] Got cl = {'ee': array([ 0., 0., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan]), 'bb': array([ 0., 0., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan]), 'pp': array([0.00000000e+00, 0.00000000e+00, 1.17011902e-08, 3.56360975e-09,
1.47036184e-09, 7.22453571e-10, 3.98247801e-10, 2.38223791e-10,
1.51464579e-10, 1.00957867e-10, 6.98689585e-11, 4.91361840e-11,
3.60046913e-11, 2.69636667e-11, 2.05746303e-11, 1.59570984e-11,
1.25538757e-11, 1.00018945e-11, 8.05876718e-12, 6.55896153e-12,
5.38708729e-12, 4.46126736e-12, 3.72245994e-12, 3.12746369e-12,
2.64424851e-12, 2.24853374e-12, 1.92279547e-12, 1.65213682e-12,
1.42652861e-12, 1.23737431e-12]), 'ell': array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])}
2020-06-09 11:24:48,008 [21 : planck_2018_lowl.ee] Call clik now with vector = [0. 0. nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
0.99824626]
--------------------------------------------------------------------------
mpirun noticed that process rank 21 with PID 0 on node lukas-amd3950x exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
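For anyone wanting to repeat the "search for [21 : from the bottom" step on the full log, here is a minimal sketch of that filtering in Python. The function name is made up for illustration, and it only keeps lines that carry the "[rank : component]" prefix, so the continuation lines of multi-line arrays (as in the cl dump above) are dropped.

```python
import re

def last_lines_for_rank(lines, rank, n=20):
    """Return the last n log lines that were emitted by the given MPI rank."""
    # Debug lines in the log above look like:
    #   "2020-06-09 11:24:21,866 [21 : model] message"
    pattern = re.compile(r"\[\s*%d\s*:" % rank)
    matching = [line for line in lines if pattern.search(line)]
    return matching[-n:]

# Usage with a log file on disk:
# with open("test_EE_omegak_3d_debug_with_nans.log") as f:
#     for line in last_lines_for_rank(f.read().splitlines(), rank=21):
#         print(line)
```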
I've tested the parameter set from rank 21 with the following script, where the .yaml file corresponds to the input from my first post.
from cobaya.model import get_model
from cobaya.input import load_input
yaml = load_input('test_omegak_EE_3d.yaml')
model = get_model(yaml)
params = {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544, 'N_ncdm': 1, 'N_ur': 2.0328, 'output': 'lCl pCl', 'lensing': 'yes', 'l_max_scalars': 29, 'A_planck': 0.9982462593219659}
model.logps(params)
This is the corresponding output file: lukas_bug2.log. Same input parameters, but this time no nans in the Cls!!! Is the above script a bad representation of what happens in an actual cobaya run? Is the debug output not accurate enough? Here is the reduced output:
2020-06-09 12:13:19,745 [model] Got input parameters: {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544, 'N_ncdm': 1, 'N_ur': 2.0328, 'output': 'lCl pCl', 'lensing': 'yes', 'l_max_scalars': 29, 'A_planck': 0.9982462593219659}
2020-06-09 12:13:19,745 [classy] Got parameters {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544}
2020-06-09 12:13:19,746 [classy] Computing new state
2020-06-09 12:13:19,746 [classy] Setting parameters: {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544, 'N_ncdm': 1, 'N_ur': 2.0328, 'output': 'lCl pCl', 'lensing': 'yes', 'l_max_scalars': 29}
2020-06-09 12:13:27,545 [planck_2018_lowl.ee] Got parameters {'A_planck': 0.9982462593219659}
2020-06-09 12:13:27,545 [planck_2018_lowl.ee] Computing new state
2020-06-09 12:13:27,545 [planck_2018_lowl.ee] Calling logp now
2020-06-09 12:13:27,545 [planck_2018_lowl.ee] Got cl = {'ee': array([0.00000000e+00, 0.00000000e+00, 8.88570637e-02, 2.87575008e-02,
6.93365805e-03, 1.35500346e-03, 3.89543572e-04, 2.55252386e-04,
2.12223053e-04, 1.75246806e-04, 1.32013035e-04, 9.22617744e-05,
6.92368332e-05, 6.35618262e-05, 6.78107612e-05, 7.52760320e-05,
8.25446245e-05, 8.86942832e-05, 9.42929687e-05, 1.00311519e-04,
1.07226828e-04, 1.14990994e-04, 1.23330598e-04, 1.31928975e-04,
1.40627917e-04, 1.49413901e-04, 1.58332865e-04, 1.67422170e-04,
1.76650264e-04, 1.85974675e-04]), 'bb': array([ 0.00000000e+00, 0.00000000e+00, -5.03177912e-09, 2.78055951e-08,
7.03502256e-08, 1.21584009e-07, 1.80287189e-07, 2.45071316e-07,
3.14416615e-07, 3.86712661e-07, 4.60301200e-07, 5.33519882e-07,
6.04745654e-07, 6.72436578e-07, 7.35170910e-07, 7.91682347e-07,
8.40890486e-07, 8.81925676e-07, 9.14147638e-07, 9.37157391e-07,
9.50802262e-07, 9.55173923e-07, 9.50599641e-07, 9.37627104e-07,
9.17003388e-07, 8.89632701e-07, 8.56626278e-07, 8.19142361e-07,
7.78338632e-07, 7.35468543e-07]), 'pp': array([0.00000000e+00, 0.00000000e+00, 1.17011902e-08, 3.56360975e-09,
1.47036184e-09, 7.22453571e-10, 3.98247801e-10, 2.38223791e-10,
1.51464579e-10, 1.00957867e-10, 6.98689585e-11, 4.91361840e-11,
3.60046913e-11, 2.69636667e-11, 2.05746303e-11, 1.59570984e-11,
1.25538757e-11, 1.00018945e-11, 8.05876718e-12, 6.55896153e-12,
5.38708729e-12, 4.46126736e-12, 3.72245994e-12, 3.12746369e-12,
2.64424851e-12, 2.24853374e-12, 1.92279547e-12, 1.65213682e-12,
1.42652861e-12, 1.23737431e-12]), 'ell': array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])}
2020-06-09 12:13:27,547 [planck_2018_lowl.ee] Call clik now with vector = [0.00000000e+00 0.00000000e+00 8.88570637e-02 2.87575008e-02
6.93365805e-03 1.35500346e-03 3.89543572e-04 2.55252386e-04
2.12223053e-04 1.75246806e-04 1.32013035e-04 9.22617744e-05
6.92368332e-05 6.35618262e-05 6.78107612e-05 7.52760320e-05
8.25446245e-05 8.86942832e-05 9.42929687e-05 1.00311519e-04
1.07226828e-04 1.14990994e-04 1.23330598e-04 1.31928975e-04
1.40627917e-04 1.49413901e-04 1.58332865e-04 1.67422170e-04
1.76650264e-04 1.85974675e-04 9.98246259e-01]
2020-06-09 12:13:27,548 [planck_2018_lowl.ee] Computed log-likelihood = -199.394
Wow! I ran the mini-script again and this time I got nans! For the same script I got different output. Here is the output: lukas_bug2_withnan.log. Running vimdiff on this output and the output from my previous post shows that indeed the only things that changed are the time stamps and the Cls.
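The vimdiff comparison can also be scripted. The following is only a sketch of the idea, not what was actually run: it strips the leading time stamp from each line and reports the lines that still differ. The function name is hypothetical, and the regular expression assumes the time stamp format seen in the logs in this thread.

```python
import difflib
import re

# Leading "YYYY-MM-DD HH:MM:SS,mmm " time stamp, as in the debug logs in this thread.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")

def diff_without_timestamps(lines_a, lines_b):
    """Return the lines that differ between two logs once time stamps are removed."""
    a = [TIMESTAMP.sub("", line) for line in lines_a]
    b = [TIMESTAMP.sub("", line) for line in lines_b]
    # Keep only removed ("- ") and added ("+ ") lines; drop unchanged and hint lines.
    return [d for d in difflib.ndiff(a, b) if d.startswith(("- ", "+ "))]

# Usage with the two attached logs:
# with open("lukas_bug2.log") as fa, open("lukas_bug2_withnan.log") as fb:
#     for line in diff_without_timestamps(fa.read().splitlines(), fb.read().splitlines()):
#         print(line)
```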
When I (obviously) ran it yet again, I didn't get the nans anymore. I ran it another few times. Mostly I get no nans, but once in a while I am getting nans. No idea why.
1) The low-l EE likelihood seems to fail when there are nans in the Cls. 2) For the same input parameters cobaya sometimes gets nans in the Cls and sometimes doesn't.
Reduced output with nans:
2020-06-09 12:46:57,809 [model] Got input parameters: {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544, 'N_ncdm': 1, 'N_ur': 2.0328, 'output': 'lCl pCl', 'lensing': 'yes', 'l_max_scalars': 29, 'A_planck': 0.9982462593219659}
2020-06-09 12:46:57,809 [classy] Got parameters {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544}
2020-06-09 12:46:57,809 [classy] Computing new state
2020-06-09 12:46:57,809 [classy] Setting parameters: {'A_s': 2.101e-09, 'n_s': 0.9649, 'Omega_k': -0.14091940813704676, '100*theta_s': 1.0409, 'omega_b': 0.023351445930932017, 'omega_cdm': 0.028631663112625154, 'm_ncdm': 0.06, 'tau_reio': 0.0544, 'N_ncdm': 1, 'N_ur': 2.0328, 'output': 'lCl pCl', 'lensing': 'yes', 'l_max_scalars': 29}
2020-06-09 12:47:05,386 [planck_2018_lowl.ee] Got parameters {'A_planck': 0.9982462593219659}
2020-06-09 12:47:05,391 [planck_2018_lowl.ee] Computing new state
2020-06-09 12:47:05,391 [planck_2018_lowl.ee] Calling logp now
2020-06-09 12:47:05,391 [planck_2018_lowl.ee] Got cl = {'ee': array([ 0., 0., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan]), 'bb': array([ 0., 0., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan]), 'pp': array([0.00000000e+00, 0.00000000e+00, 1.17011902e-08, 3.56360975e-09,
1.47036184e-09, 7.22453571e-10, 3.98247801e-10, 2.38223791e-10,
1.51464579e-10, 1.00957867e-10, 6.98689585e-11, 4.91361840e-11,
3.60046913e-11, 2.69636667e-11, 2.05746303e-11, 1.59570984e-11,
1.25538757e-11, 1.00018945e-11, 8.05876718e-12, 6.55896153e-12,
5.38708729e-12, 4.46126736e-12, 3.72245994e-12, 3.12746369e-12,
2.64424851e-12, 2.24853374e-12, 1.92279547e-12, 1.65213682e-12,
1.42652861e-12, 1.23737431e-12]), 'ell': array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])}
2020-06-09 12:47:05,393 [planck_2018_lowl.ee] Call clik now with vector = [0. 0. nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
0.99824626]
[lukas-amd3950x:317622] *** Process received signal ***
[lukas-amd3950x:317622] Signal: Segmentation fault (11)
[lukas-amd3950x:317622] Signal code: Address not mapped (1)
[lukas-amd3950x:317622] Failing at address: 0x555aeb4a5ad0
[lukas-amd3950x:317622] [ 0] /usr/lib/libc.so.6(+0x3c3e0)[0x7f03e545a3e0]
[lukas-amd3950x:317622] [ 1] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/libclik.so(simall_lkl+0x18a)[0x7f038213d51e]
[lukas-amd3950x:317622] [ 2] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/libclik.so(lklbs_lkl+0x20c)[0x7f03820ea8c4]
[lukas-amd3950x:317622] [ 3] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/libclik.so(distribution_lkl+0x146)[0x7f03820fdac9]
[lukas-amd3950x:317622] [ 4] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/libclik.so(clik_compute+0x5b)[0x7f03820e941b]
[lukas-amd3950x:317622] [ 5] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/python/site-packages/clik/lkl.cpython-38-x86_64-linux-gnu.so(+0x9737)[0x7f03dcfdd737]
[lukas-amd3950x:317622] [ 6] /home/hergtl/Documents/Projects/PlanckPrj/planck_2018/code/plc_3.0/plc-3.01/lib/python/site-packages/clik/lkl.cpython-38-x86_64-linux-gnu.so(+0x82fa)[0x7f03dcfdc2fa]
[lukas-amd3950x:317622] [ 7] /usr/lib/libpython3.8.so.1.0(_PyObject_MakeTpCall+0x45c)[0x7f03e570c18c]
[lukas-amd3950x:317622] [ 8] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x5108)[0x7f03e5707a78]
[lukas-amd3950x:317622] [ 9] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0xa22)[0x7f03e5701d72]
[lukas-amd3950x:317622] [10] /usr/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x19d)[0x7f03e571387d]
[lukas-amd3950x:317622] [11] /usr/lib/libpython3.8.so.1.0(+0x13e867)[0x7f03e5723867]
[lukas-amd3950x:317622] [12] /usr/lib/libpython3.8.so.1.0(PyObject_Call+0x324)[0x7f03e5726ee4]
[lukas-amd3950x:317622] [13] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x2435)[0x7f03e5704da5]
[lukas-amd3950x:317622] [14] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x304)[0x7f03e5701654]
[lukas-amd3950x:317622] [15] /usr/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x19d)[0x7f03e571387d]
[lukas-amd3950x:317622] [16] /usr/lib/libpython3.8.so.1.0(+0x13e867)[0x7f03e5723867]
[lukas-amd3950x:317622] [17] /usr/lib/libpython3.8.so.1.0(PyObject_Call+0x324)[0x7f03e5726ee4]
[lukas-amd3950x:317622] [18] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x2435)[0x7f03e5704da5]
[lukas-amd3950x:317622] [19] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x304)[0x7f03e5701654]
[lukas-amd3950x:317622] [20] /usr/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x19d)[0x7f03e571387d]
[lukas-amd3950x:317622] [21] /usr/lib/libpython3.8.so.1.0(+0x13e867)[0x7f03e5723867]
[lukas-amd3950x:317622] [22] /usr/lib/libpython3.8.so.1.0(PyObject_Call+0x324)[0x7f03e5726ee4]
[lukas-amd3950x:317622] [23] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x2435)[0x7f03e5704da5]
[lukas-amd3950x:317622] [24] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x304)[0x7f03e5701654]
[lukas-amd3950x:317622] [25] /usr/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x19d)[0x7f03e571387d]
[lukas-amd3950x:317622] [26] /usr/lib/libpython3.8.so.1.0(+0x13e867)[0x7f03e5723867]
[lukas-amd3950x:317622] [27] /usr/lib/libpython3.8.so.1.0(PyObject_Call+0x324)[0x7f03e5726ee4]
[lukas-amd3950x:317622] [28] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x2435)[0x7f03e5704da5]
[lukas-amd3950x:317622] [29] /usr/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0xa22)[0x7f03e5701d72]
[lukas-amd3950x:317622] *** End of error message ***
Segmentation fault (core dumped)
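The backtrace shows the crash inside simall_lkl in libclik.so, reached via clik_compute, right after a vector containing nans was passed in. One possible stop-gap is to check the vector before handing it to clik and to return a log-likelihood of -inf instead. The sketch below is an assumption about how such a guard could look, not Cobaya's actual code: safe_loglike is a hypothetical helper and loglike stands in for the real clik call. It only masks the symptom and does not explain why the Cls are sometimes nan in the first place.

```python
import numpy as np

def safe_loglike(vector, loglike):
    """Call loglike(vector) only if every entry is finite, else return -inf.

    `loglike` is a stand-in for the real clik call (hypothetical here).
    """
    vector = np.asarray(vector, dtype=float)
    if not np.all(np.isfinite(vector)):
        # nan or inf in the Cls: reject the point instead of calling into C code.
        return -np.inf
    return loglike(vector)

# Usage with a dummy likelihood function:
# safe_loglike([0.0, 0.0, np.nan, 0.99824626], lambda v: -0.5 * np.sum(v**2))
```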
I've found that I can prevent (or at least heavily reduce) this non-deterministic behaviour (sometimes producing nans and sometimes not) by suppressing all sorts of multi-threading, i.e. compiling CLASS without OpenMP and setting both MKL_NUM_THREADS=1 and OPENBLAS_NUM_THREADS=1 (in addition to the PolyChord standard OMP_NUM_THREADS=1).
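For the mini-script, the same three variables can also be set from within Python rather than in the shell. This is a sketch under the assumption that the threaded libraries read these variables when they are first loaded, so the call has to come before importing numpy, classy or cobaya. The helper name is made up, and this does not replace compiling CLASS without OpenMP.

```python
import os

# The three variables named above.
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")

def force_single_thread(environ=os.environ):
    """Set all thread-count variables to "1" and return the resulting settings."""
    for name in THREAD_VARS:
        environ[name] = "1"
    return {name: environ[name] for name in THREAD_VARS}

# Call this first; only afterwards import numpy, classy, cobaya, ...
force_single_thread()
```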
Before that I got nans in about 10% of the cases when looping the script posted above.
After compiling without OpenMP and setting those variables to one I did not manage to get nans at all anymore.
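To put a number on how often a run fails, the script can be looped in separate processes, since a segfault kills the whole interpreter. The sketch below is one way to do that, not the loop that was actually used: the function name and the file name run_point.py are hypothetical. On POSIX systems, subprocess reports a process killed by a signal with a negative return code, so a segmentation fault (signal 11) shows up as -11.

```python
import subprocess
import sys
from collections import Counter

def count_return_codes(cmd, n_runs):
    """Run cmd n_runs times and count how often each return code occurs."""
    codes = Counter()
    for _ in range(n_runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        codes[result.returncode] += 1
    return codes

# Usage, with the mini-script saved under the hypothetical name run_point.py:
# codes = count_return_codes([sys.executable, "run_point.py"], n_runs=50)
# print(codes)  # 0 counts clean runs, -11 counts segfaults on POSIX
```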
Hi @JesusTorrado,
sorry to bring this up again, but I never quite managed to fix the remaining problem in #34. The original issue described there was related to a memory leak, which got mostly fixed. However, the segfault described later on persists. I've got a new computing setup and can now test these things locally with a gnu build and can say that this is not an intel issue, I get this problem with both gnu and intel. But with this new setup I was able to get information on which rank/process causes the issue.
This error comes up when running Cobaya with PolyChord, Class, the lowEE likelihood and Omega_k varied. I never got it for flat LCDM and I never got it when I excluded the lowEE likelihood.
Below I'm posting the .yaml file I used, the error message and an overview of my computing setup. I am also attaching the complete output file including debug output.
I have tested the parameter set that seems to have caused the error (from rank 15). Running that parameter set directly with Class causes no errors. I've also tried to fix all those parameters in the .yaml file (while running over a dummy variable) and again did not get any errors. This has me at a loss as to what might be happening.
Full output with debug
test_EE_omegak_3d_debug_witherr.log
Example .yaml file (reduced to 3 varying parameters):
Error message:
My setup: