cmbant / CosmoMC

MCMC parameter sampling code
https://cosmologist.info/cosmomc/

MPI_Init error while running CosmoMC on Cori@NERSC #21

Closed kumasura closed 5 years ago

kumasura commented 5 years ago

I have successfully compiled CosmoMC on Cori@NERSC. However, running "mpirun -np 1 ./cosmomc test_planck.ini" crashes with the following output:

kumasura@cori09:~/CosmoMC-Nov2016> mpirun -np 1 ./cosmomc test_planck.ini
[Thu Apr 25 08:19:11 2019] [unknown] Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(638).......: PMI2 init failed: 1
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
cosmomc            00000000006A7A84  for__signal_handl  Unknown     Unknown
libpthread-2.22.s  00002AAAB16D1C10  Unknown            Unknown     Unknown
libc-2.22.so       00002AAAB1912F67  gsignal            Unknown     Unknown
libc-2.22.so       00002AAAB191433A  abort              Unknown     Unknown
libmpich_intel.so  00002AAAB0A66998  Unknown            Unknown     Unknown
libmpich_intel.so  00002AAAB09EFA32  MPIR_Handle_fatal  Unknown     Unknown
libmpich_intel.so  00002AAAB09EFB26  MPIR_Err_return_c  Unknown     Unknown
libmpich_intel.so  00002AAAB09746B4  MPI_Init           Unknown     Unknown
libmpich_intel.so  00002AAAB09C1A07  MPI_INIT           Unknown     Unknown
cosmomc            00000000005A2C33  Unknown            Unknown     Unknown
cosmomc            0000000000410FDE  Unknown            Unknown     Unknown
libc-2.22.so       00002AAAB18FE725  __libc_start_main  Unknown     Unknown
cosmomc            0000000000410EE9  Unknown            Unknown     Unknown

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 0 with PID 0 on node cori09 exited on signal 6 (Aborted).

The modules loaded in the environment are:

kumasura@cori09:~/CosmoMC-Nov2016> module list
Currently Loaded Modulefiles:
 1) modules/3.2.10.6
 2) nsg/1.2.0
 3) intel/18.0.1.163
 4) craype-network-aries
 5) craype/2.5.15
 6) cray-libsci/18.07.1
 7) udreg/2.3.2-6.0.7.1_5.13g5196236.ari
 8) ugni/6.0.14.0-6.0.7.1_3.13__gea11d3d.ari
 9) pmi/5.0.14
10) dmapp/7.1.1-6.0.7.1_5.45g5a674e0.ari
11) gni-headers/5.0.12.0-6.0.7.1_3.11g3b1768f.ari
12) xpmem/2.2.15-6.0.7.1_5.11__g7549d06.ari
13) job/2.2.3-6.0.7.1_5.43g6c4e934.ari
14) dvs/2.7_2.2.118-6.0.7.1_10.1g58b37a2
15) alps/6.6.43-6.0.7.1_5.45__ga796da32.ari
16) rca/2.2.18-6.0.7.1_5.47g2aa4f39.ari
17) atp/2.1.3
18) PrgEnv-intel/6.0.4
19) craype-haswell
20) cray-mpich/7.7.3
21) altd/2.0
22) darshan/3.1.4
23) openmpi/3.1.3
kumasura@cori09:~/CosmoMC-Nov2016>
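
The listing above includes both cray-mpich/7.7.3 and openmpi/3.1.3, and the traceback passes through libmpich_intel.so even though the job was started with mpirun. One thing that may be worth checking, without claiming it is the cause here, is whether more than one MPI implementation is loaded at the same time. The sketch below is only an illustration: `count_mpi_modules` is a hypothetical helper, and the name patterns it matches are assumptions about how MPI modules are usually named.

```shell
# Hypothetical helper: read `module list` output on stdin and print how many
# distinct MPI-looking modules it contains. The name patterns are assumptions.
count_mpi_modules() {
    # Split on whitespace, keep tokens that look like MPI modules,
    # de-duplicate, and count. The final tr strips padding some wc versions add.
    tr -s '[:space:]' '\n' \
        | grep -E '^(cray-mpich|openmpi|mpich|impi)/' \
        | sort -u \
        | wc -l \
        | tr -d ' '
}

# Example using three entries copied from the listing above.
printf '20) cray-mpich/7.7.3 21) altd/2.0 23) openmpi/3.1.3\n' | count_mpi_modules
```

In an interactive session this could be fed with `module list 2>&1 | count_mpi_modules`; the redirect is there because `module list` often writes to stderr. A result greater than 1 only suggests that the launcher and the linked MPI library might not match.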

cmbant commented 5 years ago

I doubt this is a CosmoMC problem. Can you run other MPI programs from the command line like that on Cori?

kumasura commented 5 years ago

No, I am not able to run other programs either. However, I tried running the same job with srun, because a note on the NERSC web portal states that there is no 'mpirun' command (which is used by many MPI implementations) on Cori. But even with srun I get the following error:

Number of MPI processes: 2
file_root:test
Random seeds: 13219, 10873 rand_inst: 1
Random seeds: 13325, 10874 rand_inst: 2
Using clik with likelihood file ./data/clik/hi_l/plik/plik_dx11dr2_HM_v18_TT.clik
TT from l=0 to l= 2508
Clik will run with the following nuisance parameters:
A_cib_217 cib_index xi_sz_cib A_sz ps_A_100_100 ps_A_143_143 ps_A_143_217 ps_A_217_217 ksz_norm gal545_A_100 gal545_A_143 gal545_A_143_217 gal545_A_217 calib_100T calib_217T A_planck
Using clik with likelihood file ./data/clik/low_l/bflike/lowl_SMW_70_dx11d_2014_10_03_v5c_Ap.clik
TT from l=0 to l= 2508
forrtl: severe (257): formatted I/O to unit open for unformatted transfers, unit 42, file /global/u2/k/kumasura/plc_2.0/low_l/bflike/lowl_SMW_70_dx11d_2014_10_03_v5c_Ap.clik/clik/lkl_0/_external/.//params_bflike.ini
Image              PC                Routine            Line        Source
cosmomc            000000000066991E  for__io_return     Unknown     Unknown
libifcoremt.so.5   00002AAAB453870B  for_read_seq_nml   Unknown     Unknown
libclik.so         00002AAAB215535B  bflike_smw_mp_ini  Unknown     Unknown
libclik.so         00002AAAB2113CD3  bflike_smwextra    Unknown     Unknown
libclik.so         00002AAAB20F16E7  clik_bflike_smw_i  Unknown     Unknown
libclik.so         00002AAAB20BD710  clik_lklobject_in  Unknown     Unknown
libclik.so         00002AAAB20B4ED3  clik_init          Unknown     Unknown
libclik_f90.so     00002AAAAACD1754  fortran_clik_init  Unknown     Unknown
libclik_f90.so     00002AAAAACD54A4  clik_mp_clik_init  Unknown     Unknown
cosmomc            000000000050762A  Unknown            Unknown     Unknown
cosmomc            0000000000504D8A  Unknown            Unknown     Unknown
cosmomc            000000000055A21A  Unknown            Unknown     Unknown
cosmomc            000000000058F04D  Unknown            Unknown     Unknown
cosmomc            000000000059847A  Unknown            Unknown     Unknown
cosmomc            0000000000410E5E  Unknown            Unknown     Unknown
libc-2.22.so       00002AAAB18FE725  __libc_start_main  Unknown     Unknown
cosmomc            0000000000410D69  Unknown            Unknown     Unknown
forrtl: severe (257): formatted I/O to unit open for unformatted transfers, unit 42, file /global/u2/k/kumasura/plc_2.0/low_l/bflike/lowl_SMW_70_dx11d_2014_10_03_v5c_Ap.clik/clik/lkl_0/_external/.//params_bflike.ini
Image              PC                Routine            Line        Source
cosmomc            000000000066991E  for__io_return     Unknown     Unknown
libifcoremt.so.5   00002AAAB453870B  for_read_seq_nml   Unknown     Unknown
libclik.so         00002AAAB215535B  bflike_smw_mp_ini  Unknown     Unknown
libclik.so         00002AAAB2113CD3  bflike_smwextra    Unknown     Unknown
libclik.so         00002AAAB20F16E7  clik_bflike_smw_i  Unknown     Unknown
libclik.so         00002AAAB20BD710  clik_lklobject_in  Unknown     Unknown
libclik.so         00002AAAB20B4ED3  clik_init          Unknown     Unknown
libclik_f90.so     00002AAAAACD1754  fortran_clik_init  Unknown     Unknown
libclik_f90.so     00002AAAAACD54A4  clik_mp_clik_init  Unknown     Unknown
cosmomc            000000000050762A  Unknown            Unknown     Unknown
cosmomc            0000000000504D8A  Unknown            Unknown     Unknown
cosmomc            000000000055A21A  Unknown            Unknown     Unknown
cosmomc            000000000058F04D  Unknown            Unknown     Unknown
cosmomc            000000000059847A  Unknown            Unknown     Unknown
cosmomc            0000000000410E5E  Unknown            Unknown     Unknown
libc-2.22.so       00002AAAB18FE725  __libc_start_main  Unknown     Unknown
cosmomc            0000000000410D69  Unknown            Unknown     Unknown

clik version 723c1a4b0580
smica
Checking likelihood './data/clik/hi_l/plik/plik_dx11dr2_HM_v18_TT.clik' on test data. got -380.979 expected -380.979 (diff -8.68135e-09)

clik version 723c1a4b0580
smica
Checking likelihood './data/clik/hi_l/plik/plik_dx11dr2_HM_v18_TT.clik' on test data. got -380.979 expected -380.979 (diff -8.68135e-09)

srun: error: nid01291: task 1: Exited with exit code 1
srun: Terminating job step 20792771.0
srun: error: nid01290: task 0: Exited with exit code 1

I tried it with CosmoMC 2015.

Please let me know whether this is a compilation error or whether I am missing something.
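
The launcher choice described in this comment (srun where Slurm provides it, mpirun otherwise) can be sketched as a small helper. This is a minimal illustration only: `launch_cmd` is a hypothetical function, it prints the command instead of running it, and it assumes the executable is `./cosmomc` in the current directory as in the commands quoted above.

```shell
# Hypothetical helper: print the command that would start N MPI processes,
# preferring srun when it is on PATH and falling back to mpirun otherwise.
launch_cmd() {
    nprocs=$1   # number of MPI processes
    ini=$2      # CosmoMC .ini file
    if command -v srun >/dev/null 2>&1; then
        echo "srun -n $nprocs ./cosmomc $ini"
    else
        echo "mpirun -np $nprocs ./cosmomc $ini"
    fi
}

# Print the command for two processes and the test_planck.ini file.
launch_cmd 2 test_planck.ini
```

On a Slurm system, srun normally has to be run inside a job allocation, so the printed command would typically go into a batch script or an interactive allocation.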

cmbant commented 5 years ago

I don't know what the issue is, but it looks like it is inside the clik code rather than the CosmoMC code, so I am closing this for now, since you seem to have posted a duplicate issue on CosmoCoffee.
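
One way to see where a traceback places the failure is to reduce it to the sequence of images (first column) it passes through. The sketch below assumes the whitespace-separated "Image PC Routine Line Source" layout of the forrtl traceback; `trace_images` is a hypothetical helper, and the sample frames are copied from the trace posted above.

```shell
# Hypothetical helper: read traceback frames on stdin and print the image
# (first column) of each frame, collapsing consecutive repeats.
trace_images() {
    awk 'NF >= 2 { print $1 }' | uniq
}

# A few frames copied from the traceback above, top of the stack first.
trace_images <<'EOF'
libifcoremt.so.5 00002AAAB453870B for_read_seq_nml Unknown Unknown
libclik.so 00002AAAB215535B bflike_smw_mp_ini Unknown Unknown
libclik.so 00002AAAB2113CD3 bflike_smwextra Unknown Unknown
libclik_f90.so 00002AAAAACD1754 fortran_clik_init Unknown Unknown
cosmomc 000000000050762A Unknown Unknown Unknown
EOF
```

For these frames the helper prints libifcoremt.so.5, libclik.so, libclik_f90.so and cosmomc, one per line, which shows the failing read being reached through the clik libraries.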

ajcosmology commented 2 years ago


Did you get it resolved? I am facing the same problem and don't know how to resolve it. Could you please share the solution? Thanks, Alex