deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0

kpar calculation error #4783

pxlxingliang opened this issue 1 month ago

pxlxingliang commented 1 month ago

Describe the bug

When I set kpar=4 in an LCAO calculation, ABACUS aborts abnormally during the SCF:

 * * * * * *
 << Start SCF iteration.
 ITER      TMAG       AMAG        ETOT/eV          EDIFF/eV         DRHO     TIME/s
 GE1      3.15e+01   3.29e+01  -1.01320040e+05   0.00000000e+00   6.8200e-02  25.27
 GE2      3.36e+01   3.44e+01  -1.01333408e+05  -1.33677580e+01   3.2150e-02  23.02
 GE3      3.37e+01   3.50e+01  -1.01335339e+05  -1.93134518e+00   1.7747e-02  23.00
 GE4      3.47e+01   3.60e+01  -1.01335678e+05  -3.38426245e-01   1.0694e-02  22.99
 GE5      3.47e+01   3.66e+01  -1.01335964e+05  -2.86198516e-01   5.7317e-03  22.91
 GE6      3.49e+01   3.67e+01  -1.01335999e+05  -3.54613168e-02   3.6981e-03  23.06
 GE7      3.49e+01   3.68e+01  -1.01335992e+05   7.67498555e-03   2.6695e-03  23.00
 GE8      3.48e+01   3.68e+01  -1.01335997e+05  -4.89171828e-03   1.6578e-03  23.00
 GE9      3.47e+01   3.68e+01  -1.01336001e+05  -4.65089043e-03   3.4829e-04  23.00
 GE10     3.47e+01   3.68e+01  -1.01336001e+05  -1.05952656e-04   2.6453e-04  22.94
 GE11     3.48e+01   3.68e+01  -1.01336001e+05  -8.33600483e-05   1.2761e-04  22.96
 GE12     3.48e+01   3.68e+01  -1.01336001e+05  -1.66829337e-05   5.5800e-05  22.93
 GE13     3.48e+01   3.68e+01  -1.01336001e+05  -2.45205620e-06   3.6140e-05  22.95
 GE14     3.48e+01   3.68e+01  -1.01336001e+05  -1.84446519e-06   2.0335e-05  22.99
 GE15     3.48e+01   3.68e+01  -1.01336001e+05  -4.00123322e-07   1.1107e-05  22.99
 GE16     3.48e+01   3.68e+01  -1.01336001e+05  -2.39393406e-07   6.4698e-06  22.98
 GE17     3.48e+01   3.68e+01  -1.01336001e+05  -9.89573590e-08   4.1572e-06  23.00
 GE18     3.48e+01   3.68e+01  -1.01336001e+05  -2.81515558e-08   2.7041e-06  22.95
 GE19     3.48e+01   3.68e+01  -1.01336001e+05  -1.12234994e-08   1.6258e-06  23.01
 GE20     3.48e+01   3.68e+01  -1.01336001e+05  -2.37586757e-09   8.2842e-07  22.94
 GE21     3.48e+01   3.68e+01  -1.01336001e+05   1.96751533e-09   5.5777e-07  22.97

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 38 RUNNING AT dp-lbg-14037-13933528
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

kpar.zip
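The full inputs are attached in kpar.zip. For orientation, a minimal sketch of the relevant INPUT settings (only kpar and basis_type follow from this report; the remaining values are illustrative):

```
INPUT_PARAMETERS
calculation   scf
basis_type    lcao
ks_solver     genelpa
nspin         2
kpar          4
```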

Expected behavior

No response

To Reproduce

No response

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

pxlxingliang commented 1 month ago

I have tested with the GNU image, and the calculation succeeds:

  * * * * * *
 << Start SCF iteration.
 ITER      TMAG       AMAG        ETOT/eV          EDIFF/eV         DRHO     TIME/s
 GE1      3.15e+01   3.29e+01  -1.01320040e+05   0.00000000e+00   6.8200e-02  41.88
 GE2      3.36e+01   3.44e+01  -1.01333408e+05  -1.33677580e+01   3.2150e-02  38.56
 GE3      3.37e+01   3.50e+01  -1.01335339e+05  -1.93134518e+00   1.7747e-02  38.48
 GE4      3.47e+01   3.60e+01  -1.01335678e+05  -3.38426245e-01   1.0694e-02  38.60
 GE5      3.47e+01   3.66e+01  -1.01335964e+05  -2.86198516e-01   5.7317e-03  38.54
 GE6      3.49e+01   3.67e+01  -1.01335999e+05  -3.54613167e-02   3.6981e-03  38.46
 GE7      3.49e+01   3.68e+01  -1.01335992e+05   7.67498552e-03   2.6695e-03  38.53
 GE8      3.48e+01   3.68e+01  -1.01335997e+05  -4.89171834e-03   1.6578e-03  38.67
 GE9      3.47e+01   3.68e+01  -1.01336001e+05  -4.65089046e-03   3.4829e-04  38.40
 GE10     3.47e+01   3.68e+01  -1.01336001e+05  -1.05954475e-04   2.6453e-04  38.60
 GE11     3.48e+01   3.68e+01  -1.01336001e+05  -8.33565093e-05   1.2761e-04  38.98
 GE12     3.48e+01   3.68e+01  -1.01336001e+05  -1.66848022e-05   5.5800e-05  40.68
 GE13     3.48e+01   3.68e+01  -1.01336001e+05  -2.45179633e-06   3.6140e-05  38.90
 GE14     3.48e+01   3.68e+01  -1.01336001e+05  -1.84424245e-06   2.0335e-05  39.04
 GE15     3.48e+01   3.68e+01  -1.01336001e+05  -4.00383182e-07   1.1107e-05  39.05
 GE16     3.48e+01   3.68e+01  -1.01336001e+05  -2.39467652e-07   6.4698e-06  39.00
 GE17     3.48e+01   3.68e+01  -1.01336001e+05  -9.88088673e-08   4.1572e-06  39.08
 GE18     3.48e+01   3.68e+01  -1.01336001e+05  -2.83371705e-08   2.7041e-06  39.06
 GE19     3.48e+01   3.68e+01  -1.01336001e+05  -1.10255104e-08   1.6258e-06  41.39
 GE20     3.48e+01   3.68e+01  -1.01336001e+05  -2.43773912e-09   8.2842e-07  42.08
 GE21     3.48e+01   3.68e+01  -1.01336001e+05   1.84377223e-09   5.5777e-07  42.15
 GE22     3.48e+01   3.68e+01  -1.01336001e+05  -1.23743102e-11   4.2427e-07  40.10
 GE23     3.48e+01   3.68e+01  -1.01336001e+05  -3.19257204e-09   2.1143e-07  40.38
 GE24     3.48e+01   3.68e+01  -1.01336001e+05   1.10131361e-09   1.0338e-07  43.68
 GE25     3.48e+01   3.68e+01  -1.01336001e+05   7.42458615e-10   8.7599e-08  43.41
 >> Leave SCF iteration.
 * * * * * *
hongriTianqi commented 1 month ago

The error is:

Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2255: comm->shm_numa_layout[my_numa_node].base_addr
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2255: comm->shm_numa_layout[my_numa_node].base_addr
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7f706faaa06c]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f706f453f01]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x38e694) [0x7f706f18e694]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x221c66) [0x7f706f021c66]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x256d8c) [0x7f706f056d8c]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x26d930) [0x7f706f06d930]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x23e7a1) [0x7f706f03e7a1]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x21cce3) [0x7f706f01cce3]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x392daa) [0x7f706f192daa]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(MPI_Bcast+0x417) [0x7f706ef83917]
/opt/intel/oneapi/mkl/latest/lib/libmkl_blacs_intelmpi_lp64.so.2(MKLMPI_Bcast+0x4d) [0x7f708bf72cad]
/opt/intel/oneapi/mkl/latest/lib/libmkl_blacs_intelmpi_lp64.so.2(Czgebs2d+0x133) [0x7f708bf67883]
/opt/intel/oneapi/mkl/latest/lib/libmkl_scalapack_lp64.so.2(PB_CInV+0xa90) [0x7f7074005130]
/opt/intel/oneapi/mkl/latest/lib/libmkl_scalapack_lp64.so.2(PB_Cpsyr2kA+0x2360) [0x7f70740361b0]
/opt/intel/oneapi/mkl/latest/lib/libmkl_scalapack_lp64.so.2(pzher2k_+0xba0) [0x7f70740b6ea0]
/opt/intel/oneapi/mkl/latest/lib/libmkl_scalapack_lp64.so.2(pzhegst_+0x130e) [0x7f7073def6be]
/opt/intel/oneapi/mkl/latest/lib/libmkl_scalapack_lp64.so.2(pzhengst_+0x614) [0x7f7073df1d74]
/opt/intel/oneapi/mkl/latest/lib/libmkl_scalapack_lp64.so.2(pzhegvx_+0xd83) [0x7f7073df0cf3]
abacus() [0x7d476b]
abacus() [0x7d1996]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7fce9b4aa06c]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7fce9ae53f01]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x38e694) [0x7fce9ab8e694]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x221c66) [0x7fce9aa21c66]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x256d8c) [0x7fce9aa56d8c]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x26d930) [0x7fce9aa6d930]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x23e7a1) [0x7fce9aa3e7a1]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x21cce3) [0x7fce9aa1cce3]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x392daa) [0x7fce9ab92daa]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(MPI_Bcast+0x417) [0x7fce9a983917]
/opt/intel/oneapi/mkl/latest/lib/libmkl_blacs_intelmpi_lp64.so.2(MKLMPI_Bcast+0x4d) [0x7fceb789ecad]
/opt/intel/oneapi/mkl/latest/lib/libmkl_blacs_intelmpi_lp64.so.2(Czgebs2d+0x133) [0x7fceb7893883]
/opt/intel/oneapi/mkl/latest/lib/libmkl_scalapack_lp64.so.2(PB_CInV+0xa90) [0x7fce9fa05130]
/opt/intel/oneapi/mkl/latest/lib/libmkl_scalapack_lp64.so.2(PB_Cpsyr2kA+0x2360) [0x7fce9fa361b0]
/opt/intel/oneapi/mkl/latest/lib/libmkl_scalapack_lp64.so.2(pzher2k_+0xba0) [0x7fce9fab6ea0]
/opt/intel/oneapi/mkl/latest/lib/libmkl_scalapack_lp64.so.2(pzhegst_+0x130e) [0x7fce9f7ef6be]
/opt/intel/oneapi/mkl/latest/lib/libmkl_scalapack_lp64.so.2(pzhengst_+0x614) [0x7fce9f7f1d74]
/opt/intel/oneapi/mkl/latest/lib/libmkl_scalapack_lp64.so.2(pzhegvx_+0xd83) [0x7fce9f7f0cf3]
abacus() [0x7d476b]
abacus() [0x7d1996]
abacus() [0x7cf250]
abacus() [0x7cb8cc]
abacus() [0xbfdad7]
abacus() [0xbfbe10]
abacus() [0x931cc2]
abacus() [0x8e9132]
abacus() [0x767988]
abacus() [0x77e2e7]
abacus() [0x7cf250]
abacus() [0x7cb8cc]
abacus() [0xbfdad7]
abacus() [0xbfbe10]
abacus() [0x931cc2]
abacus() [0x8e9132]
abacus() [0x767988]
abacus() [0x77e2e7]
abacus() [0x77d508]
abacus() [0x77ce01]
abacus() [0x43eac6]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f706e9d4d90]
Abort(1) on node 0: Internal error
abacus() [0x77d508]
abacus() [0x77ce01]
abacus() [0x43eac6]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fce9a3d4d90]
Abort(1) on node 5: Internal error

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 15467 RUNNING AT bohrium-585-1174097
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 2 PID 15468 RUNNING AT bohrium-585-1174097
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 3 PID 15469 RUNNING AT bohrium-585-1174097
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 4 PID 15470 RUNNING AT bohrium-585-1174097
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 6 PID 15472 RUNNING AT bohrium-585-1174097
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 7 PID 15473 RUNNING AT bohrium-585-1174097
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================
hongriTianqi commented 1 month ago

I have tried the following code modification, intended to work around the suspicious MPI_Bcast, but it failed.

The Intel machine outputs the above error no matter whether one uses genelpa or scalapack_gvx as the ks_solver, but the GNU machine is always fine.

diff --git a/source/module_hsolver/hsolver_lcao.cpp b/source/module_hsolver/hsolver_lcao.cpp
index 1d167612b..315f32b04 100644
--- a/source/module_hsolver/hsolver_lcao.cpp
+++ b/source/module_hsolver/hsolver_lcao.cpp
@@ -406,6 +406,7 @@ void HSolverLCAO<T, Device>::parakSolve(hamilt::Hamilt<T>* pHamilt,
         /// global index of k point
         int ik_global = ik + k2d.get_pKpoints()->startk_pool[k2d.get_my_pool()];
         auto psi_pool = psi::Psi<T>(1, ncol_bands_pool, k2d.get_p2D_pool()->nrow, nullptr);
+        std::vector<double> ekb(nbands, 0.0);
         ModuleBase::Memory::record("HSolverLCAO::psi_pool", nrow * ncol_bands_pool * sizeof(T));
         if (ik_global < psi.get_nk() && ik < k2d.get_pKpoints()->nks_pool[k2d.get_my_pool()])
         {
@@ -416,14 +417,26 @@ void HSolverLCAO<T, Device>::parakSolve(hamilt::Hamilt<T>* pHamilt,
             hamilt::MatrixBlock<T> sk_pool = hamilt::MatrixBlock<T>{k2d.sk_pool.data(),
                (size_t)k2d.get_p2D_pool()->get_row_size(), (size_t)k2d.get_p2D_pool()->get_col_size(), k2d.get_p2D_pool()->desc};
             /// solve eigenvector and eigenvalue for H(k)
-            pdiag_parak->diag_pool(hk_pool, sk_pool, psi_pool,&(pes->ekb(ik_global, 0)), k2d.POOL_WORLD_K2D);
+            pdiag_parak->diag_pool(hk_pool, sk_pool, psi_pool,ekb.data(), k2d.POOL_WORLD_K2D);
         }
         MPI_Barrier(MPI_COMM_WORLD);
         ModuleBase::timer::tick("HSolverLCAO", "collect_psi");
         for (int ipool = 0; ipool < ik_kpar.size(); ++ipool)
         {
             int source = k2d.get_pKpoints()->get_startpro_pool(ipool);
-            MPI_Bcast(&(pes->ekb(ik_kpar[ipool], 0)), nbands, MPI_DOUBLE, source, MPI_COMM_WORLD);
+            //MPI_Bcast(&(pes->ekb(ik_kpar[ipool], 0)), nbands, MPI_DOUBLE, source, MPI_COMM_WORLD);
+            int MY_RANK;
+            std::vector<double> ekb_global(nbands, 0.0);
+            MPI_Comm_rank(MPI_COMM_WORLD, &MY_RANK);
+            if (MY_RANK == source)
+            {
+                std::copy(ekb.begin(), ekb.end(), ekb_global.begin());
+            }
+            MPI_Barrier(MPI_COMM_WORLD);
+            // bcast ekb
+            MPI_Bcast(ekb_global.data(), nbands, MPI_DOUBLE, source, MPI_COMM_WORLD);
+            std::copy(ekb_global.begin(), ekb_global.end(), &(pes->ekb(ik_kpar[ipool], 0)));
+            //MPI_Bcast(&(pes->ekb(ik_kpar[ipool], 0)), nbands, MPI_DOUBLE, source, MPI_COMM_WORLD);
             int desc_pool[9];
             std::copy(k2d.get_p2D_pool()->desc, k2d.get_p2D_pool()->desc + 9, desc_pool);
             if (k2d.get_my_pool() != ipool) {
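In isolation, the pattern attempted above is simply: have the root rank stage its eigenvalues in a contiguous local buffer, broadcast that buffer, and copy it into pes->ekb afterwards. A minimal standalone MPI sketch of this pattern (nbands and the source rank are illustrative values, not ABACUS code):

```cpp
// Minimal sketch of the staging-buffer broadcast tried in the patch above.
#include <mpi.h>
#include <algorithm>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nbands = 8; // illustrative: number of bands per k-point
    const int source = 0; // illustrative: root rank of one k-point pool

    // Every rank owns the same contiguous staging buffer ...
    std::vector<double> ekb_global(nbands, 0.0);
    if (rank == source)
    {
        // ... which the root fills from its locally computed eigenvalues
        // (stand-in values here; diag_pool produces them in ABACUS).
        std::vector<double> ekb(nbands, 1.0);
        std::copy(ekb.begin(), ekb.end(), ekb_global.begin());
    }

    // Broadcast the staging buffer; receivers would then copy it into
    // the destination row of the eigenvalue matrix.
    MPI_Bcast(ekb_global.data(), nbands, MPI_DOUBLE, source, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```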
hongriTianqi commented 1 month ago

I saw a similar error reported on the internet: https://community.intel.com/t5/Intel-MPI-Library/MPI-program-aborts-with-an-quot-Assertion-failed-in-file-ch4-shm/td-p/1370537/page/2

where the problem was solved by updating Intel MPI to version 2023.2.

I have checked my image, and the MPI is version 2021.13:

```
/opt/intel/oneapi/mpi/
2021.11  2021.13  latest
```

So it might be helpful to update the MPI version of oneAPI in the image.
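Independent of the directory listing, one can confirm at run time which MPI library a binary actually linked against via the standard MPI_Get_library_version call; a minimal sketch:

```cpp
// Print the identification string of the MPI library actually in use
// (standard MPI-3 call; e.g. it reports the Intel MPI version on each rank).
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len = 0;
    MPI_Get_library_version(version, &len);
    std::printf("%s\n", version);
    MPI_Finalize();
    return 0;
}
```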
hongriTianqi commented 1 month ago

I tried a newer version of oneAPI, but still met the same error:

Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2263: comm->shm_numa_layout[my_numa_node].base_addr
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2263: comm->shm_numa_layout[my_numa_node].base_addr
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(MPL_backtrace_show+0x24) [0x7fec4979bd84]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7fec49356f51]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x34b8a0) [0x7fec4914b8a0]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x2623b4) [0x7fec490623b4]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x2610bd) [0x7fec490610bd]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x296a90) [0x7fec49096a90]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x291b4b) [0x7fec49091b4b]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x269003) [0x7fec49069003]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x25e779) [0x7fec4905e779]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x363cff) [0x7fec49163cff]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(MPL_backtrace_show+0x24) [0x7f9d7399bd84]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7f9d73556f51]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x34b8a0) [0x7f9d7334b8a0]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x2623b4) [0x7f9d732623b4]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x2610bd) [0x7f9d732610bd]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x296a90) [0x7f9d73296a90]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x291b4b) [0x7f9d73291b4b]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x269003) [0x7f9d73269003]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x25e779) [0x7f9d7325e779]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x363cff) [0x7f9d73363cff]
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(MPI_Bcast+0x27c) [0x7f9d7315151c]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_blacs_intelmpi_lp64.so.2(MKLMPI_Bcast+0x4d) [0x7f9d984e0cad]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_blacs_intelmpi_lp64.so.2(Czgebs2d+0x133) [0x7f9d984d5883]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_scalapack_lp64.so.2(PB_CInV+0xfad) [0x7f9d8060564d]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_scalapack_lp64.so.2(pzhemv_+0x823) [0x7f9d80696093]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_scalapack_lp64.so.2(pzlatrd_+0x924) [0x7f9d8044b474]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_scalapack_lp64.so.2(pzhentrd_+0xd57) [0x7f9d803f3957]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_scalapack_lp64.so.2(mkl_pzheevx0_+0x259b) [0x7f9d803ecddb]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_scalapack_lp64.so.2(mkl_pzheevxm_+0x94f) [0x7f9d803e986f]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_scalapack_lp64.so.2(pzheevx_+0x583) [0x7f9d803e8b53]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_scalapack_lp64.so.2(pzhegvx_+0xf4e) [0x7f9d803f0ebe]
abacus() [0x7d3c9b]
abacus() [0x7d0ec6]
abacus() [0x7ce893]
abacus() [0x7cb05c]
abacus() [0xc05707]
abacus() [0xc03a40]
abacus() [0x930102]
abacus() [0x8e8392]
abacus() [0x767118]
abacus() [0x77da77]
abacus() [0x77cc98]
Abort(1) on node 4: Internal error
/opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(MPI_Bcast+0x27c) [0x7fec48f5151c]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_blacs_intelmpi_lp64.so.2(MKLMPI_Bcast+0x4d) [0x7fec6e2dfcad]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_blacs_intelmpi_lp64.so.2(Czgebs2d+0x133) [0x7fec6e2d4883]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_scalapack_lp64.so.2(PB_CInV+0xfad) [0x7fec5640564d]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_scalapack_lp64.so.2(pzhemv_+0x823) [0x7fec56496093]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_scalapack_lp64.so.2(pzlatrd_+0x924) [0x7fec5624b474]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_scalapack_lp64.so.2(pzhentrd_+0xd57) [0x7fec561f3957]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_scalapack_lp64.so.2(mkl_pzheevx0_+0x259b) [0x7fec561ecddb]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_scalapack_lp64.so.2(mkl_pzheevxm_+0x94f) [0x7fec561e986f]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_scalapack_lp64.so.2(pzheevx_+0x583) [0x7fec561e8b53]
/opt/intel/oneapi/mkl/2024.2/lib/libmkl_scalapack_lp64.so.2(pzhegvx_+0xf4e) [0x7fec561f0ebe]
abacus() [0x7d3c9b]
abacus() [0x7d0ec6]
abacus() [0x7ce893]
abacus() [0x7cb05c]
abacus() [0xc05707]
abacus() [0xc03a40]
abacus() [0x930102]
abacus() [0x8e8392]
abacus() [0x767118]
abacus() [0x77da77]
abacus() [0x77cc98]
hongriTianqi commented 1 month ago

With the 2024 version of Intel oneAPI and ks_solver genelpa, the persistent error has disappeared, although it is still present with the scalapack_gvx solver.

hongriTianqi commented 1 month ago

By the way, the old Intel image delivered the following error, which has disappeared as mentioned above:

Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2255: comm->shm_numa_layout[my_numa_node].base_addr
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2255: comm->shm_numa_layout[my_numa_node].base_addr
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7fac546aa06c]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7fac54053f01]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x38e694) [0x7fac53d8e694]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x221c66) [0x7fac53c21c66]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x256d8c) [0x7fac53c56d8c]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x26d930) [0x7fac53c6d930]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x23e7a1) [0x7fac53c3e7a1]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x21cce3) [0x7fac53c1cce3]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x392daa) [0x7fac53d92daa]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(MPI_Bcast+0x417) [0x7fac53b83917]
/opt/intel/oneapi/mkl/latest/lib/libmkl_blacs_intelmpi_lp64.so.2(MKLMPI_Bcast+0x4d) [0x7fac70a51cad]
/opt/intel/oneapi/mkl/latest/lib/libmkl_blacs_intelmpi_lp64.so.2(Czgebr2d+0x13c) [0x7fac70a4775c]
/opt/intel/oneapi/mkl/latest/lib/libmkl_scalapack_lp64.so.2(PB_CInV+0xc10) [0x7fac58c052b0]
/opt/intel/oneapi/mkl/latest/lib/libmkl_scalapack_lp64.so.2(PB_CptrmmAB+0x5d2) [0x7fac58c3fe32]
/opt/intel/oneapi/mkl/latest/lib/libmkl_scalapack_lp64.so.2(pztrmm_+0xe85) [0x7fac58cbbae5]
abacus() [0xc76812]
abacus() [0xc75ba4]
abacus() [0x7d89e3]
abacus() [0x7cf103]
abacus() [0x7cb8cc]
abacus() [0x930163]
abacus() [0x8e8e98]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(MPL_backtrace_show+0x1c) [0x7fe94b4aa06c]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(MPIR_Assert_fail+0x21) [0x7fe94ae53f01]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x38e694) [0x7fe94ab8e694]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x221c66) [0x7fe94aa21c66]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x256d8c) [0x7fe94aa56d8c]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x26d930) [0x7fe94aa6d930]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x23e7a1) [0x7fe94aa3e7a1]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x21cce3) [0x7fe94aa1cce3]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(+0x392daa) [0x7fe94ab92daa]
/opt/intel/oneapi/mpi/latest/lib/libmpi.so.12(MPI_Bcast+0x417) [0x7fe94a983917]
/opt/intel/oneapi/mkl/latest/lib/libmkl_blacs_intelmpi_lp64.so.2(MKLMPI_Bcast+0x4d) [0x7fe967823cad]
/opt/intel/oneapi/mkl/latest/lib/libmkl_blacs_intelmpi_lp64.so.2(Czgebr2d+0x13c) [0x7fe96781975c]
/opt/intel/oneapi/mkl/latest/lib/libmkl_scalapack_lp64.so.2(PB_CInV+0xc10) [0x7fe94fa052b0]
/opt/intel/oneapi/mkl/latest/lib/libmkl_scalapack_lp64.so.2(PB_CptrmmAB+0x5d2) [0x7fe94fa3fe32]
/opt/intel/oneapi/mkl/latest/lib/libmkl_scalapack_lp64.so.2(pztrmm_+0xe85) [0x7fe94fabbae5]
abacus() [0xc76812]
abacus() [0xc75ba4]
abacus() [0x7d89e3]
abacus() [0x7cf103]
abacus() [0x7cb8cc]
abacus() [0x930163]
abacus() [0x8e8e98]
abacus() [0x767988]
abacus() [0x77e2e7]
abacus() [0x77d508]
abacus() [0x77ce01]
abacus() [0x43eac6]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fe94a3d4d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fe94a3d4e40]
abacus() [0x43e975]
Abort(1) on node 5: Internal error
abacus() [0x767988]
abacus() [0x77e2e7]
abacus() [0x77d508]
abacus() [0x77ce01]
abacus() [0x43eac6]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fac535d4d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fac535d4e40]
abacus() [0x43e975]
hongriTianqi commented 1 month ago

> With the 2024 version of Intel oneAPI and ks_solver genelpa, the persistent error has disappeared, although it is still present with the scalapack_gvx solver.

Sorry, one still meets the following error with this setting:

STDERR: abacus() [0x77cc98]
STDERR: abacus() [0x77c591]
STDERR: abacus() [0x43eac6]
STDERR: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x153b425d7d90]
STDERR: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x153b425d7e40]
STDERR: abacus() [0x43e975]
STDERR: Abort(1) on node 12: Internal error
STDERR: Assertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_bcast.h at line 88: shm_heap_buffer != NULL
STDERR: /opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(MPL_backtrace_show+0x24) [0x154401b9bd84]
STDERR: /opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(MPIR_Assert_fail+0x21) [0x154401756f51]
STDERR: /opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x786ac4) [0x154401986ac4]
STDERR: /opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x7775be) [0x1544019775be]
STDERR: /opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x296c41) [0x154401496c41]
STDERR: /opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x291b4b) [0x154401491b4b]
STDERR: /opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x269003) [0x154401469003]
STDERR: /opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x25e779) [0x15440145e779]
STDERR: /opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(+0x363cff) [0x154401563cff]
STDERR: /opt/intel/oneapi/mpi/2021.13/lib/libmpi.so.12(MPI_Bcast+0x27c) [0x15440135151c]
STDERR: /opt/intel/oneapi/mpi/2021.13/lib/libmpifort.so.12(pmpi_bcast+0x23) [0x1543f98f5db3]
STDERR: /usr/local/lib/libelpa_openmp.so.19(elpa2_compute_mp_trans_ev_tridi_to_band_complex_double_+0x8e09) [0x154426a70749]
STDERR: /usr/local/lib/libelpa_openmp.so.19(elpa2_impl_mp_elpa_solve_evp_complex_2stage_a_h_a_double_impl_+0x771a) [0x154426af1b9a]
STDERR: /usr/local/lib/libelpa_openmp.so.19(elpa_impl_mp_elpa_eigenvectors_a_h_a_dc_+0x192) [0x15442698db42]
STDERR: /usr/local/lib/libelpa_openmp.so.19(elpa_eigenvectors_a_h_a_dc+0xc10) [0x15442698eee0]
STDERR: abacus() [0xc73fde]
STDERR: abacus() [0xc745a6]
STDERR: abacus() [0x7d8173]
STDERR: abacus() [0x7ce893]
STDERR: abacus() [0x7cb05c]
STDERR: abacus() [0xc05707]
STDERR: abacus() [0xc047bd]
STDERR: abacus() [0x930102]
STDERR: abacus() [0x8e8392]
STDERR: abacus() [0x767118]
STDERR: abacus() [0x77da77]
hongriTianqi commented 1 month ago

We have tested a series of systems and found that setting kpar equal to the number of MPI processes is currently a safe way to use this feature.
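For example (an illustrative 8-process run, not a prescription):

```
# INPUT: illustrative, for a job launched as `mpirun -np 8 abacus`
kpar    8
```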