deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0

HSE segmentation fault in large system #5016

Closed QuantumMisaka closed 2 weeks ago

QuantumMisaka commented 3 weeks ago

Describe the bug

When running LibRI HSE with exx_separate_loop 1 in some FeCx systems, an error occurs and the calculation aborts:

(base) [2201110432@wm2-login01 test2]$ cat abacus.err 
[l08c54n1:524864:0:524864] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[l08c52n4:417396:0:417396] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'
[l08c53n3:1797393:0:1797393] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)

Attachments: Fe2C-HSE-RI02.tar.gz

Expected behavior

HSE / PBE0 running normally

To Reproduce

in Attachments

Environment

Additional Context

@PeizeLin @maki49 Any advice ?

Task list for Issue attackers (only for developers)

QuantumMisaka commented 3 weeks ago

Added: PBE0 also fails, and ABACUS 3.7.4 (Commit: 93badfa87, Wed Aug 28 10:18:22 2024 +0800) has the same problem

maki49 commented 3 weeks ago

On my machine it is killed at cal_datas in cal_Cs_dCs (the same happens with cal_force 1 and cal_stress 0). It may be either OOM or a bug in cal_Cs_dCs. @PeizeLin it may need your help. For your information:

 ==> Exx_LRI::cal_exx_ions  51 GB   196 s
 ==> LRI_CV::cal_Vs 51 GB   196 s
 ==> LRI_CV::cal_datas  51 GB   196 s
 ==> LRI_CV::cal_dVs    48 GB   208 s
 ==> LRI_CV::cal_datas  48 GB   208 s
 ==> LRI_CV::cal_Cs_dCs 12.3 GB 340 s
 ==> LRI_CV::cal_datas  12.3 GB 340 s
QuantumMisaka commented 2 weeks ago

@maki49 @PeizeLin On my machine, the same error occurs even with cal_force 0 and cal_stress 0

FYI: in running_scf.log

 SETUP SEARCHING RADIUS FOR PROGRAM TO SEARCH ADJACENT ATOMS
                  longest orb rcut (Bohr) = 8
   longest nonlocal projector rcut (Bohr) = 2.16
              searching radius is (Bohr)) = 20.3
         searching radius unit is (Bohr)) = 1.89

 SETUP EXTENDED REAL SPACE GRID FOR GRID INTEGRATION
                          real space grid = [ 80, 80, 72 ]
                 big cell numbers in grid = [ 16, 16, 24 ]
             meshcell numbers in big cell = [ 5, 5, 3 ]
                        extended fft grid = [ 11, 11, 18 ]
                dimension of extened grid = [ 39, 39, 61 ]
                            UnitCellTotal = 27
              Atom number in sub-FFT-grid = 24
    Local orbitals number in sub-FFT-grid = 536
                                ParaV.nnr = 1343892
                                     nnrg = 2838200

 Warning_Memory_Consuming allocated:  Gint::hRGint 21.9 MB

 Warning_Memory_Consuming allocated:  Gint::DMRGint 43.8 MB

 Warning_Memory_Consuming allocated:  pvpR_reduced 43.3 MB
QuantumMisaka commented 2 weeks ago

I've done some testing and found that this bug emerged somewhere between Commit: a33935612 (Thu Jun 27 16:40:42 2024 +0800) and Commit: 58126a8e6 (Mon Jul 15 21:45:33 2024 +0800)

I used LibRI-loop3 (on gitee) and LibComm 0.1.1 with these versions

QuantumMisaka commented 2 weeks ago

@maki49 I've done some testing and confirmed that Commit 740bf8e4ecd9f847751bcb000f89eb6367075d31 has no problem but commit 8a1f0125ae8714ab763efcb28a7f2f436e03e722 does; could you check?

Dependencies:

- LibRI: loop3 on gitee (early version)
- LibComm: 0.1.1
- ELPA: 2024.03.001
- Intel-OneAPI: 2023.0.0
- Hardware: Intel 8358, 64 cores, 1024 GB memory

maki49 commented 2 weeks ago

It is an out-of-memory error.
In 8a1f012, about 0.2~0.3 GB more memory is allocated for H(R) in the constructor of OperatorEXX, which leads to your problem.


| commit | exx_real_number | free memory before `cal_Cs_dCs` (GB) | free memory after `cal_Cs_dCs` (GB) |
| -- | -- | -- | -- |
| 740bf8e | 0 | 36 | boom |
| 740bf8e | 1 | 42.2 | 17.3 |
| 8a1f012 | 0 | 35.8 | boom |
| 8a1f012 | 1 | 41.9 | 17.1 |

You can try exx_real_number 1, with which both commits run normally on my machine.
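For reference, a minimal INPUT fragment enabling this workaround might look like the following (other parameters elided; this is a sketch, not the complete input from the attached test case):

```
INPUT_PARAMETERS
dft_functional     hse
exx_separate_loop  1
exx_real_number    1    # store EXX quantities as real numbers, reducing memory use
```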

QuantumMisaka commented 2 weeks ago

@maki49 I don't think this is an OOM issue (though it does show OOM-like characteristics). I compiled with -DDEBUG_INFO=ON and ran HSE on Si supercell systems with nspin 1. The error occurs for a 3x3x3 Si supercell (54 atoms) but not for a 2x3x3 supercell (36 atoms).

Attachments: HSE-debug-Si.tar.gz

Notes: memory information in Si-333 running_scf.log

 SETUP SEARCHING RADIUS FOR PROGRAM TO SEARCH ADJACENT ATOMS
                  longest orb rcut (Bohr) = 7
   longest nonlocal projector rcut (Bohr) = 3.64
 ==> atom_arrange::search       996 GB  9.46 s
              searching radius is (Bohr)) = 21.3
         searching radius unit is (Bohr)) = 1.89
 ==> Atom_input::Atom_input     996 GB  9.46 s
 ==> Atom_input::Expand_Grid    996 GB  9.46 s
 ==> Atom_input::calculate_cells        996 GB  9.46 s
 ==> SLTK_Grid::init    996 GB  9.46 s
 ==> SLTK_Grid::setMemberVariables      996 GB  9.46 s
 ==> SLTK_Grid::Build_Cell      996 GB  9.46 s
 ==> SLTK_Grid::Build_Hash_Table        996 GB  9.46 s
 ==> SLTK_Grid::Fold_Hash_Table 996 GB  9.46 s
 ==> Grid_Technique::init       996 GB  9.46 s

 SETUP EXTENDED REAL SPACE GRID FOR GRID INTEGRATION
                          real space grid = [ 144, 144, 144 ]
                 big cell numbers in grid = [ 48, 48, 48 ]
             meshcell numbers in big cell = [ 3, 3, 3 ]
 ==> Grid_MeshCell::init_latvec 996 GB  9.46 s
 ==> Grid_BigCell::init_big_latvec      996 GB  9.46 s
 ==> Grid_BigCell::init_grid_expansion  996 GB  9.46 s
                        extended fft grid = [ 20, 20, 20 ]
                dimension of extened grid = [ 89, 89, 89 ]
 ==> Grid_MeshK::cal_extended_cell      996 GB  9.46 s
                            UnitCellTotal = 27
 ==> Grid_BigCell::init_tau_in_bigcell  996 GB  9.46 s
 ==> Grid_MeshBall::init_meshball       996 GB  9.46 s
 ==> Grid_Technique::init_atoms_on_grid 996 GB  9.48 s
 ==> Grid_Technique::get_startind       996 GB  9.48 s
 ==> Grid_BigCell::grid_expansion_index 996 GB  9.48 s
 ==> Grid_Techinique::init_atoms_on_grid2       996 GB  9.62 s
 ==> Grid_BigCell::grid_expansion_index 996 GB  9.62 s
 ==> Grid_Technique::cal_trace_lo       996 GB  9.63 s
              Atom number in sub-FFT-grid = 54
    Local orbitals number in sub-FFT-grid = 702
 ==> Record_adj::for_2d 996 GB  9.63 s
                                ParaV.nnr = 205874
 ==> LCAO_nnr::cal_nnrg 996 GB  9.65 s
 ==> LCAO_nnr::cal_max_box_index        996 GB  9.65 s
                                     nnrg = 793962
 ==> LCAO_domain::grid_prepare  996 GB  9.66 s
 ==> Gint_k::prep_grid  996 GB  9.66 s
 ==> Potential::pot_register    996 GB  9.66 s
 ==> Potential::get_pot_type    996 GB  9.66 s
 ==> Potential::get_pot_type    996 GB  9.66 s
 ==> Potential::get_pot_type    996 GB  9.66 s
 ==> Veff::initialize_HR        996 GB  9.66 s
 ==> Gint::initialize_pvpR      996 GB  9.66 s

 Warning_Memory_Consuming allocated:  Gint::hRGint 6.78 MB

 Warning_Memory_Consuming allocated:  Gint::DMRGint 6.68 MB
 ==> Gint_k::destroy_pvpR       996 GB  9.66 s
 ==> Gint_k::allocate_pvpR      996 GB  9.66 s

 Warning_Memory_Consuming allocated:  pvpR_reduced 6.06 MB
 ==> OverlapNew::initialize_SR  996 GB  9.66 s
 ==> EkineticNew::initialize_HR 996 GB  9.66 s
 ==> NonlocalNew::initialize_HR 996 GB  9.67 s
 ==> OperatorEXX::OperatorEXX   996 GB  9.67 s
maki49 commented 2 weeks ago

Oops, you've found a real bug. It happens when the system is large enough (NLOCAL > 500): the larger 2D block size leaves some processors with no elements of the (0,0) atom pair. It was my incorrect usage of HContainer. You can (help me) try my branch in #5028 to check whether it has been fixed. (Some of this is hidden by OOM on my machine...)

QuantumMisaka commented 2 weeks ago

@maki49 I've tested your branch; it passes for the Si-333 supercell, but it cannot run normally for the Si-344 and Si-444 supercells. The stdout is:

                              ABACUS v3.7.4

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: 69e3487ee (Sat Aug 31 02:05:43 2024 +0800)

 Sat Aug 31 12:04:35 2024
 MAKE THE DIR         : OUT.ABACUS/
 RUNNING WITH DEVICE  : CPU / Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
 dft_functional readin is: hse
 dft_functional in pseudopot file is: PBE
 Please make sure this is what you need
 UNIFORM GRID DIM        : 192 * 192 * 192
 UNIFORM GRID DIM(BIG)   : 48 * 48 * 48
 DONE(0.406771   SEC) : SETUP UNITCELL
 DONE(0.420193   SEC) : INIT K-POINTS
 ---------------------------------------------------------
 Self-consistent calculations for electrons
 ---------------------------------------------------------
 SPIN    KPOINTS         PROCESSORS  NBASE       
 1       8               2           1664        
 ---------------------------------------------------------
 Use Systematically Improvable Atomic bases
 ---------------------------------------------------------
 ELEMENT ORBITALS        NBASE       NATOM       XC          
 Si      2s2p1d-7au      13          128         
 ---------------------------------------------------------
 Initial plane wave basis and FFT box
 ---------------------------------------------------------
 DONE(0.501615   SEC) : INIT PLANEWAVE
 -------------------------------------------
 SELF-CONSISTENT : 
 -------------------------------------------
 START CHARGE      : atomic
 DONE(89.6475    SEC) : INIT SCF
 * * * * * *
 << Start SCF iteration.
 ITER       ETOT/eV          EDIFF/eV         DRHO     TIME/s
 GE1     -1.37007761e+04   0.00000000e+00   1.4898e-01   3.87
 GE2     -1.37027690e+04  -1.99294319e+00   1.9532e-02   3.42
 GE3     -1.37027710e+04  -2.00335539e-03   2.2260e-03   3.39
 GE4     -1.37027710e+04  -2.72010587e-05   2.6650e-05   3.41
 GE5     -1.37027710e+04  -8.98390392e-08   2.5130e-06   3.42
 Updating EXX and rerun SCF     2.263e+01 (s)
 GE0     -1.37027710e+04  -3.68445088e-09   4.3185e-07  26.07

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 1041406 RUNNING AT l07c80n1
=   KILLED BY SIGNAL: 6 (Aborted)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 1041407 RUNNING AT l07c80n1
=   KILLED BY SIGNAL: 6 (Aborted)
===================================================================================

The stderr message:

abacus: /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hamilt_lcao/module_hcontainer/base_matrix.cpp:127: void hamilt::BaseMatrix<double>::add_element(int, int, const T &) [T = double]: Assertion `this->value_begin != nullptr' failed.
abacus: /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hamilt_lcao/module_hcontainer/base_matrix.cpp:127: void hamilt::BaseMatrix<double>::add_element(int, int, const T &) [T = double]: Assertion `this->value_begin != nullptr' failed.

Attachments:

Test-Si-HSE-69e348.tar.gz

maki49 commented 2 weeks ago

In your new examples, some atom pairs are missing in HContainer because they are not adjacent; for EXX they should still be added whenever the current processor owns them under the 2D-block division. Please try my new fix in #5028.

QuantumMisaka commented 2 weeks ago

@maki49 EXX in Si-444 and larger systems has no problem, but a Si-778 system (784 atoms) leads to a segmentation fault from a different problem (and again I do not think it is OOM):

==== backtrace (tid: 138369) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x0000000000254159 elpa2_compute_mp_trans_ev_band_to_full_complex_double_()  /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2024.03.001/build_cpu/manually_preprocessed_.._src_elpa2_elpa2_compute.F90-src_elpa2_.libs_libelpa_openmp_private_la-elpa2_compute.o.F90:15626
 2 0x00000000003717aa elpa2_impl_mp_elpa_solve_evp_complex_2stage_a_h_a_double_impl_()  /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2024.03.001/build_cpu/manually_preprocessed_.._src_elpa2_elpa2.F90-src_elpa2_.libs_libelpa_openmp_private_la-elpa2.o.F90:6441
 3 0x00000000000c512f elpa_impl_mp_elpa_eigenvectors_a_h_a_dc_()  /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2024.03.001/build_cpu/manually_preprocessed_.._src_elpa_impl.F90-src_.libs_libelpa_openmp_private_la-elpa_impl.o.F90:5570
 4 0x00000000000c4709 elpa_eigenvectors_a_h_a_dc()  /lustre/home/2201110432/apps/abacus/toolchain_used/toolchain-icx/build/elpa-2024.03.001/build_cpu/manually_preprocessed_.._src_elpa_impl.F90-src_.libs_libelpa_openmp_private_la-elpa_impl.o.F90:5706
 5 0x0000000000bde2e2 elpa_eigenvectors()  /lustre/home/2201110432/lib/elpa/2024.03.001-icx/cpu/include/elpa/elpa_generic.h:82
 6 0x0000000000bde8ae ELPA_Solver::generalized_eigenvector()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hsolver/genelpa/elpa_new_complex.cpp:130
 7 0x00000000007641c3 hsolver::DiagoElpa<std::complex<double> >::diag()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hsolver/diago_elpa.cpp:90
 8 0x00000000007641c3 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string()  /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/basic_string.h:519
 9 0x00000000007641c3 hsolver::DiagoElpa<std::complex<double> >::diag()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hsolver/diago_elpa.cpp:95
10 0x000000000075c3d1 hsolver::HSolverLCAO<std::complex<double>, base_device::DEVICE_CPU>::hamiltSolvePsiK()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hsolver/hsolver_lcao.cpp:149
11 0x000000000075c3d1 hsolver::HSolverLCAO<std::complex<double>, base_device::DEVICE_CPU>::hamiltSolvePsiK()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hsolver/hsolver_lcao.cpp:150
12 0x000000000075a7d1 hsolver::HSolverLCAO<std::complex<double>, base_device::DEVICE_CPU>::solve()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_hsolver/hsolver_lcao.cpp:104
13 0x00000000008ba78f ModuleESolver::ESolver_KS_LCAO<std::complex<double>, double>::hamilt2density()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_esolver/esolver_ks_lcao.cpp:713
14 0x00000000008ba78f ???()  /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/basic_string.h:215
15 0x00000000008ba78f ???()  /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/basic_string.h:224
16 0x00000000008ba78f std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string()  /usr/lib/gcc/x86_64-redhat-linux/8/../../../../include/c++/8/bits/basic_string.h:661
17 0x00000000008ba78f ModuleESolver::ESolver_KS_LCAO<std::complex<double>, double>::hamilt2density()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_esolver/esolver_ks_lcao.cpp:713
18 0x000000000085b0f9 ModuleESolver::ESolver_KS<std::complex<double>, base_device::DEVICE_CPU>::runner()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_esolver/esolver_ks.cpp:449
19 0x00000000006f9265 Relax_Driver::relax_driver()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_relax/relax_driver.cpp:49
20 0x000000000070f442 Driver::driver_run()  /lustre/home/2201110432/apps/abacus/abacus-test/source/driver_run.cpp:68
21 0x000000000070f442 Relax_Driver::~Relax_Driver()  /lustre/home/2201110432/apps/abacus/abacus-test/source/module_relax/relax_driver.h:14
22 0x000000000070f442 Driver::driver_run()  /lustre/home/2201110432/apps/abacus/abacus-test/source/driver_run.cpp:69
23 0x000000000070e665 Driver::atomic_world()  /lustre/home/2201110432/apps/abacus/abacus-test/source/driver.cpp:186
24 0x000000000070df5e Driver::init()  /lustre/home/2201110432/apps/abacus/abacus-test/source/driver.cpp:40
25 0x00000000004359e6 main()  ???:0
26 0x000000000003ad85 __libc_start_main()  ???:0
27 0x000000000043589e _start()  ???:0
=================================

Attachments:

Si-778-fail.tar.gz

Notes: for Si-777 (686 atoms), HSE runs normally.

maki49 commented 2 weeks ago

PBE with genelpa also fails on my machine. You can try ks_solver scalapack_gvx.
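A minimal INPUT fragment switching the diagonalization backend might look like the following (a sketch; all other parameters are assumed unchanged from the failing run):

```
INPUT_PARAMETERS
ks_solver  scalapack_gvx    # use ScaLAPACK instead of genelpa for diagonalization
```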

QuantumMisaka commented 2 weeks ago

On my machine PBE does not fail, but after changing ks_solver from genelpa to scalapack_gvx, the HSE calculation completes in a Si-888 system (1024 atoms).

It seems the EXX part has no problem; the problem is in the ELPA calculation.

@caic99 Any comments ?