deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0

Interrupt in `Charge_Mixing::mix_rho` #5532

Closed: PeizeLin closed this issue 1 day ago

PeizeLin commented 3 days ago

Describe the bug

An SCF calculation for 128 He atoms (PBE, nspin=1) is interrupted in `Charge_Mixing::mix_rho` at the first step.

Run configuration: 1 MPI process * 56 OpenMP threads. Remaining memory is 194 GB.

ABACUS version: 2024.11.13-dcff74dbeb

He_PBE_128_k1.zip

mohanchen commented 3 days ago

Could you check old versions of ABACUS, such as 3.4, 3.5, 3.6, 3.7?

WHUweiqingzhou commented 2 days ago

@PeizeLin PR #5508 fixed a serious memory leak in FFT. Could you try again?

WHUweiqingzhou commented 2 days ago

BTW, I tried to reproduce your issue but could not. Could you debug your calculation with GDB? Here is the output of my run:


                              ABACUS v3.8.3

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: 264915205 (Wed Nov 20 08:53:18 2024 +0800)

 Wed Nov 20 13:40:33 2024
 MAKE THE DIR         : OUT.ABACUS/
 RUNNING WITH DEVICE  : CPU / Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
 UNIFORM GRID DIM        : 375 * 375 * 2160
 UNIFORM GRID DIM(BIG)   : 75 * 75 * 540
 DONE(3.84129    SEC) : SETUP UNITCELL
 DONE(5.239      SEC) : INIT K-POINTS
 ---------------------------------------------------------
 Self-consistent calculations for electrons
 ---------------------------------------------------------
 SPIN    KPOINTS         PROCESSORS  THREADS     NBASE       
 1       1               32          32          640         
 ---------------------------------------------------------
 Use Systematically Improvable Atomic bases
 ---------------------------------------------------------
 ELEMENT ORBITALS        NBASE       NATOM       XC          
 He      2s1p-6au        5           128         
 ---------------------------------------------------------
 Initial plane wave basis and FFT box
 ---------------------------------------------------------
 DONE(7.58819    SEC) : INIT PLANEWAVE
 -------------------------------------------
 SELF-CONSISTENT : 
 -------------------------------------------
 START CHARGE      : atomic
 DONE(136.831    SEC) : INIT SCF
 ITER       ETOT/eV          EDIFF/eV         DRHO     TIME/s
 GE1     -9.61443788e+03   0.00000000e+00   4.3042e-02  19.98
 GE2     -9.61463134e+03  -1.93460842e-01   6.3655e-03  20.05
 GE3     -9.61464143e+03  -1.00895202e-02   1.9757e-03  20.15
 GE4     -9.61464588e+03  -4.45173080e-03   1.0249e-03  20.21
 GE5     -9.61464973e+03  -3.85078496e-03   6.7370e-04  20.28
 GE6     -9.61465098e+03  -1.25104934e-03   4.6435e-04  20.36
 GE7     -9.61465159e+03  -6.07939182e-04   3.4548e-04  20.44
 GE8     -9.61465184e+03  -2.47790231e-04   2.3762e-04  20.49
 GE9     -9.61465198e+03  -1.42980558e-04   1.4803e-04  20.46
 GE10    -9.61465203e+03  -4.75686398e-05   8.3014e-05  20.43
 GE11    -9.61465207e+03  -3.84768551e-05   5.5287e-05  20.42
 GE12    -9.61465207e+03  -5.74759178e-06   2.7248e-05  20.42
 GE13    -9.61465207e+03  -7.90944256e-07   1.5309e-05  20.41

A-006 commented 2 days ago

Due to my mistake, the memory in FFT was not properly released. I have corrected this part of the code in the new PR. Can it run now? Let me check. I don't have this issue on my machine. Could you please check whether you can still reproduce it?

dyzheng commented 2 days ago

I have found the bug behind this issue. It is in the file mixing_data.cpp, at this line: https://github.com/deepmodeling/abacus-develop/blob/develop/source/module_base/module_mixing/mixing_data.cpp#L28

    if (ndim * length > 0)
    {
        this->data = malloc(ndim * length * type_size);
    }

In the case reported in this issue, the variables have the values

    ndim = 8
    length = 303750000

so the int value of `ndim * length` is 2.43e9, which exceeds the int limit of about 2.147e9 and overflows.

The solution is to use size_t rather than int in this code.
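
To make the arithmetic concrete, here is a minimal sketch of the overflow check and of a size computation done in size_t. It is not the code from the actual fix. The helper names `product_overflows_int` and `buffer_bytes` are hypothetical, and the sketch assumes a 32-bit int and a 64-bit size_t, as on typical Linux x86-64 builds. Note that `length = 303750000` equals 375 * 375 * 2160, the uniform grid dimensions printed in the log above. Signed int overflow is undefined behavior in C++. If the product happens to wrap to a negative value, the `ndim * length > 0` guard is false and the allocation is skipped, which would be consistent with the interruption seen in `Charge_Mixing::mix_rho`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper: returns true when ndim * length does not fit in an int.
// The product is formed in 64-bit arithmetic, so the check itself cannot overflow.
inline bool product_overflows_int(int ndim, int length)
{
    const std::int64_t wide = static_cast<std::int64_t>(ndim) * static_cast<std::int64_t>(length);
    return wide > std::numeric_limits<int>::max() || wide < std::numeric_limits<int>::min();
}

// Hypothetical helper: number of bytes to allocate, computed entirely in size_t.
// Each operand is widened before multiplying, so no intermediate int product exists.
// Returns 0 when either dimension is non-positive (nothing to allocate).
inline std::size_t buffer_bytes(int ndim, int length, std::size_t type_size)
{
    assert(type_size > 0);
    if (ndim <= 0 || length <= 0)
    {
        return 0;
    }
    return static_cast<std::size_t>(ndim) * static_cast<std::size_t>(length) * type_size;
}
```

With the values from this issue, `product_overflows_int(8, 303750000)` is true, while `buffer_bytes(8, 303750000, 8)` gives 19440000000 bytes (about 19.4 GB), assuming 8-byte elements. The element size is an assumption here, since `type_size` is not stated in this thread.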

WHUweiqingzhou commented 1 day ago

Fixed by #5545