deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0

Program hangs/crashes #1936

Open PeizeLin opened 1 year ago

PeizeLin commented 1 year ago

Describe the bug

Two layers of graphene, 3268 C atoms. PBE, scf.

When DSIZE = 1, the program hangs for more than an hour.

 Warning_Memory_Consuming allocated:  LOC::DM 1.38e+04 MB
 allocate DM , the dimension is 42484
        enter setAlltoallvParameter, nblk = 64
                                     pnum = 0
                                     prow = 0
                                     pcol = 0
                             nRow_in_proc = 42484
                             nCol_in_proc = 42484
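
The reported allocation matches the size of one dense matrix of dimension 42484, which is 3268 C atoms times the 13 basis functions per atom of the 2s2p1d basis listed in the log further down the thread. A minimal check, assuming 8-byte real elements and that MB here means 1024^2 bytes (both are assumptions, chosen because they reproduce the logged figure):

```python
# Rough size check for one dense matrix of the dimension reported in the log.
# Assumptions (not confirmed in the thread): 8-byte real elements,
# and 1 MB = 1024**2 bytes.
N_ATOMS = 3268           # carbon atoms, from the issue description
ORBITALS_PER_ATOM = 13   # 2s2p1d basis: 2*1 + 2*3 + 1*5


def dense_matrix_mb(dim, bytes_per_element=8):
    """Memory in MB needed for a dim x dim dense matrix."""
    return dim * dim * bytes_per_element / 1024**2


nbase = N_ATOMS * ORBITALS_PER_ATOM
size_mb = dense_matrix_mb(nbase)
print(f"NBASE = {nbase}")                    # NBASE = 42484
print(f"dense matrix = {size_mb:.3g} MB")    # dense matrix = 1.38e+04 MB
```

With nRow_in_proc = nCol_in_proc = 42484 and DSIZE = 1, the single process holds the whole matrix, so every further full-size matrix adds roughly the same amount again.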

When DSIZE > 1, the program crashes immediately. (It shows more error messages with mpich than with intelmpi.) It crashes in these functions:

C_16_k1_PBE.zip

Expected behavior

No response

To Reproduce

No response

Environment

Linux 3.10.0-1160.el7.x86_64, Red Hat 4.8.5-44, icpc 2021.5.0 (gcc 10.2.0), intelmpi 2021.5 / mpich 4.1, mkl 2021.5, elpa_openmp 2021.11.002, cereal 1.3.2

Additional Context

The ModuleBase::TITLE() calls are printed in running_scf.log, together with the available memory and the time consumed.

caic99 commented 1 year ago

Hi @PeizeLin, would you please first check whether an OOM error occurred?

PeizeLin commented 1 year ago

> Hi @PeizeLin, would you please first check whether an OOM error occurred?

As shown in running_scf.log, each node still had 180 GB of available memory when the program crashed.

caic99 commented 1 year ago

@PeizeLin This might be related to errors in MPI communication. Since your program hangs, you can try using gdb attach on the running process to analyze the cause.
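
As a minimal sketch of that suggestion, the helper below builds a non-interactive gdb command that attaches to a running process and prints the backtrace of every thread. The function names are illustrative and not part of ABACUS; the script assumes gdb is installed and that it runs on the node where the process lives, as the same user.

```python
# Sketch: dump backtraces from every thread of a running (possibly hung) process.
# Assumptions: gdb is installed, and this runs on the node hosting the process.
import subprocess


def gdb_backtrace_cmd(pid):
    """Build a batch-mode gdb command that attaches to pid and prints all stacks."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_backtrace(pid):
    """Run gdb against pid and return everything it printed."""
    result = subprocess.run(gdb_backtrace_cmd(pid), capture_output=True, text=True)
    return result.stdout + result.stderr
```

Calling `dump_backtrace(pid)` for several MPI ranks and comparing the stacks usually shows whether the ranks are waiting in the same communication call or in different ones.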

Satinelamp commented 1 year ago

@caic99 On the cluster I use, I can only log in to the master node, and I have to submit the job to a computing node. How should I provide the PID to `gdb attach`? Here is what I got when running the top command on the master node: image

Satinelamp commented 1 year ago

@PeizeLin I tried the test case with 5 nodes (each node: 64 cores and 256 GB), but I still got an out-of-memory error. Maybe we should try a smaller supercell first so that we can rule out the memory issue.
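
A back-of-the-envelope look at this node budget: 64 ranks sharing 256 GB leaves about 4 GB per rank on average, while the first log in this thread reports a single 42484 x 42484 matrix at about 1.38e+04 MB. Whether each rank really holds a full-size matrix in this run is not established in the thread; the sketch only shows how few such allocations fit on one node, assuming one MPI rank per core.

```python
# Assumptions: one MPI rank per core; all figures are taken from this thread.
NODE_MEMORY_GB = 256       # memory per node
RANKS_PER_NODE = 64        # one rank per core (assumed)
FULL_MATRIX_MB = 1.38e4    # LOC::DM allocation reported in the first log

per_rank_gb = NODE_MEMORY_GB / RANKS_PER_NODE
full_matrix_gb = FULL_MATRIX_MB / 1024
copies_per_node = int(NODE_MEMORY_GB // full_matrix_gb)

print(f"average memory per rank: {per_rank_gb:.1f} GB")
print(f"one full-size matrix:    {full_matrix_gb:.1f} GB")
print(f"full matrices per node:  {copies_per_node}")
```

If even a fraction of the ranks on a node allocate a matrix of this size, the node runs out of memory, which is one reason to test a smaller supercell or fewer ranks per node first.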

Here is the log I got:

                              ABACUS v3.2.4

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: unknown

 Wed Jun 28 12:25:22 2023
 MAKE THE DIR         : OUT.ABACUS/
 UNIFORM GRID DIM     : 864 * 864 * 375
 UNIFORM GRID DIM(BIG): 216 * 216 * 125
 DONE(5.09204    SEC) : SETUP UNITCELL
 DONE(5.17242    SEC) : INIT K-POINTS
 ---------------------------------------------------------
 Self-consistent calculations for electrons
 ---------------------------------------------------------
 SPIN    KPOINTS         PROCESSORS  NBASE       
 1       Gamma           320         42484       
 ---------------------------------------------------------
 Use Systematically Improvable Atomic bases
 ---------------------------------------------------------
 ELEMENT ORBITALS        NBASE       NATOM       XC          
 C       2s2p1d-7au      13          3268        
 ---------------------------------------------------------
 Initial plane wave basis and FFT box
 ---------------------------------------------------------
 -------------------------------------------
 SELF-CONSISTENT : 
 -------------------------------------------
slurmstepd: error: Detected 10 oom-kill event(s) in StepId=4857756.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: h17r4n26: task 38: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=4857756.0
slurmstepd: error: *** STEP 4857756.0 ON h17r4n26 CANCELLED AT 2023-06-28T12:30:29 ***
srun: error: h17r4n30: tasks 256-319: Terminated
srun: error: h17r4n29: tasks 192-255: Terminated
srun: error: h17r4n27: tasks 64-127: Terminated
srun: error: h17r4n28: tasks 128-191: Terminated
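
For checking whether a job was killed for lack of memory, a small sketch that scans the job output for the two markers visible in the log above. The marker list covers only what this particular Slurm output shows; other schedulers or configurations may word it differently.

```python
# Sketch: find out-of-memory messages in job output.
# The markers are the ones that appear in the Slurm log above; this list
# is an assumption and is not exhaustive.
OOM_MARKERS = ("oom-kill", "Out Of Memory")


def find_oom_lines(log_text):
    """Return the lines of log_text that contain any known OOM marker."""
    return [line for line in log_text.splitlines()
            if any(marker in line for marker in OOM_MARKERS)]


sample = """\
slurmstepd: error: Detected 10 oom-kill event(s) in StepId=4857756.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: h17r4n26: task 38: Out Of Memory
srun: error: h17r4n30: tasks 256-319: Terminated
"""

for line in find_oom_lines(sample):
    print(line)
```

The "Terminated" lines are not matched: those tasks were cancelled as a consequence of the failure on h17r4n26, not killed for memory themselves.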

caic99 commented 1 year ago

@Satinelamp Please contact your cluster admin about getting access to the computing node.

hongriTianqi commented 1 year ago