Program hangs/crashes

PeizeLin commented 1 year ago

Describe the bug

Two layers of graphene, 3268 C atoms. PBE, scf.

When DSIZE = 1, the program hangs for more than an hour. The last output is:

 Warning_Memory_Consuming allocated:  LOC::DM 1.38e+04 MB
 allocate DM , the dimension is 42484
        enter setAlltoallvParameter, nblk = 64
                                     pnum = 0
                                     prow = 0
                                     pcol = 0
                             nRow_in_proc = 42484
                             nCol_in_proc = 42484
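
The reported size of LOC::DM is consistent with a dense 42484 x 42484 matrix of 8-byte values held by a single process. Below is a minimal sketch of that arithmetic. It assumes double-precision elements and that "MB" in the warning means 2**20 bytes; neither is confirmed by the log. The factor 13 is the number of orbitals per C atom in the 2s2p1d basis reported later in this thread, and the function name is illustrative.

```python
# Sketch: size of a dense NBASE x NBASE matrix stored on one process.
# Assumptions (not confirmed by the log): 8-byte elements, 1 MB = 2**20 bytes.

def dense_matrix_mb(dim, bytes_per_element=8):
    """Return the size in MB of a dense dim x dim matrix."""
    return dim * dim * bytes_per_element / 2**20

n_atoms = 3268           # C atoms in the bilayer graphene cell
orbitals_per_atom = 13   # 2s2p1d: 2*1 + 2*3 + 1*5

dim = n_atoms * orbitals_per_atom
size_mb = dense_matrix_mb(dim)
print(f"dimension = {dim}, dense size = {size_mb:.3g} MB")
# prints: dimension = 42484, dense size = 1.38e+04 MB
```

With a 1 x 1 process grid (pnum = prow = pcol = 0), nRow_in_proc and nCol_in_proc equal the full dimension, so nothing is distributed and the whole matrix lives on one process.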

When DSIZE > 1, the program crashes immediately. (MPICH prints more error messages than Intel MPI.) The last functions entered before the crash are:

DSIZE==2:

LCAO ALGORITHM --------------- ION=   1  ELEC=   1--------------------------------
==> HSolverLCAO::solve 180.446 GB  246.753 s
==> HamiltLCAO::updateHk   180.446 GB  246.753 s
==> OperatorLCAO::init 180.446 GB  246.753 s
==> Overlap::contributeHR  180.446 GB  246.883 s
==> LCAO_gen_fixedH::calculate_S_no    180.446 GB  246.883 s
==> LCAO_gen_fixedH::build_ST_new  180.446 GB  246.883 s
==> Ekinetic<OperatorLCAO>::contributeHR   180.446 GB  247.228 s
==> LCAO_gen_fixedH::calculate_T_no    180.446 GB  247.228 s
==> LCAO_gen_fixedH::build_ST_new  180.446 GB  247.228 s
==> Nonlocal<OperatorLCAO>::contributeHR   180.446 GB  247.573 s
==> LCAO_gen_fixedH::calculate_NL_no   180.446 GB  247.573 s
==> LCAO_gen_fixedH::b_NL_beta_new 180.446 GB  247.573 s

DSIZE==4 and DSIZE==8:

LCAO ALGORITHM --------------- ION=   1  ELEC=   1--------------------------------
==> HSolverLCAO::solve 200.219 GB  181.235 s
==> HamiltLCAO::updateHk   200.219 GB  181.235 s
==> OperatorLCAO::init 200.219 GB  181.235 s
==> Overlap::contributeHR  200.219 GB  181.304 s
==> LCAO_gen_fixedH::calculate_S_no    200.219 GB  181.304 s
==> LCAO_gen_fixedH::build_ST_new  200.219 GB  181.304 s
==> Ekinetic<OperatorLCAO>::contributeHR   200.219 GB  181.497 s
==> LCAO_gen_fixedH::calculate_T_no    200.219 GB  181.497 s
==> LCAO_gen_fixedH::build_ST_new  200.219 GB  181.497 s
==> Nonlocal<OperatorLCAO>::contributeHR   200.219 GB  181.689 s
==> LCAO_gen_fixedH::calculate_NL_no   200.219 GB  181.689 s
==> LCAO_gen_fixedH::b_NL_beta_new 200.219 GB  181.689 s
==> OperatorLCAO::init 200.166 GB  218.081 s
==> Veff::contributeHk 200.166 GB  218.081 s
==> Gint_interface::cal_gint_vlocal    186.694 GB  220.366 s

And the error messages are:

[proxy:0:2@node021] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.1/src/pm/hydra/proxy/pmip_cb.c:480): assert (!closed) failed
[proxy:0:2@node021] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.1/src/pm/hydra/lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@node021] main (../../../../mpich-4.1/src/pm/hydra/proxy/pmip.c:127): demux engine error waiting for event
[proxy:0:5@node039] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.1/src/pm/hydra/proxy/pmip_cb.c:480): assert (!closed) failed
[proxy:0:5@node039] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.1/src/pm/hydra/lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:5@node039] main (../../../../mpich-4.1/src/pm/hydra/proxy/pmip.c:127): demux engine error waiting for event
[proxy:0:3@node026] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.1/src/pm/hydra/proxy/pmip_cb.c:480): assert (!closed) failed
[proxy:0:3@node026] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.1/src/pm/hydra/lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:3@node026] main (../../../../mpich-4.1/src/pm/hydra/proxy/pmip.c:127): demux engine error waiting for event
[proxy:0:7@node060] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.1/src/pm/hydra/proxy/pmip_cb.c:480): assert (!closed) failed
[proxy:0:7@node060] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.1/src/pm/hydra/lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:7@node060] main (../../../../mpich-4.1/src/pm/hydra/proxy/pmip.c:127): demux engine error waiting for event
[proxy:0:6@node048] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.1/src/pm/hydra/proxy/pmip_cb.c:480): assert (!closed) failed
[proxy:0:6@node048] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.1/src/pm/hydra/lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:6@node048] main (../../../../mpich-4.1/src/pm/hydra/proxy/pmip.c:127): demux engine error waiting for event
[proxy:0:4@node027] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.1/src/pm/hydra/proxy/pmip_cb.c:480): assert (!closed) failed
[proxy:0:4@node027] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.1/src/pm/hydra/lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:4@node027] main (../../../../mpich-4.1/src/pm/hydra/proxy/pmip.c:127): demux engine error waiting for event
[proxy:0:0@node009] HYD_pmcd_pmip_control_cmd_cb (../../../../mpich-4.1/src/pm/hydra/proxy/pmip_cb.c:480): assert (!closed) failed
[proxy:0:0@node009] HYDT_dmxu_poll_wait_for_event (../../../../mpich-4.1/src/pm/hydra/lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@node009] main (../../../../mpich-4.1/src/pm/hydra/proxy/pmip.c:127): demux engine error waiting for event
srun: error: node026: task 3: Exited with exit code 7
srun: error: node021: task 2: Exited with exit code 7
srun: error: node039: task 5: Exited with exit code 7
srun: error: node048: task 6: Exited with exit code 7
srun: error: node060: task 7: Exited with exit code 7
srun: error: node027: task 4: Exited with exit code 7
srun: error: node009: task 0: Exited with exit code 7
[mpiexec@node009] HYDT_bscu_wait_for_completion (../../../../mpich-4.1/src/pm/hydra/lib/tools/bootstrap/utils/bscu_wait.c:109): one of the processes terminated badly; aborting
[mpiexec@node009] HYDT_bsci_wait_for_completion (../../../../mpich-4.1/src/pm/hydra/lib/tools/bootstrap/src/bsci_wait.c:21): launcher returned error waiting for completion
[mpiexec@node009] HYD_pmci_wait_for_completion (../../../../mpich-4.1/src/pm/hydra/mpiexec/pmiserv_pmci.c:197): launcher returned error waiting for completion
[mpiexec@node009] main (../../../../mpich-4.1/src/pm/hydra/mpiexec/mpiexec.c:247): process manager error waiting for completion

C_16_k1_PBE.zip

Expected behavior

No response

To Reproduce

No response

Environment

Linux 3.10.0-1160.el7.x86_64, Red Hat 4.8.5-44
icpc 2021.5.0 (gcc 10.2.0)
intelmpi 2021.5 / mpich 4.1
mkl 2021.5
elpa_openmp 2021.11.002
cereal 1.3.2

Additional Context

Each ModuleBase::TITLE() call is printed in running_scf.log, together with the available memory and the time consumed.

caic99 commented 1 year ago

Hi @PeizeLin, would you please first check whether an OOM (out-of-memory) error occurred?

PeizeLin commented 1 year ago

> Hi @PeizeLin, would you please first check whether an OOM (out-of-memory) error occurred?

As shown in running_scf.log, each node still has 180 GB of available memory when the program crashes.

caic99 commented 1 year ago

@PeizeLin This might be related to errors in MPI communication. I noticed that your program also hangs; you could attach gdb to the hung process (gdb attach) to analyze the cause.
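
One way to inspect a hung process is to attach gdb in batch mode and print a backtrace of every thread. A minimal sketch follows. It assumes gdb is installed on the node where the process runs, that you have permission to trace the process, and that its PID is already known. The helper names are hypothetical; the gdb options used (`-p`, `-batch`, `-ex`) are standard.

```python
import subprocess

def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to `pid`, dumps all thread
    stacks, and exits (batch mode detaches from the process on exit)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]

def dump_backtrace(pid):
    """Run gdb against `pid` and return everything it printed."""
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        check=False,  # gdb may exit non-zero, e.g. if attaching is not permitted
    )
    return result.stdout + result.stderr

# Show the command that would be run for a hypothetical PID.
print(" ".join(gdb_backtrace_command(12345)))
```

Collecting the output of `dump_backtrace` for several MPI ranks and comparing the stacks can show whether the ranks are waiting in the same communication call.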

Satinelamp commented 1 year ago

@caic99 I am using a cluster where I can only log in to the master node, but the job has to be submitted to a compute node. How should I provide the PID to `gdb attach`? Here is what I got when using the top command on the master node:

Satinelamp commented 1 year ago

@PeizeLin I tried the test case on 5 nodes (each node has 64 cores and 256 GB of memory), but I still got an out-of-memory error. Maybe we should try a smaller supercell first so that we can rule out the memory issue.

Here is the log I got:

                              ABACUS v3.2.4

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: unknown

 Wed Jun 28 12:25:22 2023
 MAKE THE DIR         : OUT.ABACUS/
 UNIFORM GRID DIM     : 864 * 864 * 375
 UNIFORM GRID DIM(BIG): 216 * 216 * 125
 DONE(5.09204    SEC) : SETUP UNITCELL
 DONE(5.17242    SEC) : INIT K-POINTS
 ---------------------------------------------------------
 Self-consistent calculations for electrons
 ---------------------------------------------------------
 SPIN    KPOINTS         PROCESSORS  NBASE       
 1       Gamma           320         42484       
 ---------------------------------------------------------
 Use Systematically Improvable Atomic bases
 ---------------------------------------------------------
 ELEMENT ORBITALS        NBASE       NATOM       XC          
 C       2s2p1d-7au      13          3268        
 ---------------------------------------------------------
 Initial plane wave basis and FFT box
 ---------------------------------------------------------
 -------------------------------------------
 SELF-CONSISTENT : 
 -------------------------------------------
slurmstepd: error: Detected 10 oom-kill event(s) in StepId=4857756.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: h17r4n26: task 38: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=4857756.0
slurmstepd: error: *** STEP 4857756.0 ON h17r4n26 CANCELLED AT 2023-06-28T12:30:29 ***
srun: error: h17r4n30: tasks 256-319: Terminated
srun: error: h17r4n29: tasks 192-255: Terminated
srun: error: h17r4n27: tasks 64-127: Terminated
srun: error: h17r4n28: tasks 128-191: Terminated
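
The Slurm messages in this log can also be checked for mechanically. Here is a minimal sketch that scans job output for the two OOM patterns seen above. The regular expressions are derived only from these lines, so other Slurm versions may word the messages differently, and the function name is hypothetical.

```python
import re

# Patterns taken from the two OOM-related lines in the log above.
OOM_EVENT = re.compile(r"Detected (\d+) oom-kill event\(s\) in StepId=(\S+)")
OOM_TASK = re.compile(r"srun: error: (\S+): task (\d+): Out Of Memory")

def find_oom(log_text):
    """Return the oom-kill event counts per step and the (node, task) pairs
    that Slurm reported as out of memory."""
    events = [(int(n), step) for n, step in OOM_EVENT.findall(log_text)]
    tasks = [(node, int(task)) for node, task in OOM_TASK.findall(log_text)]
    return {"events": events, "tasks": tasks}

sample = (
    "slurmstepd: error: Detected 10 oom-kill event(s) in StepId=4857756.0 cgroup. "
    "Some of your processes may have been killed by the cgroup out-of-memory handler.\n"
    "srun: error: h17r4n26: task 38: Out Of Memory\n"
    "srun: error: h17r4n30: tasks 256-319: Terminated\n"
)
report = find_oom(sample)
print(report)
```

An empty result does not prove that memory was sufficient, since a process can also fail on an allocation error without triggering the cgroup handler.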

caic99 commented 1 year ago

@Satinelamp Please contact your cluster admin about getting access to the compute node.

hongriTianqi commented 1 year ago

[x] Verify the issue is not a duplicate.
[x] Describe the bug.
[ ] Steps to reproduce.
[ ] Expected behavior.
[ ] Error message.
[ ] Environment details.
[ ] Additional context.
[ ] Assign a priority level (low, medium, high, urgent).
[ ] Assign the issue to a team member.
[ ] Label the issue with relevant tags.
[ ] Identify possible related issues.
[ ] Create a unit test or automated test to reproduce the bug (if applicable).
[ ] Fix the bug.
[ ] Test the fix.
[ ] Update documentation (if necessary).
[ ] Close the issue and inform the reporter (if applicable).

deepmodeling / abacus-develop

Program hangs/crashes #1936
