deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0

Irregular behaviour when using different CPU combinations #5548

Open JTaozhang opened 1 day ago

JTaozhang commented 1 day ago

Describe the bug

Hi there,

Currently, I am working on a bilayer WTe2 system that contains about 504 atoms. I am trying to calculate the band structure with a k-point mesh of 11 along the high-symmetry path. The software version is v3.8.2. With the same INPUT and KPT settings, but different CPU combinations, one run works and the other behaves abnormally: it reports nothing, no error and no useful information. The working run uses mpirun -np 8 -env OMP_NUM_THREADS=28 with 224 CPUs in total (8 nodes, 56 CPUs per node); the failing run uses mpirun -np 20 -env OMP_NUM_THREADS=28 with 560 CPUs in total (10 nodes, 56 CPUs per node).
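
For reference, a minimal sketch of the failing launch written as a job script. The scheduler directives (Slurm is assumed here) and the way OMP_NUM_THREADS is forwarded are assumptions, since the real submission scripts are only in the attachments and the site setup is unknown:

    #!/bin/bash
    #SBATCH --nodes=10             # assumption: Slurm scheduler; 10 nodes with 56 cores each
    #SBATCH --ntasks=20            # 20 MPI ranks -> 2 ranks per node
    #SBATCH --cpus-per-task=28     # 28 OpenMP threads per rank

    export OMP_NUM_THREADS=28
    # Intel MPI launch (intel2022 is sourced in the Environment section);
    # -env forwards OMP_NUM_THREADS to every rank
    mpirun -np 20 -env OMP_NUM_THREADS=28 abacus

The working case would be the same script with --nodes=8 and -np 8.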

For the abnormal job, the whole output is shown below:

                              ABACUS v3.8.2

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: unknown

    Start Time is Wed Nov 20 18:43:57 2024

 ------------------------------------------------------------------------------------

 READING GENERAL INFORMATION

                           global_out_dir = OUT.WTe2/
                           global_in_card = INPUT
                               pseudo_dir = /share/home/zhangtao/work/WTe2/abacus/pseudo/
                              orbital_dir = /share/home/zhangtao/work/WTe2/abacus/orbital/

I don't know what causes this abnormal behavior; could you test the code? I think the parallel calculation part may still have some instability. I also discussed this problem in the WeChat online group, and somebody suggested that I open an issue here, so I am doing that.

The related files are here: WTe2.zip

Expected behavior

The second submission setting should run faster than the first setting.

To Reproduce

  1. Calculate the charge density files according to INPUT_scf and KPT_scf.
  2. After obtaining the charge density files, do the band calculation according to INPUT and KPT (a sketch of this two-step workflow is given after this list).
  3. Compare the results and check the code.
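
A minimal sketch of this two-step workflow, assuming an LCAO basis and the usual ABACUS keywords for writing and reading the charge density (out_chg, init_chg) and for band output (out_band); the actual INPUT files are in the attached WTe2.zip, so the values here are only illustrative:

    # Step 1: SCF run that writes the charge density (KPT_scf used as KPT)
    cat > INPUT <<'EOF'
    INPUT_PARAMETERS
    calculation   scf
    basis_type    lcao
    out_chg       1        # write the charge density files to the output directory
    EOF
    OMP_NUM_THREADS=28 mpirun -np 8 abacus

    # Step 2: non-SCF band calculation along the high-symmetry path in KPT
    cat > INPUT <<'EOF'
    INPUT_PARAMETERS
    calculation   nscf
    basis_type    lcao
    init_chg      file     # read the charge density written in step 1
    out_band      1        # write the band structure
    EOF
    OMP_NUM_THREADS=28 mpirun -np 8 abacus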

Environment

module load cmake/cmake-3.25 gnu/12.1.0

source /share/apps/intel2022/setvars.sh
source /share/home/zhangtao/software/abacus-develop-3.8.3/toolchain/install/setup

Additional Context

No more information is needed.

Task list for Issue attackers (only for developers)

QuantumMisaka commented 9 hours ago

Hi @JTaozhang, from my experience this problem comes from your job submission scripts and the server settings, but the provided files do not contain any information about them; please provide them in detail. I have done parallel computation with the HSE functional using OMP_NUM_THREADS=16 mpirun -np 32 abacus and it works well.
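
For comparison, here is a sketch of that kind of hybrid launch with explicit thread pinning, which is often the part a submission script gets wrong. The pinning variables below are suggestions, not settings taken from this issue: OMP_PLACES/OMP_PROC_BIND are standard OpenMP, and I_MPI_PIN_DOMAIN assumes Intel MPI (as in the reporter's environment).

    export OMP_NUM_THREADS=16
    export OMP_PLACES=cores        # bind each OpenMP thread to a physical core
    export OMP_PROC_BIND=close
    export I_MPI_PIN_DOMAIN=omp    # Intel MPI: give each rank a domain of OMP_NUM_THREADS cores
    mpirun -np 32 abacus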

JTaozhang commented 2 hours ago

Hi, thanks for your reply. I have attached the submission scripts here. The combination "mpirun -np 8 -env OMP_NUM_THREADS=28, 224 CPUs in total (8 nodes, 56 CPUs per node)" works. However, "mpirun -np 20 -env OMP_NUM_THREADS=28, 560 CPUs in total (10 nodes, 56 CPUs per node)" fails.

I think different machines have different settings, so I am not sure you can reproduce my case on your machine. Maybe you could vary your CPU combination using my atomic system to check this problem.

One more question: fewer tasks on a node means the node's memory is shared among fewer tasks, right? And OMP_NUM_THREADS decides how many CPUs are assigned to each task, which governs the parallel computing.
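
As a concrete check of that arithmetic (numbers taken from this thread; this is only a sketch of how ranks and threads map onto nodes, not a measured result):

    # total cores requested = MPI ranks x OMP_NUM_THREADS
    # working run:  8 ranks x 28 threads = 224 cores on 8 nodes
    #               -> 1 rank per node, 28 of the 56 cores per node in use
    # failing run: 20 ranks x 28 threads = 560 cores on 10 nodes
    #               -> 2 ranks per node, all 56 cores in use, and the two ranks
    #                  share the node's memory
    echo $((20 * 28))   # 560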

20-28.zip

Best, Tao