deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0

DCU calculation error (Device) #4063

Closed pxlxingliang closed 5 months ago

pxlxingliang commented 6 months ago

Describe the bug

In the DCU daily test on 04/27, one example (005) failed with the error below before SCF started:

Invalid address access: 0x4ab0ba402000, Error code: 1.

>>>>>>>> KERNEL VMFault !!!! <<<<<<

>>>>>>>> PID: 2872 !!!! <<<<<<
=========> STREAM <0x33fba80>: VMFault HSA QUEUE ANALYSIS <=========
STREAM <0x33fba80>: get hsa queue W/R ptr: write index: 0, read index: 0
STREAM <0x33fba80>: FAILED: hsa queue is null!
=========> STREAM <0x35f5b10>: VMFault HSA QUEUE ANALYSIS <=========
STREAM <0x35f5b10>: get hsa queue W/R ptr: write index: 2, read index: 0
STREAM <0x35f5b10>: >>>>>>>> DUMP KERNEL AQL PACKET <<<<<<<<<
STREAM <0x35f5b10>: header: 770
STREAM <0x35f5b10>: setup: 3
STREAM <0x35f5b10>: workgroup: x:256, y:1, z:1
STREAM <0x35f5b10>: grid: x:47460352, y:1, z:1
STREAM <0x35f5b10>: group_segment_size: 0
STREAM <0x35f5b10>: private_segment_size: 0
STREAM <0x35f5b10>: kernel_object: 46914453789440

SUCCESS: FIND SAME KERNEL OBJECT COMMAND IN USE LIST. useIdx: 0
STREAM <0x35f5b10>: >>>>>>>> FIND MATCH KERNEL COMMAND <<<<<<<<<
STREAM <0x35f5b10>: kernel name: _ZN3psi6memory11cast_memoryIddEEvPSt7complexIT_EPKS2_IT0_Ei
STREAM <0x35f5b10>: >>>>>>>> DUMP KERNEL ARGS: size: 20 <<<<<<<<<

00 00 c0 8c b0 2a 00 00 00 00 40 ba b0 2a 00 00 
24 2f d4 02 

STREAM <0x35f5b10>: >>>>>>>> DUMP KERNEL ARGS PTR INFO <<<<<<<<<
STREAM <0x35f5b10>: ptr arg index: 0, ptr: 0x2ab08cc00000
STREAM <0x35f5b10>: host ptr: 0x2ab08cc00000, device ptr: 0x2ab08cc00000, unaligned ptr: 0x2ab08cc00000
STREAM <0x35f5b10>: size byte: 759362112
STREAM <0x35f5b10>: ptr arg index: 1, ptr: 0x2ab0ba400000
STREAM <0x35f5b10>: host ptr: 0x2ab0ba400000, device ptr: 0x2ab0ba400000, unaligned ptr: 0x2ab0ba400000
STREAM <0x35f5b10>: size byte: 759362112

=========> STREAM <0x355a2a0>: VMFault HSA QUEUE ANALYSIS <=========
STREAM <0x355a2a0>: get hsa queue W/R ptr: write index: 0, read index: 0
STREAM <0x355a2a0>: FAILED: hsa queue is null!
=========> STREAM <0x34bea30>: VMFault HSA QUEUE ANALYSIS <=========
STREAM <0x34bea30>: get hsa queue W/R ptr: write index: 0, read index: 0
STREAM <0x34bea30>: FAILED: hsa queue is null!

>>>>>>>> KERNEL VMFault Analysis END !!!! <<<<<<

[b03r3n11:02872] *** Process received signal ***
[b03r3n11:02872] Signal: Aborted (6)
[b03r3n11:02872] Signal code:  (-6)
[b03r3n11:02872] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2aaab287b5d0]
[b03r3n11:02872] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2aaabb913207]
[b03r3n11:02872] [ 2] /lib64/libc.so.6(abort+0x148)[0x2aaabb9148f8]
[b03r3n11:02872] [ 3] /public/software/compiler/rocm/dtk-22.10/lib/libgalaxyhip.so.5(+0x98e7d4)[0x2aaab361a7d4]
[b03r3n11:02872] [ 4] /public/software/compiler/rocm/dtk-22.10/lib/libgalaxyhip.so.5(+0x98d0fe)[0x2aaab36190fe]
[b03r3n11:02872] [ 5] /public/software/compiler/rocm/dtk-22.10/lib/libgalaxyhip.so.5(+0x952086)[0x2aaab35de086]
[b03r3n11:02872] [ 6] /lib64/libpthread.so.0(+0x7dd5)[0x2aaab2873dd5]
[b03r3n11:02872] [ 7] /lib64/libc.so.6(clone+0x6d)[0x2aaabb9daead]
[b03r3n11:02872] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 2872 on node b03r3n11 exited on signal 6 (Aborted).
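The kernel argument dump above can be cross-checked against the reported fault address. The following is a minimal Python sketch, not part of ABACUS or the DTK tooling. It assumes the 20 argument bytes are laid out as two 64-bit pointers followed by a 32-bit integer in little-endian order, which is what the mangled kernel name (apparently `psi::memory::cast_memory<double, double>` taking two `std::complex<double>` pointers and an `int`) suggests. The names `dst`, `src`, and `size` are illustrative labels, not taken from the source code.

```python
import struct

# Values copied verbatim from the log above.
ARGS_HEX = "00 00 c0 8c b0 2a 00 00 00 00 40 ba b0 2a 00 00 24 2f d4 02"
FAULT_ADDR = 0x4AB0BA402000   # "Invalid address access"
BUFFER_BYTES = 759362112      # "size byte" reported for both pointer args
WORKGROUP_X = 256             # "workgroup: x:256"
GRID_X = 47460352             # "grid: x:47460352"
COMPLEX_DOUBLE_BYTES = 16     # sizeof(std::complex<double>)

# Assumed layout: pointer, pointer, int (little-endian, no padding).
raw = bytes.fromhex(ARGS_HEX)
dst, src, size = struct.unpack("<QQi", raw)


def in_buffer(addr, base, nbytes=BUFFER_BYTES):
    """Return True if addr lies inside [base, base + nbytes)."""
    return base <= addr < base + nbytes


# Element count rounded up to a whole number of workgroups.
rounded_grid = -(-size // WORKGROUP_X) * WORKGROUP_X

print(f"dst  = {dst:#x}")
print(f"src  = {src:#x}")
print(f"size = {size} elements")
print("size * 16 matches buffer bytes:", size * COMPLEX_DOUBLE_BYTES == BUFFER_BYTES)
print("grid is size rounded up to workgroup:", rounded_grid == GRID_X)
print("fault inside dst buffer:", in_buffer(FAULT_ADDR, dst))
print("fault inside src buffer:", in_buffer(FAULT_ADDR, src))
print(f"fault - src = {FAULT_ADDR - src:#x}")
```

Under this assumed layout the decoded pointers equal the two addresses listed in the "DUMP KERNEL ARGS PTR INFO" section, the element count times 16 bytes equals the reported buffer size, and the grid size is the element count rounded up to a multiple of the workgroup size. The faulting address falls in neither buffer, so the launch parameters themselves look self-consistent.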

The job stopped at the following point:

                              ABACUS v3.6.2

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: 7f84a09 (Fri Apr 26 11:07:47 2024 +0000)

 Sat Apr 27 01:21:57 2024
 MAKE THE DIR         : OUT.ABACUS/
 RUNNING WITH DEVICE  : GPU / Device 66a1

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 Warning: the number of valence electrons in pseudopotential > 1 for Na: [Ne] 3s1
 Pseudopotentials with additional electrons can yield (more) accurate outcomes, but may be less efficient.
 If you're confident that your chosen pseudopotential is appropriate, you can safely ignore this warning.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 UNIFORM GRID DIM        : 60 * 60 * 60
 UNIFORM GRID DIM(BIG)   : 60 * 60 * 60
 DONE(0.314765   SEC) : SETUP UNITCELL
 DONE(0.371343   SEC) : SYMMETRY
 DONE(0.557475   SEC) : INIT K-POINTS
 ---------------------------------------------------------
 Self-consistent calculations for electrons
 ---------------------------------------------------------
 SPIN    KPOINTS         PROCESSORS  
 1       172             4           
 ---------------------------------------------------------
 Use plane wave basis
 ---------------------------------------------------------
 ELEMENT NATOM       XC          
 Na      16          
 ---------------------------------------------------------
 Initial plane wave basis and FFT box
 ---------------------------------------------------------
 DONE(0.600742   SEC) : INIT PLANEWAVE
 MEMORY FOR PSI (MB)  : 725.031
 DONE(1.38595    SEC) : LOCAL POTENTIAL
 DONE(1.43937    SEC) : NON-LOCAL POTENTIAL
 DONE(1.61337    SEC) : INIT BASIS
 -------------------------------------------
 SELF-CONSISTENT : 
 -------------------------------------------
 START CHARGE      : atomic

Job address: https://app.bohrium.dp.tech/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fsched-abacustest-dcu-cg-e4fd08

Expected behavior

No response

To Reproduce

No response

Environment

No response

Additional Context

No response


WHUweiqingzhou commented 6 months ago

@denghuilu could you have a look?

denghuilu commented 6 months ago

I have no idea why the same test produces such large fluctuations at different times. Please update the DTK version and retest those daily tests again.

denghuilu commented 6 months ago

I cannot reproduce this. Here is the rerun log at the same commit as this issue:

denghuilu commented 6 months ago

[aisi@b01r4n18:005_16Na-new]$ mpirun -n 4 ../../abacus-develop/build-dtk-22.10/abacus_pw 
WARNING: Total thread number on this node mismatches with hardware availability. This may cause poor performance.
Info: Local MPI proc number: 4,OpenMP thread number: 1,Total thread number: 4,Local thread limit: 32

                              ABACUS v3.6.2

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: 7f84a09 (Fri Apr 26 11:07:47 2024 +0000)

 Mon May  6 19:28:13 2024
 MAKE THE DIR         : OUT.ABACUS/
 RUNNING WITH DEVICE  : GPU / Device 66a1

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 Warning: the number of valence electrons in pseudopotential > 1 for Na: [Ne] 3s1
 Pseudopotentials with additional electrons can yield (more) accurate outcomes, but may be less efficient.
 If you're confident that your chosen pseudopotential is appropriate, you can safely ignore this warning.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 UNIFORM GRID DIM        : 60 * 60 * 60
 UNIFORM GRID DIM(BIG)   : 60 * 60 * 60
 DONE(0.308965   SEC) : SETUP UNITCELL
 DONE(0.370277   SEC) : SYMMETRY
 DONE(0.570227   SEC) : INIT K-POINTS
 ---------------------------------------------------------
 Self-consistent calculations for electrons
 ---------------------------------------------------------
 SPIN    KPOINTS         PROCESSORS  
 1       172             4           
 ---------------------------------------------------------
 Use plane wave basis
 ---------------------------------------------------------
 ELEMENT NATOM       XC          
 Na      16          
 ---------------------------------------------------------
 Initial plane wave basis and FFT box
 ---------------------------------------------------------
 DONE(0.613587   SEC) : INIT PLANEWAVE
 MEMORY FOR PSI (MB)  : 725.031
 DONE(1.42576    SEC) : LOCAL POTENTIAL
 DONE(1.47928    SEC) : NON-LOCAL POTENTIAL
 DONE(1.59711    SEC) : INIT BASIS
 -------------------------------------------
 SELF-CONSISTENT : 
 -------------------------------------------
 START CHARGE      : atomic
 DONE(2.71444    SEC) : INIT SCF
 ITER   ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)    
 CG1    -1.852558e+04  0.000000e+00   4.452e-01  8.279e+01  
 CG2    -1.852683e+04  -1.250318e+00  1.737e-02  1.454e+01  
 CG3    -1.852684e+04  -9.928414e-03  2.124e-03  1.350e+01  
 CG4    -1.852684e+04  -9.865474e-04  1.693e-05  1.224e+01  
 CG5    -1.852684e+04  -1.020882e-04  3.555e-06  1.925e+01  
 CG6    -1.852684e+04  -3.740692e-07  3.261e-06  1.711e+01  
 CG7    -1.852684e+04  -5.870394e-06  2.429e-08  1.371e+01  
----------------------------------------------------------------
TOTAL-STRESS (KBAR)                                           
----------------------------------------------------------------
      369.1146324675        10.4107742035        -1.7071149309
       10.4107742035       375.2382390454        -9.1015912635
       -1.7071149309        -9.1015912635       371.9241383609
----------------------------------------------------------------
 TOTAL-PRESSURE: 372.092337 KBAR

TIME STATISTICS
-------------------------------------------------------------------------------------
     CLASS_NAME                 NAME             TIME(Sec)  CALLS   AVG(Sec) PER(%)
-------------------------------------------------------------------------------------
                     total                       185.76          17  10.93   100.00
Driver               reading                       0.24           1   0.24     0.13
Input                Init                          0.04           1   0.04     0.02
Input_Conv           Convert                       0.18           1   0.18     0.10
Driver               driver_line                 185.52           1 185.52    99.87
UnitCell             check_tau                     0.00           1   0.00     0.00
PW_Basis_Sup         setuptransform                0.01           1   0.01     0.00
PW_Basis_Sup         distributeg                   0.00           1   0.00     0.00
mymath               heapsort                      0.02          41   0.00     0.01
Symmetry             analy_sys                     0.00           1   0.00     0.00
PW_Basis_K           setuptransform                0.03           1   0.03     0.01
PW_Basis_K           distributeg                   0.00           1   0.00     0.00
PW_Basis             setup_struc_factor            0.09           1   0.09     0.05
ppcell_vnl           init                          0.01           1   0.01     0.00
ppcell_vl            init_vloc                     0.70           1   0.70     0.38
ppcell_vnl           init_vnl                      0.05           1   0.05     0.03
WF_atomic            init_at_1                     0.00           1   0.00     0.00
wavefunc             wfcinit                       0.01           1   0.01     0.00
Ions                 opt_ions                    184.03           1 184.03    99.07
ESolver_KS_PW        run                         174.70           1 174.70    94.05
H_Ewald_pw           compute_ewald                 0.01           1   0.01     0.00
Charge               set_rho_core                  0.00           1   0.00     0.00
Charge               atomic_rho                    0.76           1   0.76     0.41
PW_Basis_Sup         recip2real                    0.59          60   0.01     0.32
PW_Basis_Sup         gathers_scatterp              0.03          60   0.00     0.01
Potential            init_pot                      0.28           1   0.28     0.15
Potential            update_from_charge            2.09           8   0.26     1.13
Potential            cal_fixed_v                   0.01           1   0.01     0.01
PotLocal             cal_fixed_v                   0.01           1   0.01     0.01
Potential            cal_v_eff                     2.08           8   0.26     1.12
H_Hartree_pw         v_hartree                     0.18           8   0.02     0.09
PW_Basis_Sup         real2recip                    0.74          79   0.01     0.40
PW_Basis_Sup         gatherp_scatters              0.02          79   0.00     0.01
PotXC                cal_v_eff                     1.90           8   0.24     1.02
XC_Functional        v_xc                          1.89           8   0.24     1.02
Potential            interpolate_vrs               0.00           8   0.00     0.00
Symmetry             rhog_symmetry                 0.25           9   0.03     0.13
Symmetry             group fft grids               0.08           9   0.01     0.04
Charge_Mixing        init_mixing                   0.00           1   0.00     0.00
ESolver_KS_PW        hamilt2density              171.04           8  21.38    92.08
HSolverPW            solve                       170.63           8  21.33    91.86
Nonlocal             getvnl                        0.49         344   0.00     0.26
pp_cell_vnl          getvnl                        0.57         430   0.00     0.31
Structure_Factor     get_sk                        1.09        3870   0.00     0.59
WF_atomic            atomic_wfc                    0.22          43   0.01     0.12
DiagoIterAssist      diagH_subspace_init           5.73          43   0.13     3.09
Operator             hPsi                         79.78      115332   0.00    42.95
Operator             EkineticPW                    6.46      115332   0.00     3.48
Operator             VeffPW                       53.23      115332   0.00    28.65
PW_Basis_K           recip_to_real gpu            29.52      170501   0.00    15.89
PW_Basis_K           real_to_recip gpu            22.86      140917   0.00    12.31
Operator             NonlocalPW                   19.41      115332   0.00    10.45
Nonlocal             add_nonlocal_pp              15.01      115332   0.00     8.08
DiagoIterAssist      diagH_LAPACK                  1.37         301   0.00     0.74
DiagoCG              diag_once                   132.90         344   0.39    71.55
DiagoCG_New          spsi_func                     8.77      230062   0.00     4.72
DiagoCG_New          hpsi_func                    69.90      115031   0.00    37.63
ElecStatePW          psiToRho                      6.54           8   0.82     3.52
Charge               rho_mpi                       0.01           8   0.00     0.00
Charge               reduce_diff_pools             0.01           8   0.00     0.00
Charge_Mixing        get_drho                      0.16           8   0.02     0.09
Charge_Mixing        inner_product_recip_rho       0.01           8   0.00     0.00
Charge               mix_rho                       0.10           6   0.02     0.05
Charge               Broyden_mixing                0.02           6   0.00     0.01
DiagoIterAssist      diagH_subspace               11.19         258   0.04     6.02
Charge_Mixing        inner_product_recip_hartree   0.02          30   0.00     0.01
Forces               cal_force_loc                 0.08           1   0.08     0.04
Forces               cal_force_ew                  0.07           1   0.07     0.04
Forces               cal_force_nl                  0.44           1   0.44     0.24
Forces               cal_force_cc                  0.00           1   0.00     0.00
Forces               cal_force_scc                 0.87           1   0.87     0.47
Stress_PW            cal_stress                    7.86           1   7.86     4.23
Stress_Func          stress_kin                    1.09           1   1.09     0.59
Stress_Func          stress_har                    0.01           1   0.01     0.01
Stress_Func          stress_ewa                    0.08           1   0.08     0.05
Stress_Func          stress_gga                    0.15           1   0.15     0.08
Stress_Func          stress_loc                    1.16           1   1.16     0.62
Stress_Func          stress_cc                     0.00           1   0.00     0.00
Stress_Func          stress_nl                     5.36           1   5.36     2.89
ModuleIO             write_istate_info             0.13           1   0.13     0.07
-------------------------------------------------------------------------------------

 START  Time  : Mon May  6 19:28:13 2024
 FINISH Time  : Mon May  6 19:31:19 2024
 TOTAL  Time  : 186
 SEE INFORMATION IN : OUT.ABACUS/

pxlxingliang commented 6 months ago

I have 3 more cases with a similar error. However, they ran normally when I resubmitted the jobs 2 days later. All 3 jobs ran on node j20r4n07, so I suspect the problem lies with that node.

e.zip
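One way to test a suspicion like this is to count aborted ranks per node across the failed job logs. The sketch below is a hypothetical helper, not an existing abacustest feature. It relies only on the `mpirun` summary line in the form it takes in the log at the top of this issue, and the only sample line used is that one (which names node b03r3n11). The contents of e.zip are not assumed.

```python
import re
from collections import Counter

# Pattern for the mpirun summary line as it appears in the log above.
ABORT_RE = re.compile(
    r"process rank (\d+) with PID (\d+) on node (\S+) exited on signal (\d+)"
)


def failures_by_node(log_lines):
    """Count aborted ranks per node name found in the given log lines."""
    counts = Counter()
    for line in log_lines:
        match = ABORT_RE.search(line)
        if match:
            counts[match.group(3)] += 1
    return counts


# The single abort line taken from the failed run reported in this issue.
sample = [
    "mpirun noticed that process rank 3 with PID 2872 on node b03r3n11 "
    "exited on signal 6 (Aborted).",
]

for node, count in failures_by_node(sample).most_common():
    print(node, count)
```

Feeding the lines of every failed log into `failures_by_node` would show whether the aborts cluster on one node or are spread across several.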

WHUweiqingzhou commented 5 months ago

This issue is caused by a machine problem and is not related to ABACUS.