pxlxingliang closed this issue 5 months ago
@denghuilu could you have a look?
I have no idea why the same test fluctuates so much between runs. Please update the DTK version and rerun those daily tests.
Cannot reproduce. Here is the rerun log, using the same commit as in this issue:
[aisi@b01r4n18:005_16Na-new]$ mpirun -n 4 ../../abacus-develop/build-dtk-22.10/abacus_pw
WARNING: Total thread number on this node mismatches with hardware availability. This may cause poor performance.
Info: Local MPI proc number: 4,OpenMP thread number: 1,Total thread number: 4,Local thread limit: 32
ABACUS v3.6.2
Atomic-orbital Based Ab-initio Computation at UStc
Website: http://abacus.ustc.edu.cn/
Documentation: https://abacus.deepmodeling.com/
Repository: https://github.com/abacusmodeling/abacus-develop
https://github.com/deepmodeling/abacus-develop
Commit: 7f84a09 (Fri Apr 26 11:07:47 2024 +0000)
Mon May 6 19:28:13 2024
MAKE THE DIR : OUT.ABACUS/
RUNNING WITH DEVICE : GPU / Device 66a1
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Warning: the number of valence electrons in pseudopotential > 1 for Na: [Ne] 3s1
Pseudopotentials with additional electrons can yield (more) accurate outcomes, but may be less efficient.
If you're confident that your chosen pseudopotential is appropriate, you can safely ignore this warning.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
UNIFORM GRID DIM : 60 * 60 * 60
UNIFORM GRID DIM(BIG) : 60 * 60 * 60
DONE(0.308965 SEC) : SETUP UNITCELL
DONE(0.370277 SEC) : SYMMETRY
DONE(0.570227 SEC) : INIT K-POINTS
---------------------------------------------------------
Self-consistent calculations for electrons
---------------------------------------------------------
SPIN KPOINTS PROCESSORS
1 172 4
---------------------------------------------------------
Use plane wave basis
---------------------------------------------------------
ELEMENT NATOM XC
Na 16
---------------------------------------------------------
Initial plane wave basis and FFT box
---------------------------------------------------------
DONE(0.613587 SEC) : INIT PLANEWAVE
MEMORY FOR PSI (MB) : 725.031
DONE(1.42576 SEC) : LOCAL POTENTIAL
DONE(1.47928 SEC) : NON-LOCAL POTENTIAL
DONE(1.59711 SEC) : INIT BASIS
-------------------------------------------
SELF-CONSISTENT :
-------------------------------------------
START CHARGE : atomic
DONE(2.71444 SEC) : INIT SCF
ITER ETOT(eV) EDIFF(eV) DRHO TIME(s)
CG1 -1.852558e+04 0.000000e+00 4.452e-01 8.279e+01
CG2 -1.852683e+04 -1.250318e+00 1.737e-02 1.454e+01
CG3 -1.852684e+04 -9.928414e-03 2.124e-03 1.350e+01
CG4 -1.852684e+04 -9.865474e-04 1.693e-05 1.224e+01
CG5 -1.852684e+04 -1.020882e-04 3.555e-06 1.925e+01
CG6 -1.852684e+04 -3.740692e-07 3.261e-06 1.711e+01
CG7 -1.852684e+04 -5.870394e-06 2.429e-08 1.371e+01
----------------------------------------------------------------
TOTAL-STRESS (KBAR)
----------------------------------------------------------------
369.1146324675 10.4107742035 -1.7071149309
10.4107742035 375.2382390454 -9.1015912635
-1.7071149309 -9.1015912635 371.9241383609
----------------------------------------------------------------
TOTAL-PRESSURE: 372.092337 KBAR
TIME STATISTICS
-------------------------------------------------------------------------------------
CLASS_NAME NAME TIME(Sec) CALLS AVG(Sec) PER(%)
-------------------------------------------------------------------------------------
total 185.76 17 10.93 100.00
Driver reading 0.24 1 0.24 0.13
Input Init 0.04 1 0.04 0.02
Input_Conv Convert 0.18 1 0.18 0.10
Driver driver_line 185.52 1 185.52 99.87
UnitCell check_tau 0.00 1 0.00 0.00
PW_Basis_Sup setuptransform 0.01 1 0.01 0.00
PW_Basis_Sup distributeg 0.00 1 0.00 0.00
mymath heapsort 0.02 41 0.00 0.01
Symmetry analy_sys 0.00 1 0.00 0.00
PW_Basis_K setuptransform 0.03 1 0.03 0.01
PW_Basis_K distributeg 0.00 1 0.00 0.00
PW_Basis setup_struc_factor 0.09 1 0.09 0.05
ppcell_vnl init 0.01 1 0.01 0.00
ppcell_vl init_vloc 0.70 1 0.70 0.38
ppcell_vnl init_vnl 0.05 1 0.05 0.03
WF_atomic init_at_1 0.00 1 0.00 0.00
wavefunc wfcinit 0.01 1 0.01 0.00
Ions opt_ions 184.03 1 184.03 99.07
ESolver_KS_PW run 174.70 1 174.70 94.05
H_Ewald_pw compute_ewald 0.01 1 0.01 0.00
Charge set_rho_core 0.00 1 0.00 0.00
Charge atomic_rho 0.76 1 0.76 0.41
PW_Basis_Sup recip2real 0.59 60 0.01 0.32
PW_Basis_Sup gathers_scatterp 0.03 60 0.00 0.01
Potential init_pot 0.28 1 0.28 0.15
Potential update_from_charge 2.09 8 0.26 1.13
Potential cal_fixed_v 0.01 1 0.01 0.01
PotLocal cal_fixed_v 0.01 1 0.01 0.01
Potential cal_v_eff 2.08 8 0.26 1.12
H_Hartree_pw v_hartree 0.18 8 0.02 0.09
PW_Basis_Sup real2recip 0.74 79 0.01 0.40
PW_Basis_Sup gatherp_scatters 0.02 79 0.00 0.01
PotXC cal_v_eff 1.90 8 0.24 1.02
XC_Functional v_xc 1.89 8 0.24 1.02
Potential interpolate_vrs 0.00 8 0.00 0.00
Symmetry rhog_symmetry 0.25 9 0.03 0.13
Symmetry group fft grids 0.08 9 0.01 0.04
Charge_Mixing init_mixing 0.00 1 0.00 0.00
ESolver_KS_PW hamilt2density 171.04 8 21.38 92.08
HSolverPW solve 170.63 8 21.33 91.86
Nonlocal getvnl 0.49 344 0.00 0.26
pp_cell_vnl getvnl 0.57 430 0.00 0.31
Structure_Factor get_sk 1.09 3870 0.00 0.59
WF_atomic atomic_wfc 0.22 43 0.01 0.12
DiagoIterAssist diagH_subspace_init 5.73 43 0.13 3.09
Operator hPsi 79.78 115332 0.00 42.95
Operator EkineticPW 6.46 115332 0.00 3.48
Operator VeffPW 53.23 115332 0.00 28.65
PW_Basis_K recip_to_real gpu 29.52 170501 0.00 15.89
PW_Basis_K real_to_recip gpu 22.86 140917 0.00 12.31
Operator NonlocalPW 19.41 115332 0.00 10.45
Nonlocal add_nonlocal_pp 15.01 115332 0.00 8.08
DiagoIterAssist diagH_LAPACK 1.37 301 0.00 0.74
DiagoCG diag_once 132.90 344 0.39 71.55
DiagoCG_New spsi_func 8.77 230062 0.00 4.72
DiagoCG_New hpsi_func 69.90 115031 0.00 37.63
ElecStatePW psiToRho 6.54 8 0.82 3.52
Charge rho_mpi 0.01 8 0.00 0.00
Charge reduce_diff_pools 0.01 8 0.00 0.00
Charge_Mixing get_drho 0.16 8 0.02 0.09
Charge_Mixing inner_product_recip_rho 0.01 8 0.00 0.00
Charge mix_rho 0.10 6 0.02 0.05
Charge Broyden_mixing 0.02 6 0.00 0.01
DiagoIterAssist diagH_subspace 11.19 258 0.04 6.02
Charge_Mixing inner_product_recip_hartree 0.02 30 0.00 0.01
Forces cal_force_loc 0.08 1 0.08 0.04
Forces cal_force_ew 0.07 1 0.07 0.04
Forces cal_force_nl 0.44 1 0.44 0.24
Forces cal_force_cc 0.00 1 0.00 0.00
Forces cal_force_scc 0.87 1 0.87 0.47
Stress_PW cal_stress 7.86 1 7.86 4.23
Stress_Func stress_kin 1.09 1 1.09 0.59
Stress_Func stress_har 0.01 1 0.01 0.01
Stress_Func stress_ewa 0.08 1 0.08 0.05
Stress_Func stress_gga 0.15 1 0.15 0.08
Stress_Func stress_loc 1.16 1 1.16 0.62
Stress_Func stress_cc 0.00 1 0.00 0.00
Stress_Func stress_nl 5.36 1 5.36 2.89
ModuleIO write_istate_info 0.13 1 0.13 0.07
-------------------------------------------------------------------------------------
START Time : Mon May 6 19:28:13 2024
FINISH Time : Mon May 6 19:31:19 2024
TOTAL Time : 186
SEE INFORMATION IN : OUT.ABACUS/
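To compare a rerun like this against the daily-test run, the per-iteration timings can be pulled out of the screen output. Below is a minimal sketch, assuming the log format shown above (rows starting with `CG<n>` followed by four numeric columns, the last being TIME(s)). The helper name `parse_scf_times` is hypothetical and not part of ABACUS.

```python
import re

# Matches SCF rows such as: "CG1 -1.852558e+04 0.000000e+00 4.452e-01 8.279e+01"
# Columns assumed from the log header: ITER ETOT(eV) EDIFF(eV) DRHO TIME(s)
SCF_ROW = re.compile(r"^\s*CG(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$")

def parse_scf_times(log_text):
    """Return a list of (iteration, time_in_seconds) found in the log text."""
    rows = []
    for line in log_text.splitlines():
        m = SCF_ROW.match(line)
        if m:
            rows.append((int(m.group(1)), float(m.group(5))))
    return rows

# First three SCF rows copied from the rerun log above
sample = """\
CG1 -1.852558e+04 0.000000e+00 4.452e-01 8.279e+01
CG2 -1.852683e+04 -1.250318e+00 1.737e-02 1.454e+01
CG3 -1.852684e+04 -9.928414e-03 2.124e-03 1.350e+01
"""
times = parse_scf_times(sample)
total = sum(t for _, t in times)
print(times)
print(f"total SCF time: {total:.2f} s")
```

Running the same function on the logs of two runs of the same commit gives a direct per-iteration comparison, which makes it easier to see whether a slowdown is spread over all iterations or confined to one.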
I have 3 more cases with a similar error, yet they ran normally when I resubmitted the jobs 2 days later. All 3 failing jobs ran on node j20r4n07, so I suspect the problem lies with that node.
This issue is caused by a machine problem and is not related to ABACUS.
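One way to check a suspicion like this is to tally failed jobs by the node they ran on. The sketch below is illustrative only: the job ids are placeholders, the record layout `(job_id, node, ok)` is an assumption, and `failures_by_node` is a hypothetical helper, not part of abacustest or any scheduler.

```python
from collections import Counter

def failures_by_node(jobs):
    """Count failed jobs per node.

    `jobs` is an iterable of (job_id, node, ok) tuples (assumed layout).
    """
    return Counter(node for _, node, ok in jobs if not ok)

# Placeholder records: three failures on j20r4n07, one success on b01r4n18
jobs = [
    ("job-a", "j20r4n07", False),
    ("job-b", "j20r4n07", False),
    ("job-c", "j20r4n07", False),
    ("job-d", "b01r4n18", True),
]
counts = failures_by_node(jobs)
print(counts.most_common())
```

If every failure lands on a single node while the same commit passes elsewhere, that points to the machine rather than the code.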
Describe the bug
In the DCU daily test of 0427, one example (005) fails with the error below before SCF:
The job stopped at:
job address: https://app.bohrium.dp.tech/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fsched-abacustest-dcu-cg-e4fd08
Expected behavior
No response
To Reproduce
No response
Environment
No response
Additional Context
No response