deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0

Bugs: the results of different parallel schemes vary greatly for LCAO calculations #4122

Open WHUweiqingzhou opened 6 months ago

WHUweiqingzhou commented 6 months ago

Describe the bug

While testing issue #4058, I found that the results differ substantially between parallel settings for the same INPUT:

OMP_NUM_THREADS=1 mpirun -np 16 abacus | tee out.log
OMP_NUM_THREADS=2 mpirun -np 16 abacus | tee out.log
OMP_NUM_THREADS=2 mpirun -np 8 abacus | tee out.log
OMP_NUM_THREADS=4 mpirun -np 4 abacus | tee out.log
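
If the four commands above are run in the same directory, each one overwrites the previous out.log. Below is a minimal Python sketch that runs the same four settings and keeps one log per run. It assumes `mpirun` and `abacus` are on the PATH and that the INPUT files are in the current directory. The helper names `build_run` and `run_all` and the log-file naming scheme are illustrative only and are not part of ABACUS.

```python
import os
import subprocess

# (OMP_NUM_THREADS, MPI ranks) pairs copied from the commands above.
SETTINGS = [(1, 16), (2, 16), (2, 8), (4, 4)]


def build_run(omp_threads, mpi_ranks):
    """Return (command, environment overrides, log file name) for one run."""
    cmd = ["mpirun", "-np", str(mpi_ranks), "abacus"]
    env = {"OMP_NUM_THREADS": str(omp_threads)}
    # Hypothetical naming scheme so that runs do not overwrite each other.
    log = f"out_omp{omp_threads}_np{mpi_ranks}.log"
    return cmd, env, log


def run_all(settings=SETTINGS):
    """Run every setting in sequence, writing stdout and stderr to its own log."""
    for omp_threads, mpi_ranks in settings:
        cmd, env, log = build_run(omp_threads, mpi_ranks)
        with open(log, "w") as fh:
            subprocess.run(
                cmd,
                env={**os.environ, **env},  # keep the parent environment, override the thread count
                stdout=fh,
                stderr=subprocess.STDOUT,
                check=True,
            )
```

Calling `run_all()` launches the four runs one after another; the resulting logs can then be compared side by side.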

image

See the link for more.

WHUweiqingzhou commented 6 months ago

@dyzheng, I also ran tests with the GNU image. The calculations are still unstable, though better than with the Intel image: image

For the unconverged INPUT, however, the calculations are even more unstable: image

See more in link.

WHUweiqingzhou commented 6 months ago

As for different versions, see the link.

For v3.3.2, the results of STRU1 and STRU2 are different:

image

For v3.4.0, the results of STRU1 and STRU2 with different MPI settings are almost the same:

image

For v3.5.0, the results of STRU1 and STRU2 with different MPI settings are different:

image

For v3.6.0, the result is the same as for v3.5.0:

image

It looks like v3.4.0 behaves well, so something changed between v3.4.0 and v3.5.0.

WHUweiqingzhou commented 6 months ago

I picked several commits to test; see the link.

For 38766b4a, 2023/9/28: image

For 2ffa3d4e, 2023/10/9. It looks like drho changes from this commit on: image

For 77f178d0, 2023/10/26: image

For 57c903ae, 2023/11/03: image

For fd76546b, 2023/11/23: image

@Qianruipku, could you have a look?

WHUweiqingzhou commented 6 months ago

I tried commit a5abaea0, which is just before 2ffa3d4: image

I confirm that this change happens at 2ffa3d4; see the link.
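
The manual search over commits described above is a bisection. Below is a minimal Python sketch of it, assuming the commits are ordered oldest to newest and that the behaviour flips exactly once. The `CHANGED` set is a placeholder for actually building and running each commit. It encodes only what this thread reports (a5abaea0 unchanged, 2ffa3d4e changed), plus the assumption that 38766b4a shows the old behaviour and that every later commit keeps the new one.

```python
def first_changed(commits, is_changed):
    """Return the first commit for which is_changed(commit) is True.

    Assumes commits[0] is known to be unchanged, commits[-1] is known to be
    changed, and the behaviour flips exactly once in between.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] unchanged, commits[hi] changed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_changed(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Commits tested in this thread, oldest first (the position of a5abaea0 is
# taken from "just before 2ffa3d4").
COMMITS = ["38766b4a", "a5abaea0", "2ffa3d4e", "77f178d0", "57c903ae", "fd76546b"]

# Placeholder for a real test: an assumed set of commits with the new drho behaviour.
CHANGED = {"2ffa3d4e", "77f178d0", "57c903ae", "fd76546b"}

culprit = first_changed(COMMITS, lambda c: c in CHANGED)
```

In practice, `is_changed` would check out the commit, rebuild, run the test case, and compare the drho column against a reference run.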

WHUweiqingzhou commented 6 months ago

I tried mixing_type = pulay and mixing_ndim = 21 at a5abaea and got the result below. It looks like the old pulay (now called broyden) is not stable in this case? image

link

WHUweiqingzhou commented 6 months ago

@Qianruipku I tried mixing_gg0=0 and scf_thr_type=1 at 2ffa3d4e and found that the result is the same as the Broyden result at a5abaea0; see the link. For a5abaea0:

START CHARGE      : atomic
 DONE(1.32678    SEC) : INIT SCF
 ITER   TMAG      AMAG      ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)    
 GE1    3.13e+01  3.21e+01  -2.012837e+05  0.000000e+00   4.623e-02  2.619e+01  
 GE2    3.62e+01  3.73e+01  -2.012882e+05  -4.475781e+00  1.512e-02  2.198e+01  
 GE3    3.47e+01  3.63e+01  -2.012885e+05  -3.216326e-01  1.092e-02  2.198e+01  
 GE4    3.46e+01  3.70e+01  -2.012879e+05  5.839711e-01   1.606e-02  2.201e+01  
 GE5    3.44e+01  3.67e+01  -2.012886e+05  -6.875509e-01  2.846e-03  2.197e+01  
 GE6    3.58e+01  3.81e+01  -2.012850e+05  3.610166e+00   3.639e-02  2.200e+01  
 GE7    3.43e+01  3.68e+01  -2.012886e+05  -3.587337e+00  4.259e-03  2.197e+01  
 GE8    3.42e+01  3.68e+01  -2.012886e+05  -4.320307e-02  1.462e-03  2.201e+01  
 GE9    3.43e+01  3.68e+01  -2.012886e+05  5.644168e-02   4.966e-03  2.200e+01  
 GE10   3.42e+01  3.68e+01  -2.012886e+05  -4.094097e-02  3.658e-03  2.201e+01  
 GE11   3.41e+01  3.69e+01  -2.012887e+05  -1.928839e-02  1.539e-03  2.202e+01  
 GE12   3.41e+01  3.69e+01  -2.012887e+05  -2.543738e-03  1.574e-03  2.202e+01  
 GE13   3.41e+01  3.69e+01  -2.012887e+05  -6.717234e-03  4.667e-04  2.203e+01  
 GE14   3.41e+01  3.69e+01  -2.012887e+05  2.690787e-03   1.217e-03  2.203e+01  
 GE15   3.41e+01  3.69e+01  -2.012887e+05  -3.728993e-03  4.753e-04  2.204e+01  
 GE16   3.41e+01  3.69e+01  -2.012887e+05  -6.213324e-04  3.090e-04  2.205e+01  
 GE17   3.41e+01  3.69e+01  -2.012887e+05  1.019319e-03   6.257e-04  2.205e+01  
 GE18   3.41e+01  3.69e+01  -2.012887e+05  -1.727669e-03  3.054e-04  2.206e+01  
 GE19   3.41e+01  3.69e+01  -2.012887e+05  -2.660938e-04  1.692e-04  2.212e+01  
 GE20   3.41e+01  3.69e+01  -2.012887e+05  -4.791429e-05  1.023e-04  2.209e+01  
 GE21   3.41e+01  3.69e+01  -2.012887e+05  -2.845928e-05  1.066e-04  2.212e+01  
 GE22   3.41e+01  3.69e+01  -2.012887e+05  -4.190659e-06  7.938e-05  2.217e+01  
 GE23   3.41e+01  3.69e+01  -2.012887e+05  -1.090570e-05  5.489e-05  2.213e+01  
 GE24   3.41e+01  3.69e+01  -2.012887e+05  7.023658e-08   6.898e-05  2.212e+01  
 GE25   3.41e+01  3.68e+01  -2.012887e+05  -1.937866e-06  5.804e-05  2.213e+01  
 GE26   3.41e+01  3.68e+01  -2.012887e+05  -1.122038e-05  2.331e-05  2.214e+01  
 GE27   3.41e+01  3.68e+01  -2.012887e+05  -7.857934e-07  2.666e-05  2.216e+01  
 GE28   3.41e+01  3.68e+01  -2.012887e+05  2.993593e-07   2.932e-05  2.215e+01  
 GE29   3.41e+01  3.68e+01  -2.012887e+05  -2.109869e-06  1.792e-05  2.213e+01  
 GE30   3.41e+01  3.68e+01  -2.012887e+05  3.184652e-07   2.027e-05  2.217e+01  
 GE31   3.41e+01  3.68e+01  -2.012887e+05  1.038596e-05   6.569e-05  2.214e+01  
 GE32   3.41e+01  3.68e+01  -2.012887e+05  -1.034819e-05  2.186e-05  2.214e+01  
 GE33   3.41e+01  3.68e+01  -2.012887e+05  4.226644e-06   4.674e-05  2.217e+01  
 GE34   3.41e+01  3.68e+01  -2.012887e+05  -1.963234e-06  3.550e-05  2.217e+01  
 GE35   3.41e+01  3.68e+01  -2.012887e+05  -2.668124e-06  2.238e-05  2.217e+01  
 GE36   3.41e+01  3.68e+01  -2.012887e+05  -7.664895e-07  1.254e-05  2.217e+01  
 GE37   3.41e+01  3.68e+01  -2.012887e+05  2.685720e-07   1.899e-05  2.217e+01  
 GE38   3.41e+01  3.68e+01  -2.012887e+05  5.085099e-07   2.371e-05  2.215e+01  
 GE39   3.41e+01  3.68e+01  -2.012887e+05  -1.756162e-07  2.297e-05  2.217e+01  
 GE40   3.41e+01  3.68e+01  -2.012887e+05  -1.341152e-06  1.120e-05  2.217e+01  
 GE41   3.41e+01  3.68e+01  -2.012887e+05  4.999221e-09   7.053e-06  2.215e+01  
 GE42   3.41e+01  3.68e+01  -2.012887e+05  6.138648e-07   1.840e-05  2.215e+01  
 GE43   3.41e+01  3.68e+01  -2.012887e+05  -8.628854e-07  9.731e-06  2.217e+01  
 GE44   3.41e+01  3.68e+01  -2.012887e+05  -1.153533e-07  6.617e-06  2.218e+01  
 GE45   3.41e+01  3.68e+01  -2.012887e+05  -4.853204e-08  5.367e-06  2.218e+01  
 GE46   3.41e+01  3.68e+01  -2.012887e+05  -2.341219e-08  5.924e-06  2.220e+01  

For 2ffa3d4e:

ITER   TMAG      AMAG      ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)    
 GE1    3.13e+01  3.21e+01  -2.012837e+05  0.000000e+00   2.314e+00  2.637e+01  
 GE2    3.62e+01  3.73e+01  -2.012882e+05  -4.475781e+00  2.857e-01  2.267e+01  
 GE3    3.47e+01  3.63e+01  -2.012885e+05  -3.216326e-01  8.493e-02  2.270e+01  
 GE4    3.46e+01  3.70e+01  -2.012879e+05  5.839711e-01   5.170e+00  2.271e+01  
 GE5    3.44e+01  3.67e+01  -2.012886e+05  -6.873429e-01  1.088e-02  2.270e+01  
 GE6    3.58e+01  3.81e+01  -2.012850e+05  3.617410e+00   3.629e+02  2.273e+01  
 GE7    3.43e+01  3.68e+01  -2.012886e+05  -3.594388e+00  1.362e+00  2.271e+01  
 GE8    3.42e+01  3.68e+01  -2.012886e+05  -4.305293e-02  2.881e-02  2.273e+01  
 GE9    3.42e+01  3.68e+01  -2.012886e+05  3.112351e-02   3.512e+00  2.277e+01  
 GE10   3.42e+01  3.68e+01  -2.012886e+05  -1.827247e-02  4.005e-01  2.271e+01  
 GE11   3.41e+01  3.69e+01  -2.012887e+05  -1.670464e-02  1.139e-01  2.261e+01  
 GE12   3.41e+01  3.69e+01  -2.012887e+05  -2.902616e-03  1.525e-01  2.242e+01  
 GE13   3.41e+01  3.69e+01  -2.012887e+05  -6.588764e-03  1.453e-03  2.240e+01  
 GE14   3.41e+01  3.69e+01  -2.012887e+05  3.300182e-04   1.390e-02  2.240e+01  
 GE15   3.41e+01  3.69e+01  -2.012887e+05  4.628865e-03   8.823e-02  2.238e+01  
 GE16   3.41e+01  3.69e+01  -2.012887e+05  -6.824514e-03  5.390e-04  2.241e+01  
 GE17   3.41e+01  3.69e+01  -2.012887e+05  4.037616e-04   3.347e-03  2.226e+01  
 GE18   3.41e+01  3.69e+01  -2.012887e+05  -1.141057e-03  1.147e-03  2.222e+01  
 GE19   3.41e+01  3.69e+01  -2.012887e+05  -2.858993e-04  6.974e-05  2.222e+01  
 GE20   3.41e+01  3.69e+01  -2.012887e+05  -4.719985e-05  2.939e-05  2.225e+01  
 GE21   3.41e+01  3.69e+01  -2.012887e+05  -2.902679e-05  4.334e-05  2.225e+01  
 GE22   3.41e+01  3.69e+01  -2.012887e+05  -3.342697e-06  3.658e-05  2.226e+01  
 GE23   3.41e+01  3.69e+01  -2.012887e+05  -1.117724e-05  1.266e-05  2.224e+01  
 GE24   3.41e+01  3.69e+01  -2.012887e+05  -1.517585e-07  4.617e-05  2.225e+01  
 GE25   3.41e+01  3.68e+01  -2.012887e+05  -6.274518e-07  6.574e-05  2.228e+01  
 GE26   3.41e+01  3.68e+01  -2.012887e+05  -1.256247e-05  6.464e-06  2.228e+01  
 GE27   3.41e+01  3.68e+01  -2.012887e+05  -1.080055e-06  7.928e-06  2.229e+01  
 GE28   3.41e+01  3.68e+01  -2.012887e+05  2.439966e-07   2.441e-05  2.229e+01  
 GE29   3.41e+01  3.68e+01  -2.012887e+05  -1.845282e-06  1.931e-05  2.228e+01  
 GE30   3.41e+01  3.68e+01  -2.012887e+05  3.163369e-07   1.136e-05  2.228e+01  
 GE31   3.41e+01  3.68e+01  -2.012887e+05  4.611435e-06   5.006e-04  2.226e+01  
 GE32   3.41e+01  3.68e+01  -2.012887e+05  -4.753171e-06  7.248e-05  2.231e+01  
 GE33   3.41e+01  3.68e+01  -2.012887e+05  3.359972e-06   8.001e-05  2.231e+01  
 GE34   3.41e+01  3.68e+01  -2.012887e+05  -2.829436e-06  4.059e-05  2.228e+01  
 GE35   3.41e+01  3.68e+01  -2.012887e+05  -6.033466e-07  2.800e-05  2.229e+01  
 GE36   3.41e+01  3.68e+01  -2.012887e+05  -1.081985e-06  3.135e-06  2.230e+01  
 GE37   3.41e+01  3.68e+01  -2.012887e+05  8.878815e-07   1.789e-05  2.227e+01  
 GE38   3.41e+01  3.68e+01  -2.012887e+05  6.261401e-08   2.357e-05  2.226e+01  
 GE39   3.41e+01  3.68e+01  -2.012887e+05  -5.403119e-07  1.372e-05  2.225e+01  
 GE40   3.41e+01  3.68e+01  -2.012887e+05  -8.631824e-07  2.357e-06  2.227e+01  
 GE41   3.41e+01  3.68e+01  -2.012887e+05  5.617541e-06   4.904e-07  2.251e+01
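
The two logs above can be compared iteration by iteration with a small script. Below is a minimal Python sketch that assumes the seven-column `GE<n>` row layout shown in the logs. The function names and the two sample strings (rows GE1 and GE6 copied from each log) are illustrative only.

```python
def parse_scf_rows(text):
    """Parse 'GE<n> TMAG AMAG ETOT EDIFF DRHO TIME' rows into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Keep only iteration rows; header and banner lines are skipped.
        if len(fields) == 7 and fields[0].startswith("GE") and fields[0][2:].isdigit():
            rows.append({
                "iter": int(fields[0][2:]),
                "etot": float(fields[3]),
                "drho": float(fields[5]),
            })
    return rows


def compare_runs(rows_a, rows_b):
    """Return (iter, ETOT difference b - a, DRHO ratio b / a) for shared iterations."""
    by_iter = {r["iter"]: r for r in rows_b}
    out = []
    for a in rows_a:
        b = by_iter.get(a["iter"])
        if b is not None:
            out.append((a["iter"], b["etot"] - a["etot"], b["drho"] / a["drho"]))
    return out


# Rows GE1 and GE6 copied from the a5abaea0 log above.
LOG_A = """
 ITER   TMAG      AMAG      ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)
 GE1    3.13e+01  3.21e+01  -2.012837e+05  0.000000e+00   4.623e-02  2.619e+01
 GE6    3.58e+01  3.81e+01  -2.012850e+05  3.610166e+00   3.639e-02  2.200e+01
"""

# The same rows copied from the 2ffa3d4e log above.
LOG_B = """
 ITER   TMAG      AMAG      ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)
 GE1    3.13e+01  3.21e+01  -2.012837e+05  0.000000e+00   2.314e+00  2.637e+01
 GE6    3.58e+01  3.81e+01  -2.012850e+05  3.617410e+00   3.629e+02  2.273e+01
"""

comparison = compare_runs(parse_scf_rows(LOG_A), parse_scf_rows(LOG_B))
```

For these sample rows the printed ETOT values are identical, while the DRHO values differ by orders of magnitude. This is consistent with drho being computed differently between the two commits rather than the energies diverging in the first iterations.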

jinzx10 commented 6 months ago

I've got two questions:

  1. It was shown in #2997 that, even if the parallelization scheme is the same, LCAO calculations may still be unstable for some systems. Are the calculations in this issue stable from run to run? [We conjectured that #2997 might result from a nearly singular overlap matrix, but so far this is not confirmed and we do not have a solution in the near term.]
  2. If the calculations in this issue are stable on their own, is it possible to narrow the problem down further to MPI or OpenMP (or both)?
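
One way to approach the second question is to vary one factor at a time and look at the spread of the final energy. Below is a minimal Python sketch; the function name and the energies in `example` are hypothetical placeholders, not results from this thread. Only the four (MPI ranks, OpenMP threads) pairs are taken from the original commands.

```python
from collections import defaultdict


def spread_by_factor(results):
    """Group final energies by the factor held fixed and return max - min spreads.

    `results` maps (mpi_ranks, omp_threads) -> final total energy in eV.
    "vary_mpi" is keyed by the fixed OpenMP thread count,
    "vary_omp" is keyed by the fixed MPI rank count.
    """
    fixed_omp = defaultdict(list)
    fixed_mpi = defaultdict(list)
    for (mpi_ranks, omp_threads), energy in results.items():
        fixed_omp[omp_threads].append(energy)
        fixed_mpi[mpi_ranks].append(energy)

    def spread(groups):
        # A spread is only meaningful with at least two runs in a group.
        return {k: max(v) - min(v) for k, v in groups.items() if len(v) > 1}

    return {"vary_mpi": spread(fixed_omp), "vary_omp": spread(fixed_mpi)}


# Hypothetical energies, for illustration only.
example = {
    (16, 1): -100.0,
    (16, 2): -100.0,
    (8, 2): -100.5,
    (4, 4): -101.0,
}
report = spread_by_factor(example)
```

With the four settings from the original report, only OMP_NUM_THREADS=2 has two different MPI rank counts and only 16 ranks has two different thread counts, so a fuller grid of runs (and repeated runs of each setting, for the first question) would be needed to separate the two effects cleanly.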