deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0

Unstable LCAO calculation of 004_Li128C75H100O75 #2997

Open pxlxingliang opened 1 year ago

pxlxingliang commented 1 year ago

Describe the bug

The LCAO calculation of daily test 004_Li128C75H100O75 is unstable. 004_Li128C75H100O75.zip

In the current version (20230921, develop branch), the Intel-compiled calculation is stable, but the GNU-compiled calculation is unstable.

Details of some testing will be updated below.

Expected behavior

No response

To Reproduce

No response

Environment

No response

Additional Context

No response

Task list for Issue attackers

pxlxingliang commented 1 year ago

Testing on 20230907: Using the 20230907 develop code, I tested runs on different Bohrium machines and with different OMP and MPI parallelization strategies. The image is "registry.dp.tech/deepmodeling/abacus-intel:latest", which uses the Intel compiler. There are 6 different test configurations, and each was run 10 times:

  1. c32_m128_cpu/omp16 mpi1
  2. c32_m128_cpu/omp1 mpi16
  3. c32_m256_cpu/omp16 mpi1
  4. c64_m256_cpu/omp32 mpi1
  5. c32_m128_cpu/omp8 mpi1
  6. c64_m256_cpu/omp16 mpi1

Here c32_m128_cpu means 32 hyper-threads (16 physical cores) and 128 GB of memory, and omp16 mpi1 means running with OMP_NUM_THREADS=16 and mpirun -np 1 (a small sketch for checking the configuration in effect is shown below).
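To make the configurations easier to verify, below is a minimal stand-alone sketch (not part of ABACUS; the file name check_parallel.cpp is illustrative only) that prints the MPI rank count and OpenMP thread count actually in effect. Build it with an MPI C++ compiler with OpenMP enabled, and launch it with the same settings as the ABACUS job.

```cpp
// check_parallel.cpp: print the MPI/OpenMP configuration actually in effect,
// e.g. to confirm that "omp16 mpi1" really means 16 threads on a single rank.
#include <cstdio>
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, nranks = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // omp_get_max_threads() reports the number of threads a parallel region
    // would use, i.e. the value taken from OMP_NUM_THREADS.
    const int nthreads = omp_get_max_threads();

    if (rank == 0)
    {
        std::printf("MPI ranks: %d, OpenMP threads per rank: %d\n", nranks, nthreads);
    }

    MPI_Finalize();
    return 0;
}
```

For configuration 1 above, for example, it should report 1 rank and 16 threads when launched as OMP_NUM_THREADS=16 mpirun -np 1 ./check_parallel.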

Detail results can be found in: https://labs.dp.tech/projects/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.23-32a3fd

drho of each run: [image attachment]

As we can see, only the runs parallelized with MPI give stable results, and this is independent of the machine type.

pxlxingliang commented 1 year ago

Testing on 9-11: Set fp-model to strict when compiling with Intel. Machine: c32_m128_cpu. Image: registry.dp.tech/deepmodeling/abacus-intel:latest. drho: [image attachment]

Details: https://labs.dp.tech/projects/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-004test-abacustest-v0.3.29-797207

The 10 ELPA runs are stable, while the ScaLAPACK runs are unstable.

pxlxingliang commented 1 year ago

9-13: @jinzx10 modified the settings of orfac and the work arrays in the ScaLAPACK calculation (@jinzx10, please supply the detailed modification). The ScaLAPACK calculation is now much more stable. I tested on 10 c32_m128_cpu machines; the drho shows two types of results, and the same holds for ELPA. [image attachment]

https://labs.dp.tech/projects/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.31-abf0eb

Checked with Bohrium: the 10 ELPA jobs used the same machine type, ecs.u1-c1m4.8xlarge (Ali), but we are not sure the CPUs are exactly the same.

I also ran 10 jobs on a single machine, and both ELPA and ScaLAPACK show only one type of result. [image attachment] https://labs.dp.tech/projects/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.31-0b15d3

Collaborating with Bohrium, we ran 10 jobs on 10 CPUs of type "Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz"; both ELPA and ScaLAPACK show only one type of result. [image attachment] https://labs.dp.tech/projects/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.31-494c1f

Conclusion: With the modified settings of orfac and the work arrays in the ScaLAPACK calculation, ELPA and ScaLAPACK give two types of results across machines, but each is stable on the same CPU model (at least on Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz).

pxlxingliang commented 1 year ago

9-13 testing with the GNU compiler: Both ELPA and ScaLAPACK are unstable with the GNU compiler. Only MPI parallelization was tested (the OMP parallel performance with GNU is poor). [image attachment] https://labs.dp.tech/projects/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.30-b93f76

pxlxingliang commented 1 year ago

@jinzx10 @LiuXiaohui123321 The testing on 004_Li128C75H100O75 is collected here. If you have other test results for this example, please also post them here.

jinzx10 commented 1 year ago

The changes I made to the scalapack diagonalization interface (pdsygvx & pzhegvx in module_hsolver/diago_blas.cpp) include orfac and a few work-space-related parameters:

diff --git a/source/module_hsolver/diago_blas.cpp b/source/module_hsolver/diago_blas.cpp
index 489be9f09..8c725857c 100644
--- a/source/module_hsolver/diago_blas.cpp
+++ b/source/module_hsolver/diago_blas.cpp
@@ -62,7 +62,7 @@ std::pair<int, std::vector<int>> DiagoBlas::pdsygvx_once(const int *const desc,
     const int itype = 1, il = 1, iu = GlobalV::NBANDS, one = 1;
     int M = 0, NZ = 0, lwork = -1, liwork = -1, info = 0;
     double vl = 0, vu = 0;
-    const double abstol = 0, orfac = -1;
+    const double abstol = 0, orfac = 0.01;
     std::vector<double> work(3, 0);
     std::vector<int> iwork(1, 0);
     std::vector<int> ifail(GlobalV::NLOCAL, 0);
@@ -109,6 +109,10 @@ std::pair<int, std::vector<int>> DiagoBlas::pdsygvx_once(const int *const desc,
                                  + ModuleBase::GlobalFunc::TO_STRING(__LINE__));

     // GlobalV::ofs_running<<"lwork="<<work[0]<<"\t"<<"liwork="<<iwork[0]<<std::endl;
+
+    work[0] *= 10;
+    iwork[0] *= 10;
+
     lwork = work[0];
     work.resize(std::max(lwork,3), 0);
     liwork = iwork[0];
@@ -184,7 +188,7 @@ std::pair<int, std::vector<int>> DiagoBlas::pzhegvx_once(const int *const desc,
     const char jobz = 'V', range = 'I', uplo = 'U';
     const int itype = 1, il = 1, iu = GlobalV::NBANDS, one = 1;
     int M = 0, NZ = 0, lwork = -1, lrwork = -1, liwork = -1, info = 0;
-    const double abstol = 0, orfac = -1;
+    const double abstol = 0, orfac = 0.01;
     //Note: pzhegvx_ has a bug
     //      We must give vl,vu a value, although we do not use range 'V'
     //      We must give rwork at least a memory of sizeof(double) * 3
@@ -238,6 +242,12 @@ std::pair<int, std::vector<int>> DiagoBlas::pzhegvx_once(const int *const desc,
                                  + ModuleBase::GlobalFunc::TO_STRING(__LINE__));

     // GlobalV::ofs_running<<"lwork="<<work[0]<<"\t"<<"lrwork="<<rwork[0]<<"\t"<<"liwork="<<iwork[0]<<std::endl;
+
+    work[0] *= 10.0;
+    iwork[0] *= 10;
+    rwork[0] *= 10;
+
+
     lwork = work[0].real();
     work.resize(lwork, 0);
     lrwork = rwork[0] + this->degeneracy_max * GlobalV::NLOCAL;
@@ -402,4 +412,4 @@ void DiagoBlas::post_processing(const int info, const std::vector<int> &vec)
     }
 }

According to the source file (https://netlib.org/scalapack/explore-html/d7/dff/pzhegvx_8f_source.html), orthogonality of eigenvectors could be an issue if there are many eigenvectors with close eigenvalues. pzhegvx does provide ways to guarantee orthogonality, but it's very tricky and depends on a few parameters.

orfac is the threshold used to determine which eigenvectors are considered close enough to need reorthogonalization. The default is 1e-3, which I changed to 1e-2 in the above test with the modified ScaLAPACK settings. The work-space arrays should also grow accordingly. In the test I simply increased them by a factor of 10, which is probably not optimal and could be improved.
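For readers unfamiliar with the pattern, here is the two-call workspace-query idiom the patch relies on, as a minimal sketch; dsyev_ is used only as a stand-in for the pdsygvx_/pzhegvx_ calls in diago_blas.cpp, and the factor of 10 mirrors the change above.

```cpp
// Sketch of the workspace-query idiom (dsyev_ as a stand-in for the
// pdsygvx_/pzhegvx_ calls actually used in diago_blas.cpp).
#include <vector>

extern "C" void dsyev_(const char* jobz, const char* uplo, const int* n,
                       double* a, const int* lda, double* w,
                       double* work, const int* lwork, int* info);

void diag_with_query(std::vector<double>& a, std::vector<double>& w, const int n)
{
    const char jobz = 'V', uplo = 'U';
    int info = 0;

    // First call with lwork = -1: no diagonalization is performed, the routine
    // only writes its recommended workspace size into work[0].
    int lwork = -1;
    std::vector<double> work(3, 0.0);
    dsyev_(&jobz, &uplo, &n, a.data(), &n, w.data(), work.data(), &lwork, &info);

    // The patch enlarges the queried sizes by a factor of 10: for the
    // p?sygvx/p?hegvx expert drivers, the queried value is not guaranteed to
    // be large enough to reorthogonalize all clusters of close eigenvalues.
    lwork = static_cast<int>(work[0]) * 10;
    work.resize(lwork, 0.0);

    // Second call: the actual diagonalization with the enlarged workspace.
    dsyev_(&jobz, &uplo, &n, a.data(), &n, w.data(), work.data(), &lwork, &info);
}
```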

QuantumMisaka commented 1 year ago

Was the Intel-compiled ABACUS built with icpx? In my test, icpc-compiled ABACUS is also unstable (but the problem could also be in the toolchain build method).

jinzx10 commented 1 year ago

As a side note, I notice that ABACUS always solves the eigenvalue equation in the basis of all orbitals. Many quantum chemistry packages using Gaussian basis sets perform an extra canonical orthogonalization to "project out" basis orbitals that are almost linearly dependent; the resulting eigenvalue equations are usually more stable. Some explanation can be found in the Q-Chem manual (https://manual.q-chem.com/latest/sec_Basis_Customization.html) or Szabo & Ostlund's book (ch. 3.4.5). I'm not sure whether numerical atomic orbitals should use a similar strategy (and it inevitably complicates the code for MPI parallelization, where matrices are stored in a block-cyclic format).
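For concreteness, below is a minimal serial sketch of canonical orthogonalization, assuming dense matrices and LAPACKE; the function name and the threshold are illustrative only, and the block-cyclic storage mentioned above is ignored.

```cpp
// Canonical orthogonalization (Szabo & Ostlund, Sec. 3.4.5), serial sketch.
// Diagonalize the overlap S = U s U^T, drop eigenvectors whose eigenvalues
// fall below `thr`, and build X = U_kept * diag(1/sqrt(s_kept)).  The
// generalized problem H C = S C E then becomes an ordinary eigenproblem for
// H' = X^T H X in a smaller, well-conditioned basis.
#include <cmath>
#include <vector>
#include <lapacke.h>

std::vector<double> canonical_orthogonalization(std::vector<double> S, const int n,
                                                const double thr, int& n_kept)
{
    std::vector<double> s(n, 0.0);
    // LAPACKE_dsyev overwrites S with the eigenvectors (stored column-wise).
    LAPACKE_dsyev(LAPACK_COL_MAJOR, 'V', 'U', n, S.data(), n, s.data());

    std::vector<double> X;  // n x n_kept, column-major
    n_kept = 0;
    for (int j = 0; j < n; ++j)
    {
        if (s[j] > thr)  // keep only directions that are not nearly linearly dependent
        {
            for (int i = 0; i < n; ++i)
            {
                X.push_back(S[j * n + i] / std::sqrt(s[j]));
            }
            ++n_kept;
        }
    }
    return X;
}
```

After solving the reduced problem H' C' = C' E, the eigenvectors in the original basis are recovered as C = X C'.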

jinzx10 commented 1 year ago

Was the Intel-compiled ABACUS built with icpx? In my test, icpc-compiled ABACUS is also unstable (but the problem could also be in the toolchain build method).

I think it depends on some environment variables like CXX or I_MPI_CXX. An environment set up by the current Dockerfile.intel would use icpx.

QuantumMisaka commented 1 year ago

As a side note, I notice that ABACUS always solves the eigenvalue equation in the basis of all orbitals. Many quantum chemistry packages using Gaussian basis sets perform an extra canonical orthogonalization to "project out" basis orbitals that are almost linearly dependent; the resulting eigenvalue equations are usually more stable. Some explanation can be found in the Q-Chem manual (https://manual.q-chem.com/latest/sec_Basis_Customization.html) or Szabo & Ostlund's book (ch. 3.4.5). I'm not sure whether numerical atomic orbitals should use a similar strategy (and it inevitably complicates the code for MPI parallelization, where matrices are stored in a block-cyclic format).

To my recollection, ABACUS LCAO solves the generalized eigenvalue equation directly (I don't know whether this is true), but it is certain that most quantum chemistry software handles it via canonical orthogonalization or another orthogonalization method. I have been wondering for a while which approach is better.
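For reference, this is what the generalized drivers do internally (standard ScaLAPACK behavior for itype = 1, not anything specific to ABACUS): pdsygvx/pzhegvx Cholesky-factor the overlap as S = L L^T, reduce H C = S C E to the standard problem (L^{-1} H L^{-T}) (L^T C) = (L^T C) E, and back-transform the eigenvectors. When S is nearly singular because of almost linearly dependent orbitals, applying L^{-1} amplifies numerical noise, which is exactly the situation canonical orthogonalization avoids by discarding the near-null space of S first.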

hongriTianqi commented 1 year ago

This is related to the input parameters and the device.

WHUweiqingzhou commented 1 year ago

Hi all, recently I have been doing some tests on charge mixing, and I also noticed the unstable LCAO calculation of 004_Li128C75H100O75. I ran the calculation 10 times for each mixing method (ABACUS test Link). Interestingly, I found the calculations are relatively stable with Broyden mixing, while unstable cases occur with Pulay mixing. To double-check, I ran an extra set of calculations (20 times) and found the Broyden calculations are also unstable (ABACUS test Link). [image attachments]

WHUweiqingzhou commented 6 months ago

@pxlxingliang, is this case still unstable now? Can we close this issue?

pxlxingliang commented 6 months ago

I used the latest Intel/GNU images with ks_solver genelpa and scalapack_gvx to run this example 10 times. ELPA is stable in both the Intel and GNU images:

intel: https://app.bohrium.dp.tech/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.109-e84632
gnu: https://app.bohrium.dp.tech/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.109-ddb135

[image attachment]

For ks_solver scalapack_gvx, the drho has a large fluctuation:

intel: https://app.bohrium.dp.tech/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.109-2661e3
gnu: https://app.bohrium.dp.tech/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.109-48dbc5

[image attachment]

pxlxingliang commented 6 months ago

I have tested the ScaLAPACK method on @jinzx10's commit with Intel- and GNU-compiled ABACUS. The results of 10 runs with Intel are stable, while the GNU results are unstable.

intel: https://app.bohrium.dp.tech/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.113-802aa8

[image attachment]

gnu: https://app.bohrium.dp.tech/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.113-c7c692

[image attachment]