deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0

Unstable LCAO calculation of 004_Li128C75H100O75 #2997

Open pxlxingliang opened 1 year ago

pxlxingliang commented 1 year ago

Describe the bug

The LCAO calculation of daily test 004_Li128C75H100O75 is unstable. 004_Li128C75H100O75.zip

In the current version (20230921, develop branch), the Intel-compiled calculation is stable, but the GNU-compiled calculation is unstable.

Details of some testing will be updated below.

Expected behavior

No response

To Reproduce

No response

Environment

No response

Additional Context

No response

Task list for Issue attackers

pxlxingliang commented 1 year ago

Testing on 20230907: Using the 20230907 develop code, I tested runs on different Bohrium machines and with different OMP and MPI parallelization strategies. The image is "registry.dp.tech/deepmodeling/abacus-intel:latest", which uses the Intel compiler. There are 6 different test configurations, and each was run 10 times:

  1. c32_m128_cpu/omp16 mpi1
  2. c32_m128_cpu/omp1 mpi16
  3. c32_m256_cpu/omp16 mpi1
  4. c64_m256_cpu/omp32 mpi1
  5. c32_m128_cpu/omp8 mpi1
  6. c64_m256_cpu/omp16 mpi1

Here c32_m128_cpu means 32 hyper-threads (16 physical cores) and 128 GB of memory, and omp16 mpi1 means running with OMP_NUM_THREADS=16 and mpirun -np 1 (a small sketch for checking the configuration in effect is shown below).
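To make the configurations easier to verify, below is a minimal stand-alone sketch (not part of ABACUS; the file name check_parallel.cpp is illustrative only) that prints the MPI rank count and OpenMP thread count actually in effect. Build it with an MPI C++ compiler with OpenMP enabled, and launch it with the same settings as the ABACUS job.

```cpp
// check_parallel.cpp: print the MPI/OpenMP configuration actually in effect,
// e.g. to confirm that "omp16 mpi1" really means 16 threads on a single rank.
#include <cstdio>
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, nranks = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // omp_get_max_threads() reports the number of threads a parallel region
    // would use, i.e. the value taken from OMP_NUM_THREADS.
    const int nthreads = omp_get_max_threads();

    if (rank == 0)
    {
        std::printf("MPI ranks: %d, OpenMP threads per rank: %d\n", nranks, nthreads);
    }

    MPI_Finalize();
    return 0;
}
```

For configuration 1 above, for example, it should report 1 rank and 16 threads when launched as OMP_NUM_THREADS=16 mpirun -np 1 ./check_parallel.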

Detail results can be found in: https://labs.dp.tech/projects/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.23-32a3fd

drho of each run: [image attachment]

As we can see, only the runs parallelized with MPI give stable results, and this is independent of the machine type.

pxlxingliang commented 1 year ago

Testing on 9-11: Set fp-model to strict when compiling with Intel. Machine: c32_m128_cpu. Image: registry.dp.tech/deepmodeling/abacus-intel:latest. drho: [image attachment]

Details: https://labs.dp.tech/projects/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-004test-abacustest-v0.3.29-797207

The 10 ELPA runs are stable, while the ScaLAPACK runs are unstable.

pxlxingliang commented 1 year ago

9-13: @jinzx10 modified the settings of orfac and the work arrays in the ScaLAPACK calculation (@jinzx10, please supply the detailed modification). The ScaLAPACK calculation is now much more stable. I tested on 10 c32_m128_cpu machines; the drho shows two types of results, and the same holds for ELPA. [image attachment]

https://labs.dp.tech/projects/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.31-abf0eb

Checked with Bohrium: the 10 ELPA jobs used the same machine type, ecs.u1-c1m4.8xlarge (Ali), but we are not sure the CPUs are exactly the same.

I also ran 10 jobs on a single machine, and both ELPA and ScaLAPACK show only one type of result. [image attachment] https://labs.dp.tech/projects/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.31-0b15d3

Collaborating with Bohrium, we ran 10 jobs on 10 CPUs of type "Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz"; both ELPA and ScaLAPACK show only one type of result. [image attachment] https://labs.dp.tech/projects/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.31-494c1f

Conclusion: With the modified settings of orfac and the work arrays in the ScaLAPACK calculation, ELPA and ScaLAPACK give two types of results across machines, but each is stable on the same CPU model (at least on Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz).

pxlxingliang commented 1 year ago

9-13 testing with the GNU compiler: Both ELPA and ScaLAPACK are unstable with the GNU compiler. Only MPI parallelization was tested (the OMP parallel performance with GNU is poor). [image attachment] https://labs.dp.tech/projects/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.30-b93f76

pxlxingliang commented 1 year ago

@jinzx10 @LiuXiaohui123321 The testing on 004_Li128C75H100O75 is collected here. If you have other test results for this example, please also post them here.

jinzx10 commented 1 year ago

The changes I made to the scalapack diagonalization interface (pdsygvx & pzhegvx in module_hsolver/diago_blas.cpp) include orfac and a few work-space-related parameters:

diff --git a/source/module_hsolver/diago_blas.cpp b/source/module_hsolver/diago_blas.cpp
index 489be9f09..8c725857c 100644
--- a/source/module_hsolver/diago_blas.cpp
+++ b/source/module_hsolver/diago_blas.cpp
@@ -62,7 +62,7 @@ std::pair<int, std::vector<int>> DiagoBlas::pdsygvx_once(const int *const desc,
     const int itype = 1, il = 1, iu = GlobalV::NBANDS, one = 1;
     int M = 0, NZ = 0, lwork = -1, liwork = -1, info = 0;
     double vl = 0, vu = 0;
-    const double abstol = 0, orfac = -1;
+    const double abstol = 0, orfac = 0.01;
     std::vector<double> work(3, 0);
     std::vector<int> iwork(1, 0);
     std::vector<int> ifail(GlobalV::NLOCAL, 0);
@@ -109,6 +109,10 @@ std::pair<int, std::vector<int>> DiagoBlas::pdsygvx_once(const int *const desc,
                                  + ModuleBase::GlobalFunc::TO_STRING(__LINE__));

     // GlobalV::ofs_running<<"lwork="<<work[0]<<"\t"<<"liwork="<<iwork[0]<<std::endl;
+
+    work[0] *= 10;
+    iwork[0] *= 10;
+
     lwork = work[0];
     work.resize(std::max(lwork,3), 0);
     liwork = iwork[0];
@@ -184,7 +188,7 @@ std::pair<int, std::vector<int>> DiagoBlas::pzhegvx_once(const int *const desc,
     const char jobz = 'V', range = 'I', uplo = 'U';
     const int itype = 1, il = 1, iu = GlobalV::NBANDS, one = 1;
     int M = 0, NZ = 0, lwork = -1, lrwork = -1, liwork = -1, info = 0;
-    const double abstol = 0, orfac = -1;
+    const double abstol = 0, orfac = 0.01;
     //Note: pzhegvx_ has a bug
     //      We must give vl,vu a value, although we do not use range 'V'
     //      We must give rwork at least a memory of sizeof(double) * 3
@@ -238,6 +242,12 @@ std::pair<int, std::vector<int>> DiagoBlas::pzhegvx_once(const int *const desc,
                                  + ModuleBase::GlobalFunc::TO_STRING(__LINE__));

     // GlobalV::ofs_running<<"lwork="<<work[0]<<"\t"<<"lrwork="<<rwork[0]<<"\t"<<"liwork="<<iwork[0]<<std::endl;
+
+    work[0] *= 10.0;
+    iwork[0] *= 10;
+    rwork[0] *= 10;
+
+
     lwork = work[0].real();
     work.resize(lwork, 0);
     lrwork = rwork[0] + this->degeneracy_max * GlobalV::NLOCAL;
@@ -402,4 +412,4 @@ void DiagoBlas::post_processing(const int info, const std::vector<int> &vec)
     }
 }

According to the source file (https://netlib.org/scalapack/explore-html/d7/dff/pzhegvx_8f_source.html), orthogonality of eigenvectors could be an issue if there are many eigenvectors with close eigenvalues. pzhegvx does provide ways to guarantee orthogonality, but it's very tricky and depends on a few parameters.

orfac is the threshold used to determine which eigenvectors are considered close enough to need reorthogonalization. The default is 1e-3, which I changed to 1e-2 in the above test with the modified ScaLAPACK settings. The work-space arrays should also grow accordingly. In the test I simply increased them by a factor of 10, which is probably not optimal and could be improved.
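For readers unfamiliar with the pattern, here is the two-call workspace-query idiom the patch relies on, as a minimal sketch; dsyev_ is used only as a stand-in for the pdsygvx_/pzhegvx_ calls in diago_blas.cpp, and the factor of 10 mirrors the change above.

```cpp
// Sketch of the workspace-query idiom (dsyev_ as a stand-in for the
// pdsygvx_/pzhegvx_ calls actually used in diago_blas.cpp).
#include <vector>

extern "C" void dsyev_(const char* jobz, const char* uplo, const int* n,
                       double* a, const int* lda, double* w,
                       double* work, const int* lwork, int* info);

void diag_with_query(std::vector<double>& a, std::vector<double>& w, const int n)
{
    const char jobz = 'V', uplo = 'U';
    int info = 0;

    // First call with lwork = -1: no diagonalization is performed, the routine
    // only writes its recommended workspace size into work[0].
    int lwork = -1;
    std::vector<double> work(3, 0.0);
    dsyev_(&jobz, &uplo, &n, a.data(), &n, w.data(), work.data(), &lwork, &info);

    // The patch enlarges the queried sizes by a factor of 10: for the
    // p?sygvx/p?hegvx expert drivers, the queried value is not guaranteed to
    // be large enough to reorthogonalize all clusters of close eigenvalues.
    lwork = static_cast<int>(work[0]) * 10;
    work.resize(lwork, 0.0);

    // Second call: the actual diagonalization with the enlarged workspace.
    dsyev_(&jobz, &uplo, &n, a.data(), &n, w.data(), work.data(), &lwork, &info);
}
```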

QuantumMisaka commented 1 year ago

Was the Intel-compiled ABACUS built with icpx? In my test, icpc-compiled ABACUS is also unstable (but the problem could also be in the toolchain build method).

jinzx10 commented 1 year ago

As a side note, I notice that ABACUS always solves the eigenvalue equation in the basis of all orbitals. Many quantum chemistry packages using Gaussian basis sets perform an extra canonical orthogonalization to "project out" basis orbitals that are almost linearly dependent; the resulting eigenvalue equations are usually more stable. Some explanation can be found in the Q-Chem manual (https://manual.q-chem.com/latest/sec_Basis_Customization.html) or Szabo & Ostlund's book (ch. 3.4.5). I'm not sure whether numerical atomic orbitals should use a similar strategy (and it inevitably complicates the code for MPI parallelization, where matrices are stored in a block-cyclic format).
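For concreteness, below is a minimal serial sketch of canonical orthogonalization, assuming dense matrices and LAPACKE; the function name and the threshold are illustrative only, and the block-cyclic storage mentioned above is ignored.

```cpp
// Canonical orthogonalization (Szabo & Ostlund, Sec. 3.4.5), serial sketch.
// Diagonalize the overlap S = U s U^T, drop eigenvectors whose eigenvalues
// fall below `thr`, and build X = U_kept * diag(1/sqrt(s_kept)).  The
// generalized problem H C = S C E then becomes an ordinary eigenproblem for
// H' = X^T H X in a smaller, well-conditioned basis.
#include <cmath>
#include <vector>
#include <lapacke.h>

std::vector<double> canonical_orthogonalization(std::vector<double> S, const int n,
                                                const double thr, int& n_kept)
{
    std::vector<double> s(n, 0.0);
    // LAPACKE_dsyev overwrites S with the eigenvectors (stored column-wise).
    LAPACKE_dsyev(LAPACK_COL_MAJOR, 'V', 'U', n, S.data(), n, s.data());

    std::vector<double> X;  // n x n_kept, column-major
    n_kept = 0;
    for (int j = 0; j < n; ++j)
    {
        if (s[j] > thr)  // keep only directions that are not nearly linearly dependent
        {
            for (int i = 0; i < n; ++i)
            {
                X.push_back(S[j * n + i] / std::sqrt(s[j]));
            }
            ++n_kept;
        }
    }
    return X;
}
```

After solving the reduced problem H' C' = C' E, the eigenvectors in the original basis are recovered as C = X C'.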

jinzx10 commented 1 year ago

Was the Intel-compiled ABACUS built with icpx? In my test, icpc-compiled ABACUS is also unstable (but the problem could also be in the toolchain build method).

I think it depends on some environment variables like CXX or I_MPI_CXX. An environment set up by the current Dockerfile.intel would use icpx.

QuantumMisaka commented 1 year ago

As a side note, I notice that ABACUS always solves the eigenvalue equation in the basis of all orbitals. Many quantum chemistry packages using Gaussian basis sets perform an extra canonical orthogonalization to "project out" basis orbitals that are almost linearly dependent; the resulting eigenvalue equations are usually more stable. Some explanation can be found in the Q-Chem manual (https://manual.q-chem.com/latest/sec_Basis_Customization.html) or Szabo & Ostlund's book (ch. 3.4.5). I'm not sure whether numerical atomic orbitals should use a similar strategy (and it inevitably complicates the code for MPI parallelization, where matrices are stored in a block-cyclic format).

To my recollection, ABACUS LCAO solves the generalized eigenvalue equation directly (I don't know whether this is true), but it is certain that most quantum chemistry software handles it via canonical orthogonalization or another orthogonalization method. I have been wondering for a while which approach is better.
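For reference, this is what the generalized drivers do internally (standard ScaLAPACK behavior for itype = 1, not anything specific to ABACUS): pdsygvx/pzhegvx Cholesky-factor the overlap as S = L L^T, reduce H C = S C E to the standard problem (L^{-1} H L^{-T}) (L^T C) = (L^T C) E, and back-transform the eigenvectors. When S is nearly singular because of almost linearly dependent orbitals, applying L^{-1} amplifies numerical noise, which is exactly the situation canonical orthogonalization avoids by discarding the near-null space of S first.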

hongriTianqi commented 1 year ago

This is related to the input parameters and the device.

WHUweiqingzhou commented 1 year ago

Hi all, recently I have been doing some tests on charge mixing, and I also noticed the unstable LCAO calculation of 004_Li128C75H100O75. I ran the calculation 10 times for each mixing method (ABACUS test Link). Interestingly, I found the calculations are relatively stable with Broyden mixing, while unstable cases occur with Pulay mixing. To double-check, I ran an extra set of calculations (20 times) and found the Broyden calculations are also unstable (ABACUS test Link). [image attachments]

WHUweiqingzhou commented 6 months ago

@pxlxingliang, is this case still unstable now? Can we close this issue?

pxlxingliang commented 6 months ago

I used the latest Intel/GNU images with ks_solver genelpa and scalapack_gvx to run this example 10 times. ELPA is stable in both the Intel and GNU images:

intel: https://app.bohrium.dp.tech/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.109-e84632
gnu: https://app.bohrium.dp.tech/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.109-ddb135

[image attachment]

For ks_solver scalapack_gvx, the drho has a large fluctuation:

intel: https://app.bohrium.dp.tech/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.109-2661e3
gnu: https://app.bohrium.dp.tech/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.109-48dbc5

[image attachment]

pxlxingliang commented 6 months ago

I have tested the ScaLAPACK method on @jinzx10's commit with Intel- and GNU-compiled ABACUS. The results of 10 runs with Intel are stable, while the GNU results are unstable.

intel: https://app.bohrium.dp.tech/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.113-802aa8

[image attachment]

gnu: https://app.bohrium.dp.tech/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fjob-abacustest-v0.3.113-c7c692

[image attachment]