deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0
164 stars 128 forks source link

Diagonalization error of the `genelpa` solver #3346

Closed hongriTianqi closed 9 months ago

hongriTianqi commented 9 months ago

Describe the bug

From 2023-11-29 to 2023-12-12, the variation of the total energy with magnetic moments was observed to be inconsistent between colliear and noncollinear calculations for the AFM state of BCC Fe, where the magnetic moments were constrained by the newly implemented DeltaSpin method (PR #3050, PR #3220).

I took many pains to debug. From the beginning, this phenomenon was found in BCC Fe, as reported in ISSUE #3272. But this system is not simple enough for debugging. Then, I designed the Fe dimer system to confirm this bug. During debugging, the total energy comparison between QE and ABACUS were done. The initialization of rho in noncollinear calculation was corrected to make the scf initialization consistent between noncollinear and collinear calculations; the reading format of magnetic moments in the noncollinear case was corrected (PR #3308).

After correcting the above two bugs, the AFM energy of Fe dimer was confirmed again to be problematic, that is, inconsistent again between noncollinear and collinear calculation. Under the suggestion of @dyzheng , the mixing parameters were all set to be zero, and mixing method to plain, and then compared the change of energy again. The inconsistency between RMS during lambda loop of DeltaSpin calculation was observed from the second step already. Then the H and S matrices were compared element by element before being fed into the elpa solver in the file module_hsolver/diago_elpa.cpp. Comparing the results between collinear and non-collinear, the H and S are exactly the same, but the eigen values are different by 1e-4 eV. While the FM state results were consistent, with eigenvalue difference under the order of 1e-8 eV.

Finally, we tried to set the ks_solver to scalapack_gvx, and got consistent energy results between collinear and noncollinear results in Fe dimer. This debug process was so long for us to realize that the genelpa solver may result in unphysical results.

Therefore, we need help on improving the genelpa solver for modeling the AFM states of materials in non-collinear calculations. Apart from this one, some other issues (#3292 and #3259, #3351) are known to be related to the performance of the genelpa solver.

Expected behavior

genelpa should give consistent results between collinear and non-collinear calculations.

To Reproduce

bug_20231212.tar.gz

After unzip the attached file, H matrix can be found here: bug_20231212/v3.10-spin2-nomixing/compare_matrix/matrix4 with size 108x108 S matrix can be found here: bug_20231212/v3.10-spin2-nomixing/compare_matrix/smat4 with size 108x108

The correct eigen values from collinear calculation are in bug_20231212/v3.10-spin2-nomixing/compare_ekb/eigen_afm from line 508

eigen[0] = -7.0496197286e+00
 eigen[1] = -6.6252172952e+00
 eigen[2] = -4.5558311362e+00
 eigen[3] = -4.5408877963e+00
 eigen[4] = -4.5408877963e+00
 eigen[5] = -4.1633494236e+00
 eigen[6] = -4.1485455227e+00
 eigen[7] = -4.1485455227e+00
 eigen[8] = -7.0683726593e-01
 eigen[9] = -6.7668985278e-01
 eigen[10] = -6.7668985278e-01
 eigen[11] = -6.4899298753e-01
 eigen[12] = -6.4899188367e-01
 eigen[13] = -5.0643617332e-01
 eigen[14] = -3.5909970854e-01
 eigen[15] = -3.5909970854e-01
 eigen[16] = -3.4807038100e-01
 eigen[17] = -3.4183145852e-01
 eigen[18] = -3.4183030586e-01
 eigen[19] = -2.3707575960e-01
 eigen[20] = -5.7572916381e-02
 eigen[21] = -5.7572916381e-02
 eigen[22] = -1.1904149574e-02
 eigen[23] = 1.7593787134e-01

The wrong eigen values from non-collinear calculation are in /data/work/debug/bug_20231212/v3.10-spin4-nomixing/compare_ekb/eigen_afm from line 500:

eigen[0] = -7.0496114863e+00
 eigen[1] = -7.0495844460e+00
 eigen[2] = -6.6252001971e+00
 eigen[3] = -6.6251967049e+00
 eigen[4] = -4.5558365185e+00
 eigen[5] = -4.5558294691e+00
 eigen[6] = -4.5408953525e+00
 eigen[7] = -4.5408910426e+00
 eigen[8] = -4.5408849783e+00
 eigen[9] = -4.5408473111e+00
 eigen[10] = -4.1633413307e+00
 eigen[11] = -4.1633064255e+00
 eigen[12] = -4.1485547190e+00
 eigen[13] = -4.1485435051e+00
 eigen[14] = -4.1485432827e+00
 eigen[15] = -4.1485312913e+00
 eigen[16] = -7.0683512292e-01
 eigen[17] = -7.0678160647e-01
 eigen[18] = -6.7669156366e-01
 eigen[19] = -6.7667865683e-01
 eigen[20] = -6.7667534076e-01
 eigen[21] = -6.7666132653e-01

Environment

No response

Additional Context

The matrix and eigen values were printed in the following way in the file module_hsolver/diago_elpa.cpp:

    for (int i=0; i< GlobalV::NLOCAL; ++i)
    {
        for (int j=0; j<GlobalV::NLOCAL; ++j)
        {
            if ((i+j)%2 != 0)
            {
                h_mat.p[i*GlobalV::NLOCAL+j] = 0.0; // here I have tried to manually set off-diagonal element of h matrix to zero
            }
        }
    }
    es.generalized_eigenvector(h_mat.p, s_mat.p, this->DecomposedState, eigen.data(), psi.get_pointer());
    ModuleBase::timer::tick("DiagoElpa", "elpa_solve");
    es.exit();

    const int inc = 1;
    BlasConnector::copy(GlobalV::NBANDS, eigen.data(), inc, eigenvalue_in, inc);
    for (int iw = 0; iw < GlobalV::NLOCAL; ++iw)
    {
        std::cout << " eigen[" << iw << "] = " << eigen[iw] << std::endl;
    }

Task list for Issue attackers (only for developers)

hongriTianqi commented 9 months ago

one can use the following lines to print out H and S matrices for use in the test of diago_lcao_test.cpp

template <typename T>
void HSolverLCAO<T>::hamiltSolvePsiK(hamilt::Hamilt<T>* hm, psi::Psi<T>& psi, double* eigenvalue)
{
    ModuleBase::TITLE("HSolverLCAO", "hamiltSolvePsiK");
    ModuleBase::timer::tick("HSolverLCAO", "hamiltSolvePsiK");
    hamilt::MatrixBlock<T> h_mat, s_mat;
    hm->matrix(h_mat, s_mat);
    std::string file_name = "check_hs";
    ModuleIO::saving_HS(0, h_mat.p, s_mat.p, false, 1, file_name, *(this->ParaV), true);
    this->pdiagh->diag(hm, psi, eigenvalue);
    ModuleBase::timer::tick("HSolverLCAO", "hamiltSolvePsiK");
}
hongriTianqi commented 8 months ago

Finally, this is not actually an issue of the elpa solver. The problem is that the H_lambda is not hermitian, resulting in error in elpa diagonalization. The scalapack uses the upper triangle matrix, and avoided the same problem as met in the Fe dimer AFM state calculation.