deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0

Diagonalization of a random H/S matrix with scalapack_gvx fails when the number of parallel cores is large. #867

Closed pxlxingliang closed 2 years ago

pxlxingliang commented 2 years ago

Describe the Bug

I generated random H and S matrices of dimension only 100 and then called Pdiag_Double::diago_double_begin() to solve for the eigenvalues, setting GlobalV::KS_SOLVER = "scalapack_gvx". When I run the solver under MPI with more than 4 cores, the function throws a message like the one below:

C++ exception with description "info = 2. /abacus/source/src_pdiag/diag_scalapack_gvx.cpp line 190. degeneracy_need = 92. degeneracy_saved = 92. " thrown in the test body.

With fewer than 4 cores the run is normal, and the eigenvalues agree with those computed by LAPACK (the maximum difference is 4.36594e-08). ELPA gives eigenvalues with similar accuracy (maximum difference 4.36557e-08).
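
For readers who want to build a comparable input, here is a minimal sketch of one way to generate a random symmetric H and a symmetric positive definite S, plus the "maximum difference" metric quoted above. It is not the generator used in diago_test.cpp; the names random_symmetric, random_overlap and max_abs_diff are hypothetical. S is made strictly diagonally dominant with a positive diagonal, which guarantees positive definiteness, as the generalized problem H x = lambda S x requires.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Row-major dense n x n matrix stored in a flat vector.
using Matrix = std::vector<double>;

// Random symmetric matrix: fill the upper triangle with values in
// [-1, 1) and mirror them into the lower triangle.
Matrix random_symmetric(std::size_t n, std::mt19937& rng) {
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    Matrix h(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i; j < n; ++j) {
            const double v = dist(rng);
            h[i * n + j] = v;
            h[j * n + i] = v;
        }
    }
    return h;
}

// Random overlap-like matrix: unit diagonal, symmetric off-diagonal
// entries bounded by 0.5 / n in magnitude. Each row's off-diagonal sum
// is then below 0.5, so S is strictly diagonally dominant and hence
// positive definite.
Matrix random_overlap(std::size_t n, std::mt19937& rng) {
    Matrix s = random_symmetric(n, rng);
    const double scale = 0.5 / static_cast<double>(n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            s[i * n + j] = (i == j) ? 1.0 : s[i * n + j] * scale;
        }
    }
    return s;
}

// Largest absolute difference between two eigenvalue lists of equal length.
double max_abs_diff(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}
```

With a fixed seed (for example std::mt19937 rng(42)) every MPI rank can build the same matrices, and max_abs_diff can then compare the eigenvalues returned by two solvers.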

Expected behavior

ScaLAPACK should run normally when more than 4 parallel cores are used.

To Reproduce

Steps to reproduce the behavior:

  1. modify file source/src_pdiag/test/diago_test.cpp
  2. add the test case DiagoPrepare(100, 100, 1, 7, true, "scalapack_gvx", "", "") inside the INSTANTIATE_TEST_SUITE_P macro
  3. compile the test with BUILD_TESTING=ON: cmake -B build -DBUILD_TESTING=ON; cmake --build build --target hsolver_diago; cmake --install build
  4. run the test: cd build/source/src_pdiag/test/; mpirun -np 4 hsolver_diago

Environment

docker image dp-harbor-registry.cn-zhangjiakou.cr.aliyuncs.com/dplc/abacus:gnu

caic99 commented 2 years ago

Hi @pxlxingliang, I ran the test with AddressSanitizer. The unmodified test cases pass, but errors are reported for the case in your issue. Could you check again whether the memory allocated for the ScaLAPACK routine is correct?

==2965895==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6020000348b8 at pc 0x7f6afef01490 bp 0x7ffd64419830 sp 0x7ffd64418fd8
READ of size 24 at 0x6020000348b8 thread T0
    #0 0x7f6afef0148f in __interceptor_memcpy ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:790
    #1 0x7f6af48b36b2 in mca_btl_vader_sendi (/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_btl_vader.so+0x66b2)
    #2 0x7f6af48947cb  (/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so+0xb7cb)
    #3 0x7f6af48953e1 in mca_pml_ob1_isend (/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so+0xc3e1)
    #4 0x7f6afc01a9da in ompi_coll_base_bcast_intra_generic (/lib/x86_64-linux-gnu/libmpi.so.40+0xa29da)
    #5 0x7f6afc01b23c in ompi_coll_base_bcast_intra_binomial (/lib/x86_64-linux-gnu/libmpi.so.40+0xa323c)
    #6 0x7f6af4805d7a in ompi_coll_tuned_bcast_intra_dec_fixed (/usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so+0x5d7a)
    #7 0x7f6afbfddb0f in MPI_Bcast (/lib/x86_64-linux-gnu/libmpi.so.40+0x65b0f)
    #8 0x7f6afc4b42d9 in dgebs2d_ (/lib/x86_64-linux-gnu/libscalapack-openmpi.so.2.1+0x252d9)
    #9 0x7f6afc85612a in pdsygvx_ (/lib/x86_64-linux-gnu/libscalapack-openmpi.so.2.1+0x3c712a)
    #10 0x564758742b63 in Diag_Scalapack_gvx::pdsygvx_once(int const*, int, int, double const*, double const*, double*, ModuleBase::matrix&) const /home/cc/abacus-develop/source/src_pdiag/diag_scalapack_gvx.cpp:37
    #11 0x56475874474d in Diag_Scalapack_gvx::pdsygvx_diag(int const*, int, int, double const*, double const*, double*, ModuleBase::matrix&) /home/cc/abacus-develop/source/src_pdiag/diag_scalapack_gvx.cpp:158
    #12 0x564758734830 in Pdiag_Double::diago_double_begin(int const&, Local_Orbital_wfc&, double*, double*, double*, double*) /home/cc/abacus-develop/source/src_pdiag/pdiag_double.cpp:371
    #13 0x564758709a3f in DiagoPrepare::diago() /home/cc/abacus-develop/source/src_pdiag/test/diago_test.cpp:302
    #14 0x5647586ef343 in DiagoTest_LCAO_Test::TestBody() /home/cc/abacus-develop/source/src_pdiag/test/diago_test.cpp:343
    #15 0x56475887ef7f in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x1edf7f)
    #16 0x564758877004 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x1e6004)
    #17 0x564758852809 in testing::Test::Run() (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x1c1809)
    #18 0x564758853290 in testing::TestInfo::Run() (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x1c2290)
    #19 0x564758853bc6 in testing::TestSuite::Run() (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x1c2bc6)
    #20 0x564758863874 in testing::internal::UnitTestImpl::RunAllTests() (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x1d2874)
    #21 0x5647588804ad in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x1ef4ad)
    #22 0x564758878208 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x1e7208)
    #23 0x564758861e93 in testing::UnitTest::Run() (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x1d0e93)
    #24 0x5647586e2629 in RUN_ALL_TESTS() /usr/local/include/gtest/gtest.h:2293
    #25 0x5647586e2629 in main /home/cc/abacus-develop/source/src_pdiag/test/diago_test.cpp:446
    #26 0x7f6afba5c082 in __libc_start_main ../csu/libc-start.c:308
    #27 0x5647586e618d in _start (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x5518d)

0x6020000348b8 is located 0 bytes to the right of 8-byte region [0x6020000348b0,0x6020000348b8)
allocated by thread T0 here:
    #0 0x7f6afef75587 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cc:104
    #1 0x5647587422b4 in Diag_Scalapack_gvx::pdsygvx_once(int const*, int, int, double const*, double const*, double*, ModuleBase::matrix&) const /usr/include/c++/9/ext/new_allocator.h:114
    #2 0x56475874474d in Diag_Scalapack_gvx::pdsygvx_diag(int const*, int, int, double const*, double const*, double*, ModuleBase::matrix&) /home/cc/abacus-develop/source/src_pdiag/diag_scalapack_gvx.cpp:158
    #3 0x564758734830 in Pdiag_Double::diago_double_begin(int const&, Local_Orbital_wfc&, double*, double*, double*, double*) /home/cc/abacus-develop/source/src_pdiag/pdiag_double.cpp:371
    #4 0x564758709a3f in DiagoPrepare::diago() /home/cc/abacus-develop/source/src_pdiag/test/diago_test.cpp:302
    #5 0x5647586ef343 in DiagoTest_LCAO_Test::TestBody() /home/cc/abacus-develop/source/src_pdiag/test/diago_test.cpp:343
    #6 0x56475887ef7f in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x1edf7f)
    #7 0x564758877004 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x1e6004)
    #8 0x564758852809 in testing::Test::Run() (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x1c1809)
    #9 0x564758853290 in testing::TestInfo::Run() (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x1c2290)
    #10 0x564758853bc6 in testing::TestSuite::Run() (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x1c2bc6)
    #11 0x564758863874 in testing::internal::UnitTestImpl::RunAllTests() (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x1d2874)
    #12 0x5647588804ad in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x1ef4ad)
    #13 0x564758878208 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x1e7208)
    #14 0x564758861e93 in testing::UnitTest::Run() (/home/cc/abacus-develop/build/source/src_pdiag/test/hsolver_diago+0x1d0e93)
    #15 0x5647586e2629 in RUN_ALL_TESTS() /usr/local/include/gtest/gtest.h:2293
    #16 0x5647586e2629 in main /home/cc/abacus-develop/source/src_pdiag/test/diago_test.cpp:446
    #17 0x7f6afba5c082 in __libc_start_main ../csu/libc-start.c:308

SUMMARY: AddressSanitizer: heap-buffer-overflow ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:790 in __interceptor_memcpy
Shadow bytes around the buggy address:
  0x0c047fffe8c0: fa fa fd fa fa fa fd fa fa fa fd fa fa fa fd fa
  0x0c047fffe8d0: fa fa fd fd fa fa 06 fa fa fa 06 fa fa fa 06 fa
  0x0c047fffe8e0: fa fa 06 fa fa fa 06 fa fa fa 00 00 fa fa fd fa
  0x0c047fffe8f0: fa fa fd fa fa fa fd fa fa fa fd fa fa fa fd fa
  0x0c047fffe900: fa fa fd fd fa fa fd fa fa fa fd fa fa fa 00 00
=>0x0c047fffe910: fa fa 00 fa fa fa 00[fa]fa fa 04 fa fa fa 00 00
  0x0c047fffe920: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c047fffe930: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c047fffe940: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c047fffe950: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c047fffe960: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==2965895==ABORTING

caic99 commented 2 years ago

This error seems to be raised here: https://github.com/deepmodeling/abacus-develop/blob/fd17cd2620c2cc036170ad85d3ccd1ba941b1bbc/source/src_pdiag/diag_scalapack_gvx.cpp#L213-L222

It may be caused by incorrect use of pdsygvx: https://github.com/deepmodeling/abacus-develop/blob/fd17cd2620c2cc036170ad85d3ccd1ba941b1bbc/source/src_pdiag/diag_scalapack_gvx.cpp#L156-L163

Please refer to #491 .
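
The AddressSanitizer report shows a READ of 24 bytes (three doubles) starting in an 8-byte heap region (one double) that was allocated inside pdsygvx_once and reached through dgebs2d_ from pdsygvx_. One possible reading, which this thread does not confirm, is that a one-element work buffer was passed to the routine (for instance during an lwork = -1 workspace query) while the routine accesses its first three elements. The sketch below illustrates the defensive pattern under that assumption. fake_workspace_query is a hypothetical stand-in, not the real pdsygvx_, and kMinQueryLen = 3 is inferred only from the 24-byte read.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a workspace query (lwork == -1). It is NOT
// the real pdsygvx_; it only mimics a routine that touches work[0..2]
// even during the query and reports the required size in work[0].
void fake_workspace_query(double* work, int lwork, int required) {
    assert(lwork == -1);
    work[1] = 0.0;  // scratch use of elements 1 and 2
    work[2] = 0.0;
    work[0] = static_cast<double>(required);
}

// Minimum number of doubles handed to the query. Assumption: the routine
// may access three elements, as the 24-byte read in the report suggests.
constexpr std::size_t kMinQueryLen = 3;

// Query the required size with a buffer that is already large enough for
// the query itself, then allocate the real workspace.
std::vector<double> allocate_work(int required) {
    std::vector<double> work(kMinQueryLen, 0.0);  // not a 1-element buffer
    fake_workspace_query(work.data(), -1, required);
    const std::size_t lwork =
        std::max(kMinQueryLen, static_cast<std::size_t>(work[0]));
    work.assign(lwork, 0.0);
    return work;
}
```

In this sketch the workspace never shrinks below three doubles, even when the query reports a smaller size. Whether that matches the actual requirement of pdsygvx should be checked against the ScaLAPACK documentation for the WORK argument.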

pxlxingliang commented 2 years ago

I retried the example with the latest ABACUS (after the HSOLVER refactor, the unit test file is now source/module_hsolver/test/diago_lcao_test.cpp), and the problem did not occur this time. Besides, the H matrix is randomly generated and does not correspond to a real system. Let's close this issue for now; if the problem occurs in a real system, we will investigate that specific system.