deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0

DCU tests failed #5214

Open pxlxingliang opened 1 week ago

pxlxingliang commented 1 week ago

Describe the Testing Issue

The daily DCU test failed on example 005_16Na on 2024-10-11.

https://app.bohrium.dp.tech/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fsched-abacustest-dcu-cg-372d8a

The error message:

                              ABACUS v3.8.0

               Atomic-orbital Based Ab-initio Computation at UStc                    

                     Website: http://abacus.ustc.edu.cn/                             
               Documentation: https://abacus.deepmodeling.com/                       
                  Repository: https://github.com/abacusmodeling/abacus-develop       
                              https://github.com/deepmodeling/abacus-develop         
                      Commit: 5329628 (Thu Oct 10 22:45:13 2024 +0800)

 Fri Oct 11 00:28:57 2024

Info: Local MPI proc number: 4,OpenMP thread number: 1,Total thread number: 4,Local thread limit: 32
[j12r4n15:21269:0:21269] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
BFD: Dwarf Error: found dwarf version '5', this reader only handles version 2, 3 and 4 information.
==== backtrace (tid:  21269) ====
 0 0x0000000000051213 ucs_debug_print_backtrace()  /public/home/bujd/tmp/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.6-x86_64/sources/ucx-1.8.0/src/ucs/debug/debug.c:625
 1 0x000000000008559c __GI___libc_free()  :0
 2 0x0000000000c35f99 std::string::assign()  ???:0
 3 0x0000000000c360b9 std::string::assign()  ???:0
 4 0x0000000000c3783e std::string::assign()  ???:0
 5 0x0000000000bc8cec std::string::assign()  ???:0
 6 0x0000000000c20106 std::string::assign()  ???:0
 7 0x0000000000c8583a hipGetCmdName()  ???:0
 8 0x0000000000ca05ee hipGetDeviceCount()  ???:0
 9 0x0000000000453344 base_device::information::get_device_flag()  ???:0
10 0x0000000000183f08 std::_Function_handler<void (ModuleIO::Input_Item const&, Parameter&), ModuleIO::ReadInput::item_system()::$_169>::_M_invoke()  read_input_item_system.cpp:0
11 0x00000000001e97aa ModuleIO::ReadInput::read_txt_input()  ???:0
12 0x00000000001e90ac ModuleIO::ReadInput::read_parameters()  ???:0
13 0x0000000000250de5 Driver::reading()  ???:0
14 0x0000000000250c3d Driver::init()  ???:0
15 0x00000000000602d7 main()  ???:0
16 0x00000000000223d5 __libc_start_main()  ???:0
17 0x0000000000060160 _start()  ???:0
=================================
[j12r4n15:21269] *** Process received signal ***
[j12r4n15:21269] Signal: Segmentation fault (11)
[j12r4n15:21269] Signal code:  (-6)
[j12r4n15:21269] Failing at address: 0x62e000005315
[j12r4n15:21269] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b34ca07f5d0]
[j12r4n15:21269] [ 1] /lib64/libc.so.6(cfree+0x1c)[0x2b34d47bc59c]
[j12r4n15:21269] [ 2] /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5(+0xc35f99)[0x2b34cc079f99]
[j12r4n15:21269] [ 3] /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5(+0xc360b9)[0x2b34cc07a0b9]
[j12r4n15:21269] [ 4] /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5(+0xc3783e)[0x2b34cc07b83e]
[j12r4n15:21269] [ 5] /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5(+0xbc8cec)[0x2b34cc00ccec]
[j12r4n15:21269] [ 6] /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5(+0xc20106)[0x2b34cc064106]
[j12r4n15:21269] [ 7] /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5(+0xc8583a)[0x2b34cc0c983a]
[j12r4n15:21269] [ 8] /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5(hipGetDeviceCount+0x17e)[0x2b34cc0e45ee]
[j12r4n15:21269] [ 9] /public/home/abacus/abacus-dcu/build-dcu/abacus_pw(+0x453344)[0x55b838642344]
[j12r4n15:21269] [10] /public/home/abacus/abacus-dcu/build-dcu/abacus_pw(+0x183f08)[0x55b838372f08]
[j12r4n15:21269] [11] /public/home/abacus/abacus-dcu/build-dcu/abacus_pw(+0x1e97aa)[0x55b8383d87aa]
[j12r4n15:21269] [12] /public/home/abacus/abacus-dcu/build-dcu/abacus_pw(+0x1e90ac)[0x55b8383d80ac]
[j12r4n15:21269] [13] /public/home/abacus/abacus-dcu/build-dcu/abacus_pw(+0x250de5)[0x55b83843fde5]
[j12r4n15:21269] [14] /public/home/abacus/abacus-dcu/build-dcu/abacus_pw(+0x250c3d)[0x55b83843fc3d]
[j12r4n15:21269] [15] /public/home/abacus/abacus-dcu/build-dcu/abacus_pw(+0x602d7)[0x55b83824f2d7]
[j12r4n15:21269] [16] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b34d47593d5]
[j12r4n15:21269] [17] /public/home/abacus/abacus-dcu/build-dcu/abacus_pw(+0x60160)[0x55b83824f160]
[j12r4n15:21269] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 21269 on node j12r4n15 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
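A note on reading the trace above: the fault is raised in `free()` while execution is still inside `libgalaxyhip.so.5`, reached through `hipGetDeviceCount`, which `abacus_pw` calls during input reading. The following is only a rough sketch for grouping the frames by the module they belong to, assuming the Open MPI style frame lines shown in this log. The helper names `parse_frames` and `summarize` are hypothetical and not part of ABACUS.

```python
import re
from itertools import groupby

# Assumed frame format (taken from the log in this issue), e.g.
# [host:pid] [ 8] /path/lib.so(symbol+0x17e)[0x2b34cc0e45ee]
FRAME_RE = re.compile(
    r"\[\s*(?P<index>\d+)\]\s+(?P<module>[^\s(]+)"
    r"\((?P<symbol>[^)]*)\)\[0x[0-9a-fA-F]+\]"
)

# A few frames copied verbatim from the log above.
SAMPLE = """\
[j12r4n15:21269] [ 1] /lib64/libc.so.6(cfree+0x1c)[0x2b34d47bc59c]
[j12r4n15:21269] [ 7] /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5(+0xc8583a)[0x2b34cc0c983a]
[j12r4n15:21269] [ 8] /public/software/compiler/rocm/dtk-23.10/lib/libgalaxyhip.so.5(hipGetDeviceCount+0x17e)[0x2b34cc0e45ee]
[j12r4n15:21269] [ 9] /public/home/abacus/abacus-dcu/build-dcu/abacus_pw(+0x453344)[0x55b838642344]
"""


def parse_frames(log_text):
    """Return (frame index, module file name, symbol+offset) for each frame line."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            # Keep only the file name of the shared object or executable.
            module = match.group("module").rsplit("/", 1)[-1]
            frames.append((int(match.group("index")), module, match.group("symbol")))
    return frames


def summarize(frames):
    """Group consecutive frames that belong to the same module."""
    return [
        (module, [index for index, _, _ in group])
        for module, group in groupby(frames, key=lambda frame: frame[1])
    ]


for module, indices in summarize(parse_frames(SAMPLE)):
    print(module, indices)
```

Run over the full log, a grouping like this makes it easier to see where the ABACUS frames end and the HIP runtime frames begin. That is useful when deciding whether a crash points at the application or at the node's runtime environment.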

Additional Context

No response


WHUweiqingzhou commented 2 days ago

Recent DCU tests have all passed. This issue may have been caused by a machine problem. We will close this issue at the next meeting.