Algebraic-Programming / LPF

A minimal communication layer for the implementation of immortal algorithms and for facilitating their broad use.
Apache License 2.0

LPF master failing IB Verbs functional tests #24

Open KADichev opened 3 weeks ago

KADichev commented 3 weeks ago

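On the master branch, the IB Verbs engine fails its functional tests on an IB Verbs-enabled ARM (aarch64) node. The API tests for the mpimsg, mpirma, ibverbs and hybrid engines fail as well.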

MPICH via Spack:


mpirun --version
HYDRA build details:
    Version:                                 4.2.2
    Release Date:                            Wed Jul  3 09:16:22 AM CDT 2024
    CC:                              /home/kdichev/spack/lib/spack/env/gcc/gcc      
    Configure options:                       '--disable-option-checking' '--prefix=/storage/users/kdichev/.spack/opt/spack/linux-ubuntu22.04-aarch64/gcc-11.4.0/mpich-4.2.2-obfj6bwsqcspbyonhdeqanjwb4r6hli5' '--disable-maintainer-mode' '--disable-silent-rules' '--enable-shared' '--with-pm=hydra' '--enable-romio' '--without-ibverbs' '--enable-wrapper-rpath=yes' '--with-yaksa=/storage/users/kdichev/.spack/opt/spack/linux-ubuntu22.04-aarch64/gcc-11.4.0/yaksa-0.3-eq7l7drd5agbnooxmhvlmj2ufi525qlo' '--with-hwloc=/storage/users/kdichev/.spack/opt/spack/linux-ubuntu22.04-aarch64/gcc-11.4.0/hwloc-2.9.3-c62ichcv4wlqidoeaxesejoptnb35exi' '--with-slurm=no' '--without-cuda' '--without-hip' '--with-device=ch4:ofi' '--with-libfabric=/storage/users/kdichev/.spack/opt/spack/linux-ubuntu22.04-aarch64/gcc-11.4.0/libfabric-1.21.0-kqh22v2p7by2wfbsw2boehmni4ddvs3s' '--enable-libxml2' '--with-datatype-engine=auto' 'CC=/home/kdichev/spack/lib/spack/env/gcc/gcc' 'CXX=/home/kdichev/spack/lib/spack/env/gcc/g++' 'FC=/home/kdichev/spack/lib/spack/env/gcc/gfortran' 'F77=/home/kdichev/spack/lib/spack/env/gcc/gfortran' '--cache-file=/dev/null' '--srcdir=.' 
'CFLAGS= -O2' 'LDFLAGS= -L/storage/users/kdichev/.spack/opt/spack/linux-ubuntu22.04-aarch64/gcc-11.4.0/libfabric-1.21.0-kqh22v2p7by2wfbsw2boehmni4ddvs3s/lib -L/storage/users/kdichev/.spack/opt/spack/linux-ubuntu22.04-aarch64/gcc-11.4.0/hwloc-2.9.3-c62ichcv4wlqidoeaxesejoptnb35exi/lib -L/storage/users/kdichev/.spack/opt/spack/linux-ubuntu22.04-aarch64/gcc-11.4.0/yaksa-0.3-eq7l7drd5agbnooxmhvlmj2ufi525qlo/lib' 'LIBS=' 'CPPFLAGS= -I/storage/users/kdichev/.spack/opt/spack/linux-ubuntu22.04-aarch64/gcc-11.4.0/libfabric-1.21.0-kqh22v2p7by2wfbsw2boehmni4ddvs3s/include -DNETMOD_INLINE=__netmod_inline_ofi__ -I/scratch/kdichev/.spack/stage/spack-stage-mpich-4.2.2-obfj6bwsqcspbyonhdeqanjwb4r6hli5/spack-src/src/mpl/include -I/scratch/kdichev/.spack/stage/spack-stage-mpich-4.2.2-obfj6bwsqcspbyonhdeqanjwb4r6hli5/spack-src/modules/json-c -I/storage/users/kdichev/.spack/opt/spack/linux-ubuntu22.04-aarch64/gcc-11.4.0/hwloc-2.9.3-c62ichcv4wlqidoeaxesejoptnb35exi/include -D_REENTRANT -I/scratch/kdichev/.spack/stage/spack-stage-mpich-4.2.2-obfj6bwsqcspbyonhdeqanjwb4r6hli5/spack-src/src/mpi/romio/include -I/scratch/kdichev/.spack/stage/spack-stage-mpich-4.2.2-obfj6bwsqcspbyonhdeqanjwb4r6hli5/spack-src/src/pmi/include -I/storage/users/kdichev/.spack/opt/spack/linux-ubuntu22.04-aarch64/gcc-11.4.0/yaksa-0.3-eq7l7drd5agbnooxmhvlmj2ufi525qlo/include'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Demux engines available:                 poll select
kdichev@srv02:~$ spack find mpich
==> In environment arm
==> 6 root specs
 -  hwloc  [+] lpf-hicr  [e] meson  [+] mpich  [+] openmpi  [+] openssh

-- linux-ubuntu22.04-aarch64 / gcc@11.4.0 -----------------------
mpich@4.2.2
==> 1 installed package
==> 0 concretized packages to be installed (show with `spack find -c`)

Configuration step:

../bootstrap.sh --functests=i-agree-with-googletest-license
.____   _____________________
|    |  \______   \_   _____/
|    |   |     ___/|    __)  
|    |___|    |    |     \   
|_______ \____|    \___  /   
        \/             \/    

Lightweight Parallel Foundations

Copyright (c) 2016-2021 by Huawei Technologies
All rights reserved.

BUILD BOOTSTRAP SCRIPT
======================

Configuring LPF build
--------------------------------------------------

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- The ASM compiler identification is GNU
-- Found assembler: /usr/bin/cc
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Installation directory prefix is /home/kdichev/LPF/build
-- Governing C++ standard is C++11
-- Found hwloc library
-- Hwloc library will be used
-- Found MPI_C: /home/kdichev/spack/var/spack/environments/arm/.spack-env/view/lib/libmpi.so (found version "4.1") 
-- Found MPI_CXX: /home/kdichev/spack/var/spack/environments/arm/.spack-env/view/lib/libmpicxx.so (found version "4.1") 
-- Found MPI: TRUE (found version "4.1")  
-- The following engines will be built: pthread;mpimsg;mpirma;ibverbs;hybrid
-- Unit and API tests will be built
-- Hybrid implementation's multi-node layer is 'ibverbs'
-- Configuring done

-- Generating done
-- Build files have been written to: /home/kdichev/LPF/build

DONE

Configuration Options 
------------------------------------------------------

 - Build directory        = /home/kdichev/LPF/build

 - Installation directory = /home/kdichev/LPF/build

 - Build configuration    = Release

 - Build documentation    = OFF

 - Functional Tests       = ON

 - Performance Tests      = OFF

--- Note:
To build this project run 'make', to install 'make install'.  If you are
the lucky owner of a multi-core processor, consider using the '-j' option
of make.

Test step:

ctest
Test project /home/kdichev/LPF/build
      Start  1: time_test
 1/31 Test  #1: time_test ........................   Passed    0.01 sec
      Start  2: memreg_test
 2/31 Test  #2: memreg_test ......................   Passed    0.01 sec
      Start  3: sparseset_test
 3/31 Test  #3: sparseset_test ...................   Passed    0.01 sec
      Start  4: dynamichook_1proc
 4/31 Test  #4: dynamichook_1proc ................   Passed    0.16 sec
      Start  5: dynamichook_2proc
 5/31 Test  #5: dynamichook_2proc ................   Passed    0.68 sec
      Start  6: dynamichook_3proc
 6/31 Test  #6: dynamichook_3proc ................   Passed    0.30 sec
      Start  7: dynamichook_10proc
 7/31 Test  #7: dynamichook_10proc ...............   Passed    0.76 sec
      Start  8: ibverbs_test_1
 8/31 Test  #8: ibverbs_test_1 ...................***Failed   16.58 sec
      Start  9: ibverbs_test_2
 9/31 Test  #9: ibverbs_test_2 ...................***Failed   19.18 sec
      Start 10: ibverbs_test_5
10/31 Test #10: ibverbs_test_5 ...................***Failed   27.91 sec
      Start 11: ibverbs_test_10
11/31 Test #11: ibverbs_test_10 ..................***Failed    1.04 sec
      Start 12: spall2all_test_1
12/31 Test #12: spall2all_test_1 .................   Passed   26.28 sec
      Start 13: spall2all_test_2
13/31 Test #13: spall2all_test_2 .................   Passed   27.57 sec
      Start 14: spall2all_test_5
14/31 Test #14: spall2all_test_5 .................   Passed   34.59 sec
      Start 15: spall2all_test_10
15/31 Test #15: spall2all_test_10 ................   Passed   37.74 sec
      Start 16: dall2all_test_1
16/31 Test #16: dall2all_test_1 ..................   Passed    0.20 sec
      Start 17: dall2all_test_2
17/31 Test #17: dall2all_test_2 ..................   Passed    0.21 sec
      Start 18: dall2all_test_5
18/31 Test #18: dall2all_test_5 ..................   Passed    0.34 sec
      Start 19: dall2all_test_10
19/31 Test #19: dall2all_test_10 .................   Passed    0.59 sec
      Start 20: hall2all_test_1
20/31 Test #20: hall2all_test_1 ..................   Passed   25.67 sec
      Start 21: hall2all_test_2
21/31 Test #21: hall2all_test_2 ..................   Passed   26.56 sec
      Start 22: hall2all_test_5
22/31 Test #22: hall2all_test_5 ..................   Passed   29.90 sec
      Start 23: hall2all_test_10
23/31 Test #23: hall2all_test_10 .................   Passed   29.64 sec
      Start 24: messagesort_test
24/31 Test #24: messagesort_test .................   Passed    0.01 sec
      Start 25: ipcmesg_test
25/31 Test #25: ipcmesg_test .....................   Passed    0.01 sec
      Start 26: rwconflict_test
26/31 Test #26: rwconflict_test ..................   Passed    0.01 sec
      Start 27: API_pthread_Release
27/31 Test #27: API_pthread_Release ..............   Passed  391.16 sec
      Start 28: API_mpimsg_Release
 28/31 Test #28: API_mpimsg_Release ...............***Failed  970.88 sec
      Start 29: API_mpirma_Release
29/31 Test #29: API_mpirma_Release ...............***Failed  1138.01 sec
      Start 30: API_ibverbs_Release
30/31 Test #30: API_ibverbs_Release ..............***Failed  1068.34 sec
      Start 31: API_hybrid_Release
31/31 Test #31: API_hybrid_Release ...............***Failed  1083.44 sec

74% tests passed, 8 tests failed out of 31

Total Test time (real) = 4957.87 sec

The following tests FAILED:
      8 - ibverbs_test_1 (Failed)
      9 - ibverbs_test_2 (Failed)
     10 - ibverbs_test_5 (Failed)
     11 - ibverbs_test_10 (Failed)
     28 - API_mpimsg_Release (Failed)
     29 - API_mpirma_Release (Failed)
     30 - API_ibverbs_Release (Failed)
     31 - API_hybrid_Release (Failed)
Errors while running CTest
Output from these tests are in: /home/kdichev/LPF/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
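
For triaging a long CTest log like the one above, the failing entries and their run times can be pulled out programmatically. This is only a minimal sketch, assuming the per-test line format shown in this log; `failed_tests` and `LINE` are hypothetical names, not part of LPF or CTest.

```python
import re

# Matches per-test result lines as printed in the log above, e.g.
# " 8/31 Test  #8: ibverbs_test_1 ...................***Failed   16.58 sec"
LINE = re.compile(
    r"Test\s+#(\d+):\s+(\S+)\s+\.+\s*(\*\*\*Failed|Passed)\s+([\d.]+)\s+sec"
)

def failed_tests(log_text):
    """Return (number, name, seconds) for every failed test in a CTest log."""
    failures = []
    for match in LINE.finditer(log_text):
        number, name, status, seconds = match.groups()
        if status == "***Failed":
            failures.append((int(number), name, float(seconds)))
    return failures

# Three lines copied from the log above, used as sample input.
sample = """\
 7/31 Test  #7: dynamichook_10proc ...............   Passed    0.76 sec
 8/31 Test  #8: ibverbs_test_1 ...................***Failed   16.58 sec
29/31 Test #29: API_mpirma_Release ...............***Failed  1138.01 sec
"""

for number, name, seconds in failed_tests(sample):
    print(number, name, seconds)
```

The same function could be applied to the contents of Testing/Temporary/LastTest.log only if that file uses the same line format, which is not verified here.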

KADichev commented 3 weeks ago

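Allocation in IB Verbs fails: in the IBVerbs.manyPuts test, lpf::mpi::IBVerbs::stageQPs (ibverbs.cpp:247) throws an exception when resizeMesgq is called with size=100000. Backtrace from gdb: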

[ RUN      ] IBVerbs.manyPuts

Thread 1 "ibverbs_test" hit Catchpoint 1 (exception thrown), 0x0000400000a32dbc in __cxa_throw () from /home/kdichev/spack/var/spack/environments/arm/.spack-env/view/lib/libstdc++.so.6
(gdb) bt
#0  0x0000400000a32dbc in __cxa_throw ()
   from /home/kdichev/spack/var/spack/environments/arm/.spack-env/view/lib/libstdc++.so.6
#1  0x0000aaaaaab243d8 in lpf::mpi::IBVerbs::stageQPs (this=0xffffffffd948, 
    maxMsgs=100000) at /home/kdichev/LPF/src/MPI/ibverbs.cpp:247
#2  0x0000aaaaaab26470 in lpf::mpi::IBVerbs::resizeMesgq (this=0xffffffffd948, 
    size=100000) at /home/kdichev/LPF/src/MPI/ibverbs.cpp:435
#3  0x0000aaaaaab1b6f0 in IBVerbs_manyPuts_Test::TestBody (this=0xaaaaaacdeef0)
    at /home/kdichev/LPF/src/MPI/ibverbs.t.cpp:298
#4  0x0000aaaaaab7e228 in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#5  0x0000aaaaaab776e4 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#6  0x0000aaaaaab551b8 in testing::Test::Run() ()
#7  0x0000aaaaaab55a54 in testing::TestInfo::Run() ()
#8  0x0000aaaaaab56084 in testing::TestCase::Run() ()
#9  0x0000aaaaaab5ffc8 in testing::internal::UnitTestImpl::RunAllTests() ()
#10 0x0000aaaaaab7f1d0 in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) ()
#11 0x0000aaaaaab78644 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) ()
#12 0x0000aaaaaab5ecc4 in testing::UnitTest::Run() ()
#13 0x0000aaaaaab4d464 in RUN_ALL_TESTS() ()
#14 0x0000aaaaaab4d3cc in main ()
(gdb) f 2
#2  0x0000aaaaaab26470 in lpf::mpi::IBVerbs::resizeMesgq (this=0xffffffffd948, 
    size=100000) at /home/kdichev/LPF/src/MPI/ibverbs.cpp:435
435     stageQPs(size);
(gdb) p size
$1 = 100000
(gdb)