DARMA-tasking / vt

DARMA/vt => Virtual Transport

Bug in GreedyLB (assertion failure) #958

Closed lifflander closed 3 years ago

lifflander commented 4 years ago

Describe the bug

I've reproduced this in two contexts: in the Docker container (GNU gcc-7, debug) and on my Mac (clang-5).

Run this to reproduce:

ctest -I 157,158 --repeat-until-fail 1000 --output-on-failure .

Assertion that breaks:

vt: [0] lb: LBManager::releaseNow: finished LB, phase=3, invocations=1
vt: [0] lb: BaseLB: Statistic=P_l:  max=5.10, min=4.55, sum=19.24, avg=4.81, var=0.04, stdev=0.20, nproc=4, cardinality=4 skewness=0.17, kurtosis=-1.87, npr=4, imb=0.06, num_stats=1
vt: [0] lb: BaseLB: Statistic=O_l:  max=0.001, min=0.000, sum=0.02, avg=0.000, var=0.000, stdev=0.000, nproc=64, cardinality=64 skewness=0.02, kurtosis=-1.25, npr=64, imb=1.06, num_stats=2
vt: [0] lb: loadStats: load=4.55, total=19.24, avg=4.81, I=0.06,should_lb=true, auto=true, threshold=0.9390901317338556
vt: [1] ------------------------------------------------------------------------------------------------------------------------
vt: [1] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [1] ------------------------------------------------ Fatal Error on Node 1 -------------------------------------------------
vt: [1] ------------------------------------------------------------------------------------------------------------------------
vt: [1]
vt: [1]              Reason: Must have object
vt: [1]    Assertion failed: (theProcStats()->hasObjectToMigrate(obj_id))
vt: [1]                Node: 1
vt: [1]           Num Nodes: 4
vt: [1]                File: /vt/src/vt/vrt/collection/balance/baselb/baselb.cc
vt: [1]                Line: 230
vt: [1]            Function: transferMigrations
vt: [1]                Code: 1
vt: [1]           Build SHA: 181e188d3fca91bab0a2d0efc765d8366031e5da
vt: [1]           Build Ref: refs/heads/develop
vt: [1]         Description: heads/develop-0-g181e188d3f
vt: [1]            GIT Repo: *dirty*
vt: [1]            Hostname: 41fe2b81da16
vt: [1]
vt: [2] ------------------------------------------------------------------------------------------------------------------------
vt: [2] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [2] ------------------------------------------------ Fatal Error on Node 2 -------------------------------------------------
vt: [2] ------------------------------------------------------------------------------------------------------------------------
vt: [2]
vt: [2]              Reason: Must have object
vt: [2]    Assertion failed: (theProcStats()->hasObjectToMigrate(obj_id))
vt: [2]                Node: 2
vt: [2]           Num Nodes: 4
vt: [2]                File: /vt/src/vt/vrt/collection/balance/baselb/baselb.cc
vt: [2]                Line: 230
vt: [2]            Function: transferMigrations
vt: [2]                Code: 1
vt: [2]           Build SHA: 181e188d3fca91bab0a2d0efc765d8366031e5da
vt: [2]           Build Ref: refs/heads/develop
vt: [2]         Description: heads/develop-0-g181e188d3f
vt: [2]            GIT Repo: *dirty*
vt: [2]            Hostname: 41fe2b81da16
vt: [2]
vt: [3] ------------------------------------------------------------------------------------------------------------------------
vt: [3] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [3] ------------------------------------------------ Fatal Error on Node 3 -------------------------------------------------
vt: [3] ------------------------------------------------------------------------------------------------------------------------
vt: [3]
vt: [3]              Reason: Must have object
vt: [3]    Assertion failed: (theProcStats()->hasObjectToMigrate(obj_id))
vt: [3]                Node: 3
vt: [3]           Num Nodes: 4
vt: [3]                File: /vt/src/vt/vrt/collection/balance/baselb/baselb.cc
vt: [3]                Line: 230
vt: [3]            Function: transferMigrations
vt: [3]                Code: 1
vt: [3]           Build SHA: 181e188d3fca91bab0a2d0efc765d8366031e5da
vt: [3]           Build Ref: refs/heads/develop
vt: [3]         Description: heads/develop-0-g181e188d3f
vt: [3]            GIT Repo: *dirty*
vt: [3]            Hostname: 41fe2b81da16
vt: [3]
vt: [3] ------------------------------------------------------------------------------------------------------------------------
vt: [3] -------------------------------------------- Dump Stack Backtrace on Node 3 --------------------------------------------
vt: [3] ------------------------------------------------------------------------------------------------------------------------
vt: [3] 0   18  0x55be2ff00548 vt::debug::stack::dumpStack[abi:cxx11](int) + 83
vt: [3] 1   18  0x55be2fb00c98 vt::runtime::Runtime::output(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, bool, bool, bool) + 1868
vt: [3] 2   18  0x55be2f99e3cf vt::CollectiveAnyOps<(vt::runtime::eRuntimeInstance)0>::output(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, bool, bool, bool, bool) + 209
vt: [3] 3   18  0x55be2f99d163 vt::output(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, bool, bool, bool, bool) + 143
vt: [3] 4   18  0x55be2f78b85f std::enable_if<std::tuple_size<std::tuple<> >::value==(0), void>::type vt::debug::assert::assertOut<>(bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::tuple<>&&) + 359
vt: [3] 5   18  0x55be30102aae vt::vrt::collection::lb::BaseLB::transferMigrations(vt::vrt::collection::lb::TransferMsg<std::vector<std::tuple<unsigned long, short>, std::allocator<std::tuple<unsigned long, short> > > >*) + 682
vt: [3] 6   18  0x55be2fcfe1e6 vt::objgroup::dispatch::Dispatch<vt::vrt::collection::lb::BaseLB>::run(long, vt::messaging::BaseMsg*) + 920
vt: [3] 7   18  0x55be2fd137b6 vt::objgroup::ObjGroupManager::dispatch(vt::messaging::MsgSharedPtr<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >, long) + 860
vt: [3] 8   18  0x55be2fd142c8 vt::objgroup::dispatchObjGroup(vt::messaging::MsgSharedPtr<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >, long) + 150
vt: [3] 9   18  0x55be2f7fcb1f vt::runnable::Runnable<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >::runObj(long, vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope>*, short) + 725
vt: [3] 10  18  0x55be2fd143ab ./test_lb_extended(+0x1d393ab) [0x55be2fd143ab] + 0
vt: [3] 11  18  0x55be2fd147bf ./test_lb_extended(+0x1d397bf) [0x55be2fd147bf] + 0
vt: [3] 12  18  0x55be2f79136b std::function<void ()>::operator()() const + 77
vt: [3] 13  18  0x55be2feaee2d vt::sched::PriorityUnit::execute() + 467
vt: [3] 14  18  0x55be2feaec4d vt::sched::PriorityUnit::operator()() + 33
vt: [3] 15  18  0x55be2fea803f vt::sched::Scheduler::runWorkUnit(vt::sched::PriorityUnit&) + 691
vt: [3] 16  18  0x55be2fea8a1e vt::sched::Scheduler::scheduler(bool) + 566
vt: [3] 17  18  0x55be2fea8f75 vt::sched::Scheduler::runSchedulerWhile(std::function<bool ()>) + 845
vt: [3] 18  18  0x55be2feaa06b vt::runSchedulerThrough(unsigned long) + 145
vt: [3] 19  18  0x55be2feaa4f1 vt::runInEpochCollective(std::function<void ()>&&) + 437
vt: [3] 20  18  0x55be2fcc379c void vt::vrt::collection::balance::LBManager::makeLB<vt::vrt::collection::lb::GreedyLB>(vt::messaging::MsgSharedPtr<vt::vrt::collection::balance::StartLBMsg>) + 702
vt: [3] 21  18  0x55be2fc98fd0 vt::vrt::collection::balance::LBManager::collectiveImpl(unsigned long, vt::vrt::collection::balance::LBType, bool, unsigned long) + 738
vt: [3] 22  18  0x55be2f86c742 void vt::vrt::collection::balance::LBManager::sysLB<vt::vrt::collection::balance::InvokeBaseMsg<vt::collective::reduce::operators::ReduceTMsg<char> > >(vt::vrt::collection::balance::InvokeBaseMsg<vt::collective::reduce::operators::ReduceTMsg<char> >*) + 214
vt: [3] 23  18  0x55be2fcfe7cc vt::objgroup::dispatch::Dispatch<vt::vrt::collection::balance::LBManager>::run(long, vt::messaging::BaseMsg*) + 920
vt: [3] 24  18  0x55be2fd137b6 vt::objgroup::ObjGroupManager::dispatch(vt::messaging::MsgSharedPtr<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >, long) + 860
vt: [3] 25  18  0x55be2fd142c8 vt::objgroup::dispatchObjGroup(vt::messaging::MsgSharedPtr<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >, long) + 150
vt: [3] 26  18  0x55be2f7fcb1f vt::runnable::Runnable<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >::runObj(long, vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope>*, short) + 725
vt: [3] 27  18  0x55be2f7e8924 vt::runnable::Runnable<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >::run(long, void (*)(vt::messaging::BaseMsg*), vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope>*, short, int) + 144
vt: [3] 28  18  0x55be2fd4e7b3 vt::messaging::ActiveMessenger::deliverActiveMsg(vt::messaging::MsgSharedPtr<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> > const&, short const&, bool, std::function<void ()>) + 1821
vt: [3] 29  18  0x55be2fd4dfa6 vt::messaging::ActiveMessenger::processActiveMsg(vt::messaging::MsgSharedPtr<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> > const&, short const&, int const&, bool, std::function<void ()>) + 476
vt: [3] 30  18  0x55be2fd4d853 ./test_lb_extended(+0x1d72853) [0x55be2fd4d853] + 0
vt: [3] 31  18  0x55be2fd51a52 ./test_lb_extended(+0x1d76a52) [0x55be2fd51a52] + 0
vt: [3] 32  18  0x55be2f79136b std::function<void ()>::operator()() const + 77
vt: [3] 33  18  0x55be2feaee2d vt::sched::PriorityUnit::execute() + 467
vt: [3] 34  18  0x55be2feaec4d vt::sched::PriorityUnit::operator()() + 33
vt: [3] 35  18  0x55be2fea803f vt::sched::Scheduler::runWorkUnit(vt::sched::PriorityUnit&) + 691
vt: [3] 36  18  0x55be2fea8a1e vt::sched::Scheduler::scheduler(bool) + 566
vt: [3] 37  18  0x55be2fea8f75 vt::sched::Scheduler::runSchedulerWhile(std::function<bool ()>) + 845
vt: [3] 38  18  0x55be2fc99abd vt::vrt::collection::balance::LBManager::waitLBCollective() + 181
vt: [3] 39  18  0x55be2fbe34f6 vt::vrt::collection::CollectionManager::startPhaseCollective(std::function<void ()>, unsigned long) + 196
vt: [3] 40  18  0x55be2f6a2ef4 ./test_lb_extended(+0x16c7ef4) [0x55be2f6a2ef4] + 0
vt: [3] 41  18  0x55be2f6a4054 ./test_lb_extended(+0x16c9054) [0x55be2f6a4054] + 0
vt: [3] 42  18  0x55be2f79136b std::function<void ()>::operator()() const + 77
vt: [3] 43  18  0x55be2feaa444 vt::runInEpochCollective(std::function<void ()>&&) + 264
vt: [3] 44  18  0x55be2f6a3232 vt::tests::unit::TestLoadBalancer_test_load_balancer_1_Test::TestBody() + 726
vt: [3] 45  18  0x55be2f90cefb void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 101
vt: [3] 46  18  0x55be2f906ef7 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 90
vt: [3] 47  18  0x55be2f8e41d4 testing::Test::Run() + 238
vt: [3] 48  18  0x55be2f8e4b59 testing::TestInfo::Run() + 271
vt: [3] 49  18  0x55be2f8e524f testing::TestSuite::Run() + 297
vt: [3] 50  18  0x55be2f8f0c61 testing::internal::UnitTestImpl::RunAllTests() + 1029
vt: [3] 51  18  0x55be2f90dff3 bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 101
vt: [3] 52  18  0x55be2f907dd3 bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 90
vt: [3] 53  18  0x55be2f8ef55a testing::UnitTest::Run() + 192
vt: [3] 54  18  0x55be2f677c1e RUN_ALL_TESTS() + 35
vt: [3] 55  18  0x55be2f6769aa main + 109
vt: [3] 56  18  0x7fe93d5a6b97 __libc_start_main + 231
vt: [3] 57  18  0x55be2f6761aa _start + 42
vt: [3] ------------------------------------------------------------------------------------------------------------------------

lifflander commented 4 years ago

This is causing test failures on develop regularly now.

PhilMiller commented 3 years ago

Seen again in https://github.com/DARMA-tasking/vt/pull/1013/checks?check_run_id=1418863757, though the assertion output is different.

nlslatt commented 3 years ago

Seeing this again in https://dev.azure.com/DARMA-tasking/DARMA/_build/results?buildId=13207&view=logs&j=3dc8fd7e-4368-5a92-293e-d53cefc8c4b3&t=28db5144-7e5d-5c90-2820-8676d630d9d2&l=2376

lifflander commented 3 years ago

This is fixed. YAY