Closed lifflander closed 3 years ago
Describe the bug
I've reproduced this in two contexts: in the Docker container (GNU gcc-7, debug) and on my Mac (clang-5).
Run this to reproduce:
ctest -I 157,158 --repeat-until-fail 1000 --output-on-failure .
Assertion that breaks:
vt: [0] lb: LBManager::releaseNow: finished LB, phase=3, invocations=1
vt: [0] lb: BaseLB: Statistic=P_l: max=5.10, min=4.55, sum=19.24, avg=4.81, var=0.04, stdev=0.20, nproc=4, cardinality=4 skewness=0.17, kurtosis=-1.87, npr=4, imb=0.06, num_stats=1
vt: [0] lb: BaseLB: Statistic=O_l: max=0.001, min=0.000, sum=0.02, avg=0.000, var=0.000, stdev=0.000, nproc=64, cardinality=64 skewness=0.02, kurtosis=-1.25, npr=64, imb=1.06, num_stats=2
vt: [0] lb: loadStats: load=4.55, total=19.24, avg=4.81, I=0.06,should_lb=true, auto=true, threshold=0.9390901317338556
vt: [1] ------------------------------------------------------------------------------------------------------------------------
vt: [1] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [1] ------------------------------------------------ Fatal Error on Node 1 -------------------------------------------------
vt: [1] ------------------------------------------------------------------------------------------------------------------------
vt: [1]
vt: [1] Reason: Must have object
vt: [1] Assertion failed: (theProcStats()->hasObjectToMigrate(obj_id))
vt: [1] Node: 1
vt: [1] Num Nodes: 4
vt: [1] File: /vt/src/vt/vrt/collection/balance/baselb/baselb.cc
vt: [1] Line: 230
vt: [1] Function: transferMigrations
vt: [1] Code: 1
vt: [1] Build SHA: 181e188d3fca91bab0a2d0efc765d8366031e5da
vt: [1] Build Ref: refs/heads/develop
vt: [1] Description: heads/develop-0-g181e188d3f
vt: [1] GIT Repo: *dirty*
vt: [1] Hostname: 41fe2b81da16
vt: [1]
vt: [2] ------------------------------------------------------------------------------------------------------------------------
vt: [2] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [2] ------------------------------------------------ Fatal Error on Node 2 -------------------------------------------------
vt: [2] ------------------------------------------------------------------------------------------------------------------------
vt: [2]
vt: [2] Reason: Must have object
vt: [2] Assertion failed: (theProcStats()->hasObjectToMigrate(obj_id))
vt: [2] Node: 2
vt: [2] Num Nodes: 4
vt: [2] File: /vt/src/vt/vrt/collection/balance/baselb/baselb.cc
vt: [2] Line: 230
vt: [2] Function: transferMigrations
vt: [2] Code: 1
vt: [2] Build SHA: 181e188d3fca91bab0a2d0efc765d8366031e5da
vt: [2] Build Ref: refs/heads/develop
vt: [2] Description: heads/develop-0-g181e188d3f
vt: [2] GIT Repo: *dirty*
vt: [2] Hostname: 41fe2b81da16
vt: [2]
vt: [3] ------------------------------------------------------------------------------------------------------------------------
vt: [3] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [3] ------------------------------------------------ Fatal Error on Node 3 -------------------------------------------------
vt: [3] ------------------------------------------------------------------------------------------------------------------------
vt: [3]
vt: [3] Reason: Must have object
vt: [3] Assertion failed: (theProcStats()->hasObjectToMigrate(obj_id))
vt: [3] Node: 3
vt: [3] Num Nodes: 4
vt: [3] File: /vt/src/vt/vrt/collection/balance/baselb/baselb.cc
vt: [3] Line: 230
vt: [3] Function: transferMigrations
vt: [3] Code: 1
vt: [3] Build SHA: 181e188d3fca91bab0a2d0efc765d8366031e5da
vt: [3] Build Ref: refs/heads/develop
vt: [3] Description: heads/develop-0-g181e188d3f
vt: [3] GIT Repo: *dirty*
vt: [3] Hostname: 41fe2b81da16
vt: [3]
vt: [3] ------------------------------------------------------------------------------------------------------------------------
vt: [3] -------------------------------------------- Dump Stack Backtrace on Node 3 --------------------------------------------
vt: [3] ------------------------------------------------------------------------------------------------------------------------
vt: [3] 0 18 0x55be2ff00548 vt::debug::stack::dumpStack[abi:cxx11](int) + 83
vt: [3] 1 18 0x55be2fb00c98 vt::runtime::Runtime::output(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, bool, bool, bool) + 1868
vt: [3] 2 18 0x55be2f99e3cf vt::CollectiveAnyOps<(vt::runtime::eRuntimeInstance)0>::output(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, bool, bool, bool, bool) + 209
vt: [3] 3 18 0x55be2f99d163 vt::output(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, bool, bool, bool, bool) + 143
vt: [3] 4 18 0x55be2f78b85f std::enable_if<std::tuple_size<std::tuple<> >::value==(0), void>::type vt::debug::assert::assertOut<>(bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::tuple<>&&) + 359
vt: [3] 5 18 0x55be30102aae vt::vrt::collection::lb::BaseLB::transferMigrations(vt::vrt::collection::lb::TransferMsg<std::vector<std::tuple<unsigned long, short>, std::allocator<std::tuple<unsigned long, short> > > >*) + 682
vt: [3] 6 18 0x55be2fcfe1e6 vt::objgroup::dispatch::Dispatch<vt::vrt::collection::lb::BaseLB>::run(long, vt::messaging::BaseMsg*) + 920
vt: [3] 7 18 0x55be2fd137b6 vt::objgroup::ObjGroupManager::dispatch(vt::messaging::MsgSharedPtr<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >, long) + 860
vt: [3] 8 18 0x55be2fd142c8 vt::objgroup::dispatchObjGroup(vt::messaging::MsgSharedPtr<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >, long) + 150
vt: [3] 9 18 0x55be2f7fcb1f vt::runnable::Runnable<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >::runObj(long, vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope>*, short) + 725
vt: [3] 10 18 0x55be2fd143ab ./test_lb_extended(+0x1d393ab) [0x55be2fd143ab] + 0
vt: [3] 11 18 0x55be2fd147bf ./test_lb_extended(+0x1d397bf) [0x55be2fd147bf] + 0
vt: [3] 12 18 0x55be2f79136b std::function<void ()>::operator()() const + 77
vt: [3] 13 18 0x55be2feaee2d vt::sched::PriorityUnit::execute() + 467
vt: [3] 14 18 0x55be2feaec4d vt::sched::PriorityUnit::operator()() + 33
vt: [3] 15 18 0x55be2fea803f vt::sched::Scheduler::runWorkUnit(vt::sched::PriorityUnit&) + 691
vt: [3] 16 18 0x55be2fea8a1e vt::sched::Scheduler::scheduler(bool) + 566
vt: [3] 17 18 0x55be2fea8f75 vt::sched::Scheduler::runSchedulerWhile(std::function<bool ()>) + 845
vt: [3] 18 18 0x55be2feaa06b vt::runSchedulerThrough(unsigned long) + 145
vt: [3] 19 18 0x55be2feaa4f1 vt::runInEpochCollective(std::function<void ()>&&) + 437
vt: [3] 20 18 0x55be2fcc379c void vt::vrt::collection::balance::LBManager::makeLB<vt::vrt::collection::lb::GreedyLB>(vt::messaging::MsgSharedPtr<vt::vrt::collection::balance::StartLBMsg>) + 702
vt: [3] 21 18 0x55be2fc98fd0 vt::vrt::collection::balance::LBManager::collectiveImpl(unsigned long, vt::vrt::collection::balance::LBType, bool, unsigned long) + 738
vt: [3] 22 18 0x55be2f86c742 void vt::vrt::collection::balance::LBManager::sysLB<vt::vrt::collection::balance::InvokeBaseMsg<vt::collective::reduce::operators::ReduceTMsg<char> > >(vt::vrt::collection::balance::InvokeBaseMsg<vt::collective::reduce::operators::ReduceTMsg<char> >*) + 214
vt: [3] 23 18 0x55be2fcfe7cc vt::objgroup::dispatch::Dispatch<vt::vrt::collection::balance::LBManager>::run(long, vt::messaging::BaseMsg*) + 920
vt: [3] 24 18 0x55be2fd137b6 vt::objgroup::ObjGroupManager::dispatch(vt::messaging::MsgSharedPtr<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >, long) + 860
vt: [3] 25 18 0x55be2fd142c8 vt::objgroup::dispatchObjGroup(vt::messaging::MsgSharedPtr<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >, long) + 150
vt: [3] 26 18 0x55be2f7fcb1f vt::runnable::Runnable<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >::runObj(long, vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope>*, short) + 725
vt: [3] 27 18 0x55be2f7e8924 vt::runnable::Runnable<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >::run(long, void (*)(vt::messaging::BaseMsg*), vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope>*, short, int) + 144
vt: [3] 28 18 0x55be2fd4e7b3 vt::messaging::ActiveMessenger::deliverActiveMsg(vt::messaging::MsgSharedPtr<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> > const&, short const&, bool, std::function<void ()>) + 1821
vt: [3] 29 18 0x55be2fd4dfa6 vt::messaging::ActiveMessenger::processActiveMsg(vt::messaging::MsgSharedPtr<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> > const&, short const&, int const&, bool, std::function<void ()>) + 476
vt: [3] 30 18 0x55be2fd4d853 ./test_lb_extended(+0x1d72853) [0x55be2fd4d853] + 0
vt: [3] 31 18 0x55be2fd51a52 ./test_lb_extended(+0x1d76a52) [0x55be2fd51a52] + 0
vt: [3] 32 18 0x55be2f79136b std::function<void ()>::operator()() const + 77
vt: [3] 33 18 0x55be2feaee2d vt::sched::PriorityUnit::execute() + 467
vt: [3] 34 18 0x55be2feaec4d vt::sched::PriorityUnit::operator()() + 33
vt: [3] 35 18 0x55be2fea803f vt::sched::Scheduler::runWorkUnit(vt::sched::PriorityUnit&) + 691
vt: [3] 36 18 0x55be2fea8a1e vt::sched::Scheduler::scheduler(bool) + 566
vt: [3] 37 18 0x55be2fea8f75 vt::sched::Scheduler::runSchedulerWhile(std::function<bool ()>) + 845
vt: [3] 38 18 0x55be2fc99abd vt::vrt::collection::balance::LBManager::waitLBCollective() + 181
vt: [3] 39 18 0x55be2fbe34f6 vt::vrt::collection::CollectionManager::startPhaseCollective(std::function<void ()>, unsigned long) + 196
vt: [3] 40 18 0x55be2f6a2ef4 ./test_lb_extended(+0x16c7ef4) [0x55be2f6a2ef4] + 0
vt: [3] 41 18 0x55be2f6a4054 ./test_lb_extended(+0x16c9054) [0x55be2f6a4054] + 0
vt: [3] 42 18 0x55be2f79136b std::function<void ()>::operator()() const + 77
vt: [3] 43 18 0x55be2feaa444 vt::runInEpochCollective(std::function<void ()>&&) + 264
vt: [3] 44 18 0x55be2f6a3232 vt::tests::unit::TestLoadBalancer_test_load_balancer_1_Test::TestBody() + 726
vt: [3] 45 18 0x55be2f90cefb void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 101
vt: [3] 46 18 0x55be2f906ef7 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 90
vt: [3] 47 18 0x55be2f8e41d4 testing::Test::Run() + 238
vt: [3] 48 18 0x55be2f8e4b59 testing::TestInfo::Run() + 271
vt: [3] 49 18 0x55be2f8e524f testing::TestSuite::Run() + 297
vt: [3] 50 18 0x55be2f8f0c61 testing::internal::UnitTestImpl::RunAllTests() + 1029
vt: [3] 51 18 0x55be2f90dff3 bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 101
vt: [3] 52 18 0x55be2f907dd3 bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 90
vt: [3] 53 18 0x55be2f8ef55a testing::UnitTest::Run() + 192
vt: [3] 54 18 0x55be2f677c1e RUN_ALL_TESTS() + 35
vt: [3] 55 18 0x55be2f6769aa main + 109
vt: [3] 56 18 0x7fe93d5a6b97 __libc_start_main + 231
vt: [3] 57 18 0x55be2f6761aa _start + 42
vt: [3] ------------------------------------------------------------------------------------------------------------------------
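For readers unfamiliar with the check that fires, the log shows nodes 1, 2, and 3 each receiving a transfer list (a vector of `(unsigned long, short)` tuples, per frame 5) and asserting that every listed object is known locally. The following is only a minimal sketch of that invariant, not vt's code: `LocalStats`, `Transfer`, and `findMissing` are hypothetical names, and the sketch says nothing about why the object is missing in the real run.

```cpp
#include <cassert>
#include <tuple>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins; the element types follow the backtrace
// (std::tuple<unsigned long, short>), everything else is assumed.
using ObjID = unsigned long;
using NodeID = short;
using Transfer = std::vector<std::tuple<ObjID, NodeID>>;

// Assumed model of per-node stats: the set of objects this node can migrate.
struct LocalStats {
  std::unordered_set<ObjID> migratable;

  bool hasObjectToMigrate(ObjID id) const {
    return migratable.count(id) != 0;
  }
};

// Collect the object IDs in the transfer list that this node has no record
// of. A non-empty result corresponds to the "Must have object" condition.
std::vector<ObjID> findMissing(LocalStats const& stats, Transfer const& list) {
  std::vector<ObjID> missing;
  for (auto const& entry : list) {
    ObjID const id = std::get<0>(entry);
    if (!stats.hasObjectToMigrate(id)) {
      missing.push_back(id);
    }
  }
  return missing;
}
```

Reporting the missing IDs instead of aborting on the first one is a choice made for this sketch; it can make an intermittent failure like this one easier to diagnose because the offending object IDs appear in the output.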
This is causing test failures on develop regularly now.
Seen again in https://github.com/DARMA-tasking/vt/pull/1013/checks?check_run_id=1418863757, though the assertion output is different.
Seeing this again in https://dev.azure.com/DARMA-tasking/DARMA/_build/results?buildId=13207&view=logs&j=3dc8fd7e-4368-5a92-293e-d53cefc8c4b3&t=28db5144-7e5d-5c90-2820-8676d630d9d2&l=2376
This is fixed. YAY