DARMA-tasking / vt

DARMA/vt => Virtual Transport

#2201: implement memory aware TemperedLB in vt #2203

Closed ppebay closed 4 months ago

ppebay commented 10 months ago

Resolves #2201

This PR in particular:

github-actions[bot] commented 10 months ago

Pipelines results

PR tests (gcc-12, ubuntu, mpich)

Build for 10c35df650e37aec576c9f53b694de9d81c98759 (2024-01-25 22:56:08 UTC)


The following tests FAILED:
  238 - vt:TestCheckpoint.test_checkpoint_in_place_2_proc_2 (Timeout)
  239 - vt:TestCheckpoint.test_checkpoint_in_place_3_proc_2 (Timeout)
  255 - vt:*/TestLoadBalancerOther.test_load_balancer_other_1/*_proc_2 (Timeout)
  256 - vt:*/TestLoadBalancerOther.test_load_balancer_other_keep_last_elm/*_proc_2 (Timeout)


nlslatt commented 8 months ago

@lifflander @ppebay I am unable to run this in a production environment:

vt: [13] ------------------------------------------------------------------------------------------------------------------------
vt: [13] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [13] ------------------------------------------------ Fatal Error on Node 13 ------------------------------------------------
vt: [13] ------------------------------------------------------------------------------------------------------------------------
vt: [13] 
vt: [13]              Reason: Event being closed must be on the top of the open event stack.
vt: [13]    Assertion failed: (not open_events_.empty() and open_events_.back().ep == ep and open_events_.back().event == event)
vt: [13]                Node: 13
vt: [13]           Num Nodes: 14
vt: [13]                File: vt/src/vt/trace/trace.cc
vt: [13]                Line: 398
vt: [13]            Function: endProcessing
vt: [13]                Code: 1
vt: [13]           Build SHA: 68121476eacc3b25e4703bfbd22c9d91275f6046
vt: [13]           Build Ref: refs/heads/2201-implement-memory-aware-temperedlb-in-vt
vt: [13]         Description: heads/load-balancing-0-g68121476ea
vt: [13]            GIT Repo: *dirty*
vt: [13]            Hostname: mz7
vt: [13] 
vt: [13] ------------------------------------------------------------------------------------------------------------------------
vt: [13] ------------------------------------------- Dump Stack Backtrace on Node 13 --------------------------------------------
vt: [13] ------------------------------------------------------------------------------------------------------------------------
vt: [13] 0   18  0x1f54602 vt::debug::stack::dumpStack(int) + 50
vt: [13] 1   18  0x1b29cec vt::runtime::Runtime::output(std::string, int, bool, bool, bool) + 1516
vt: [13] 2   18  0x19d555e vt::CollectiveAnyOps<(vt::runtime::eRuntimeInstance)0>::output(std::string, int, bool, bool, bool, bool) + 94
vt: [13] 3   18  0x19d1da3 vt::output(std::string, int, bool, bool, bool, bool) + 67
vt: [13] 4   18  0x1dc1e80 std::enable_if<std::tuple_size<std::tuple<> >::value==(0), void>::type vt::debug::assert::assertOut<>(bool, std::string, std::string const&, std::string const&, int, std::string const&, int, std::tuple<>&&) [clone .isra.0] + 192
vt: [13] 5   18  0x1dcfaf9 vt::trace::Trace::endProcessing(vt::trace::TraceProcessingTag const&, vt::TimeTypeWrapper) + 681
vt: [13] 6   18  0x1dd11d8 std::_Function_handler<void (), vt::trace::Trace::startup()::{lambda()#1}>::_M_invoke(std::_Any_data const&) + 24
vt: [13] 7   18  0x1f0af42 vt::sched::Scheduler::triggerEvent(vt::sched::SchedulerEvent const&) + 98
vt: [13] 8   18  0x229f8fd vt::vrt::collection::lb::TemperedLB::considerSwapsAfterLock(vt::messaging::MsgSharedPtr<vt::vrt::collection::lb::TemperedLB::LockedInfoMsg>) + 3965
vt: [13] 9   18  0x22a3c77 vt::vrt::collection::lb::TemperedLB::lockObtained(vt::vrt::collection::lb::TemperedLB::LockedInfoMsg*) + 2983
vt: [13] 10  18  0x1ee491a vt::runnable::RunnableNew::run() + 138
vt: [13] 11  18  0x2336fda vt::sched::BaseUnit::execute() + 26
vt: [13] 12  18  0x1f114bc vt::sched::Scheduler::runWorkUnit(vt::sched::BaseUnit&) + 92
vt: [13] 13  18  0x1f11bf7 vt::sched::Scheduler::runSchedulerOnceImpl(bool) + 1063
vt: [13] 14  18  0x22ac4b7 vt::vrt::collection::lb::TemperedLB::swapClusters() + 695
vt: [13] 15  18  0x22b35f6 vt::vrt::collection::lb::TemperedLB::doLBStages(double) + 7478
vt: [13] 16  18  0x22b419c vt::vrt::collection::lb::TemperedLB::runLB(double) + 1004
vt: [13] 17  0   0x0 Unwinding error: unable to obtain symbol name for this frame + 0
vt: [13] 18  18  0x1cf2fba vt::vrt::collection::balance::LBManager::runLB(unsigned long, vt::pipe::callback::cbunion::CallbackTyped<vt::vrt::collection::balance::ReassignmentMsg>) + 2234
vt: [13] 19  18  0x1cf41bd vt::vrt::collection::balance::LBManager::startLB(unsigned long, vt::vrt::collection::balance::LBType, vt::pipe::callback::cbunion::CallbackTyped<vt::vrt::collection::balance::ReassignmentMsg>) + 3053
vt: [13] 20  18  0x1cf4ec9 vt::vrt::collection::balance::LBManager::selectStartLB(unsigned long) + 569
vt: [13] 21  18  0x1aea35f void vt::runInEpoch<vt::phase::PhaseManager::runHooks(vt::phase::PhaseHook)::{lambda()#1}>(vt::epoch::EpochType, vt::phase::PhaseManager::runHooks(vt::phase::PhaseHook)::{lambda()#1}&&) + 111
vt: [13] 22  18  0x1aea9f3 vt::phase::PhaseManager::runHooks(vt::phase::PhaseHook) + 787
vt: [13] 23  18  0x1aef658 vt::phase::PhaseManager::nextPhaseCollective() + 328
[snip]
vt: [13] ------------------------------------------------------------------------------------------------------------------------
vt: [13] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [13] ------------------------------------------------ Fatal Error on Node 13 ------------------------------------------------
vt: [13] ------------------------------------------------------------------------------------------------------------------------
vt: [13] 
vt: [13] Message: Assertion Failed
vt: [13] 
vt: [13] ------------------------------------------------------------------------------------------------------------------------
vt: [13] ------------------------------------------- Dump Stack Backtrace on Node 13 --------------------------------------------
vt: [13] ------------------------------------------------------------------------------------------------------------------------
vt: [13] 0   18  0x1f54602 vt::debug::stack::dumpStack(int) + 50
vt: [13] 1   18  0x1b29cec vt::runtime::Runtime::output(std::string, int, bool, bool, bool) + 1516
vt: [13] 2   18  0x1b2a437 vt::runtime::Runtime::abort(std::string, int) + 55
vt: [13] 3   18  0x19d5418 vt::CollectiveAnyOps<(vt::runtime::eRuntimeInstance)0>::abort(std::string, int) + 72
vt: [13] 4   18  0x19d1d00 vt::abort(std::string, int) + 32
vt: [13] 5   18  0x19d55cf vt::CollectiveAnyOps<(vt::runtime::eRuntimeInstance)0>::output(std::string, int, bool, bool, bool, bool) + 207
vt: [13] 6   18  0x19d1da3 vt::output(std::string, int, bool, bool, bool, bool) + 67
vt: [13] 7   18  0x1dc1e80 std::enable_if<std::tuple_size<std::tuple<> >::value==(0), void>::type vt::debug::assert::assertOut<>(bool, std::string, std::string const&, std::string const&, int, std::string const&, int, std::tuple<>&&) [clone .isra.0] + 192
vt: [13] 8   18  0x1dcfaf9 vt::trace::Trace::endProcessing(vt::trace::TraceProcessingTag const&, vt::TimeTypeWrapper) + 681
vt: [13] 9   18  0x1dd11d8 std::_Function_handler<void (), vt::trace::Trace::startup()::{lambda()#1}>::_M_invoke(std::_Any_data const&) + 24
vt: [13] 10  18  0x1f0af42 vt::sched::Scheduler::triggerEvent(vt::sched::SchedulerEvent const&) + 98
vt: [13] 11  18  0x229f8fd vt::vrt::collection::lb::TemperedLB::considerSwapsAfterLock(vt::messaging::MsgSharedPtr<vt::vrt::collection::lb::TemperedLB::LockedInfoMsg>) + 3965
vt: [13] 12  18  0x22a3c77 vt::vrt::collection::lb::TemperedLB::lockObtained(vt::vrt::collection::lb::TemperedLB::LockedInfoMsg*) + 2983
vt: [13] 13  18  0x1ee491a vt::runnable::RunnableNew::run() + 138
vt: [13] 14  18  0x2336fda vt::sched::BaseUnit::execute() + 26
vt: [13] 15  18  0x1f114bc vt::sched::Scheduler::runWorkUnit(vt::sched::BaseUnit&) + 92
vt: [13] 16  18  0x1f11bf7 vt::sched::Scheduler::runSchedulerOnceImpl(bool) + 1063
vt: [13] 17  18  0x22ac4b7 vt::vrt::collection::lb::TemperedLB::swapClusters() + 695
vt: [13] 18  18  0x22b35f6 vt::vrt::collection::lb::TemperedLB::doLBStages(double) + 7478
vt: [13] 19  18  0x22b419c vt::vrt::collection::lb::TemperedLB::runLB(double) + 1004
vt: [13] 20  0   0x0 Unwinding error: unable to obtain symbol name for this frame + 0
vt: [13] 21  18  0x1cf2fba vt::vrt::collection::balance::LBManager::runLB(unsigned long, vt::pipe::callback::cbunion::CallbackTyped<vt::vrt::collection::balance::ReassignmentMsg>) + 2234
vt: [13] 22  18  0x1cf41bd vt::vrt::collection::balance::LBManager::startLB(unsigned long, vt::vrt::collection::balance::LBType, vt::pipe::callback::cbunion::CallbackTyped<vt::vrt::collection::balance::ReassignmentMsg>) + 3053
vt: [13] 23  18  0x1cf4ec9 vt::vrt::collection::balance::LBManager::selectStartLB(unsigned long) + 569
vt: [13] 24  18  0x1aea35f void vt::runInEpoch<vt::phase::PhaseManager::runHooks(vt::phase::PhaseHook)::{lambda()#1}>(vt::epoch::EpochType, vt::phase::PhaseManager::runHooks(vt::phase::PhaseHook)::{lambda()#1}&&) + 111
vt: [13] 25  18  0x1aea9f3 vt::phase::PhaseManager::runHooks(vt::phase::PhaseHook) + 787
vt: [13] 26  18  0x1aef658 vt::phase::PhaseManager::nextPhaseCollective() + 328
[snip]

lifflander commented 8 months ago

I think that the way I've implemented this with a recursive handler is causing tracing issues. I will look into it.
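
For readers following the assertion in the log above, here is a minimal sketch of the invariant it checks, assuming a tracer that keeps open processing events on a stack. `OpenEvent`, `OpenEventStack`, `closeInOrder`, and `closeOuterWhileInnerOpen` are hypothetical names used for illustration and are not vt's API; only the condition tested in `end` is taken from the assertion message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a tracer's record of one open processing event.
struct OpenEvent {
  std::uint64_t ep;     // entry point (handler) identifier
  std::uint64_t event;  // identifier of this particular processing event
};

// Hypothetical tracer that only lets events be closed in LIFO order,
// mirroring the condition printed in the failed assertion.
class OpenEventStack {
public:
  void begin(std::uint64_t ep, std::uint64_t event) {
    open_events_.push_back(OpenEvent{ep, event});
  }

  // Returns false in the case where the assertion in the log would fire.
  bool end(std::uint64_t ep, std::uint64_t event) {
    bool const on_top =
      !open_events_.empty() &&
      open_events_.back().ep == ep &&
      open_events_.back().event == event;
    if (on_top) {
      open_events_.pop_back();
    }
    return on_top;
  }

  std::size_t depth() const { return open_events_.size(); }

private:
  std::vector<OpenEvent> open_events_;
};

// Properly nested: the handler run from a nested scheduler turn is closed
// before the outer handler that started the turn.
inline bool closeInOrder() {
  OpenEventStack trace;
  trace.begin(1, 100);  // outer handler starts
  trace.begin(2, 200);  // inner handler starts inside the outer one
  assert(trace.depth() == 2);
  bool const inner_ok = trace.end(2, 200);
  bool const outer_ok = trace.end(1, 100);
  return inner_ok && outer_ok && trace.depth() == 0;
}

// Out of order: the outer event is closed while the inner one is still open,
// so the event being closed is not on top of the stack.
inline bool closeOuterWhileInnerOpen() {
  OpenEventStack trace;
  trace.begin(1, 100);
  trace.begin(2, 200);
  return trace.end(1, 100);
}
```

In this sketch `closeInOrder()` returns true and `closeOuterWhileInnerOpen()` returns false; the second case is where a check like the one in the log would abort. Whether the recursive handler in TemperedLB produces exactly this ordering is not confirmed in this thread.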

nlslatt commented 4 months ago

I fixed the above typos in the rebased branch already.

lifflander commented 4 months ago

Do you think we should fully implement the sub-clustering with the full work model before we merge this?

nlslatt commented 4 months ago

Closing because this is superseded by #2278