DARMA-tasking / vt

DARMA/vt => Virtual Transport

Segfault during nextPhaseCollective after destroying collection #2238

Closed nlslatt closed 8 months ago

nlslatt commented 8 months ago

Describe the bug

There is a segfault in collection::Holder::foreach() when nextPhaseCollective() is called after a collection has been destroyed. This is an issue for applications that have collections that are no longer relevant for later phases.

To Reproduce

This can be observed in several ways. One way is to add a few lines to examples/collection/lb_iter.cc. Within the loop over phases, right before the nextPhaseCollective() call, conditionally destroy the collection:

diff --git a/examples/collection/lb_iter.cc b/examples/collection/lb_iter.cc
index 6d467b034..53c2b137c 100644
--- a/examples/collection/lb_iter.cc
+++ b/examples/collection/lb_iter.cc
@@ -130,6 +130,10 @@ int main(int argc, char** argv) {
       fmt::print("iteration: iter={},time={}\n", i, total_time);
     }

+    if (i == num_iter-1) {
+      vt::theCollection()->destroy(proxy);
+    }
+
     vt::thePhase()->nextPhaseCollective();
   }

This happens whether or not the destroyed collection was used earlier in the phase in which it was destroyed. This can be seen even in a single-rank run.

vt: [0] (t) phase: phase=6, duration=346e-3 s, rank_max_compute_time=346e-3 s, rank_avg_compute_time=346e-3 s, imbalance=0.000, grain_max_time=35.1e-3 s, migration count=0, lb_name=NoLB
0: iterWork: idx=idx(0)
0: iterWork: idx=idx(0)
0: iterWork: idx=idx(0)
iteration: iter=7,time=349e-3 s
vt: Caught SIGSEGV signal: 11 
vt: [0] ------------------------------------------------------------------------------------------------------------------------
vt: [0] -------------------------------------------- Dump Stack Backtrace on Node 0 --------------------------------------------
vt: [0] ------------------------------------------------------------------------------------------------------------------------
vt: [0] 0   18  0x1090bd762 vt::debug::stack::dumpStack(int) + 66
vt: [0] 1   18  0x1092b0891 vt::runtime::Runtime::handleSignalFailure() + 177
vt: [0] 2   18  0x1092b0adc vt::runtime::Runtime::sigHandler(int) + 156
vt: [0] 3   18  0x7ff8045105ed _sigtramp + 29
vt: [0] 4   18  0x109036668 vt::vrt::collection::Holder<vt::index::DenseIndexArray<int, (signed char)1>>::foreach(std::__1::function<void (vt::index::DenseIndexArray<int, (signed char)1> const&, vt::vrt::collection::Indexable<vt::index::DenseIndexArray<int, (signed char)1>>*)>) + 40
vt: [0] 5   18  0x109039a6b _ZN2vt3vrt10collection17CollectionManager19invokeCollectiveMsgINS1_7balance15CollectStatsMsgI7IterColEEXadL_ZNS4_16CollectionLBData13syncNextPhaseIS6_EEvPT_PNS5_ISA_EEEEEEvRKNS1_15CollectionProxyINSA_14CollectionTypeENSA_14CollectionType9IndexTypeEEENS_9m + 539
vt: [0] 6   18  0x1090397d7 _ZNK2vt3vrt10collection13BroadcastableI7IterColNS_5index15DenseIndexArrayIiLa1EEENS1_10ModifiableIS3_S6_NS1_8RDMAableIS3_S6_NS1_19BaseCollectionProxyIS3_S6_EEEEEEE16invokeCollectiveINS1_7balance15CollectStatsMsgIS3_EEXadL_ZNSF_16CollectionLBData13syncNext + 87
vt: [0] 7   18  0x10903973e std::__1::__function::__func<void vt::vrt::collection::CollectionManager::insertMetaCollection<IterCol, long long const&, bool const&, unsigned long long const&, bool const&, vt::index::DenseIndexArray<int, (signed char)1> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, unsigned long long, long long const&, bool const&, unsigned long long const&, bool const&, vt::index::DenseIndexArray<int, (signed char)1> const&)::'lambda0'(), std::__1::allocator<void vt::vrt::collection::CollectionManager::insertMetaCollection<IterCol, long long const&, bool const&, unsigned long long const&, bool const&, vt::index::DenseIndexArray<int, (signed char)1> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, unsigned long long, long long const&, bool const&, unsigned long long const&, bool const&, vt::index::DenseIndexArray<int, (signed char)1> const&)::'lambda0'()>, void ()>::operator()() + 46
vt: [0] 8   18  0x1093898ef std::__1::__function::__func<vt::vrt::collection::CollectionManager::startup()::$_0, std::__1::allocator<vt::vrt::collection::CollectionManager::startup()::$_0>, void ()>::operator()() + 47
vt: [0] 9   18  0x1092636c2 vt::phase::PhaseManager::runHooks(vt::phase::PhaseHook) + 1474
vt: [0] 10  18  0x109263c4b vt::phase::PhaseManager::nextPhaseCollective() + 619
vt: [0] 11  18  0x108f9d0cd main + 541
vt: [0] 12  18  0x7ff80418941f start + 1903

lifflander commented 8 months ago

We need to remove the proxy from CollectionManager::collect_lb_data_for_lb_ when the proxy is destroyed. Also, invokeCollectiveMsg should throw a sensible error if the proxy is not found.
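
To illustrate the two changes suggested above, here is a minimal, self-contained sketch. It is not vt code: `ToyCollectionManager`, `Holder`, `ProxyId`, `lb_hooks_`, and `nextPhase()` are hypothetical stand-ins, with `lb_hooks_` playing the role of `collect_lb_data_for_lb_`. The sketch only shows the shape of the fix: `destroy()` also drops the per-collection phase hook, and `invokeCollective()` reports a missing proxy with an exception instead of dereferencing a holder that no longer exists.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; none of these are vt's real types.
using ProxyId = std::uint64_t;

struct Holder {
  std::vector<int> elements;

  // Apply fn to every element held for this collection.
  void foreach(std::function<void(int)> const& fn) const {
    for (int e : elements) fn(e);
  }
};

class ToyCollectionManager {
public:
  // Create a collection and register the hook that runs at each phase change.
  ProxyId construct(std::vector<int> elems) {
    ProxyId const id = next_id_++;
    assert(holders_.count(id) == 0);  // ids are never reused in this sketch
    holders_.emplace(id, Holder{std::move(elems)});
    lb_hooks_.emplace(id, [this, id] { invokeCollective(id); });
    return id;
  }

  // Destroy a collection. Erasing the hook as well is the first part of the
  // suggested fix: without it, nextPhase() would still run a hook for a
  // collection whose holder is gone.
  void destroy(ProxyId id) {
    holders_.erase(id);
    lb_hooks_.erase(id);
  }

  // Second part of the suggested fix: fail with a clear error if the proxy
  // is unknown, rather than touching a missing holder.
  void invokeCollective(ProxyId id) {
    auto it = holders_.find(id);
    if (it == holders_.end()) {
      throw std::runtime_error(
        "invokeCollective: no collection found for proxy " + std::to_string(id)
      );
    }
    it->second.foreach([this](int) { ++visited_; });
  }

  // Stand-in for the phase transition: run every registered hook.
  void nextPhase() {
    for (auto const& [id, hook] : lb_hooks_) hook();
  }

  // Total number of elements visited by all hooks so far.
  int visited() const { return visited_; }

private:
  ProxyId next_id_ = 0;
  int visited_ = 0;
  std::unordered_map<ProxyId, Holder> holders_;
  std::unordered_map<ProxyId, std::function<void()>> lb_hooks_;
};
```

In this model, constructing two collections, destroying one, and then calling `nextPhase()` visits only the surviving collection's elements, while calling `invokeCollective()` directly on the destroyed proxy throws `std::runtime_error`. How the real `CollectionManager` stores its hooks and how it should report the error are left to the actual fix.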