Closed grondo closed 1 month ago
@milroy - have you had a chance to work on this? Is @trws offline for a while?
It would be nice to have a fix for this in our next release.
Edit: er, I think I probably understated that one. See @grondo's urgency note above.
I'm not sure @milroy is around this week. @jameshcorbett would you be available to help run this down?
I can try. Alternatively, Tom said he should be back Friday.
The PR I already have up to change the ref-counted strings to dense should fix this. I need to change resource-query to use that code path so we get enough test coverage, but that's probably the best way to get this dealt with.
Apologies for not getting on this sooner; we're under the gun to get a bunch of stuff done before the end of the meetings out here in Perth because of the upcoming major spec release. Will try to get that PR shaped up in the coming few hours.
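The type names in the backtrace show two storage flavors, `rc_storage` and `dense_storage`. As a rough illustration of why a dense scheme would sidestep the crash path, here is a minimal sketch that assumes dense storage hands out small integer ids and never removes an entry, so there is no reclaim step that could run during a match. The names `dense_interner`, `get`, `str`, and `size` are made up for this example and are not the libintern API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <limits>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical dense interner (an assumption, not the flux-sched code):
// each distinct string gets a small integer id, and entries are never erased.
class dense_interner {
public:
    using id_type = std::uint16_t;

    // Return the id for s, assigning the next free id on first use.
    id_type get(const std::string &s) {
        auto it = ids_.find(s);
        if (it != ids_.end())
            return it->second;
        if (strings_.size() > std::numeric_limits<id_type>::max())
            throw std::length_error("dense_interner: id space exhausted");
        auto id = static_cast<id_type>(strings_.size());
        strings_.push_back(s);  // deque keeps references to old elements valid
        ids_.emplace(s, id);
        return id;
    }

    // Look up the string for an id; throws std::out_of_range if unknown.
    const std::string &str(id_type id) const { return strings_.at(id); }

    std::size_t size() const { return strings_.size(); }

private:
    std::deque<std::string> strings_;
    std::unordered_map<std::string, id_type> ids_;
};
```

The trade-off in this sketch is that the table only grows, which is reasonable when the set of distinct strings (for example, resource type names) is small and bounded.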
I think the rank 0 broker on tuolumne hit this issue and crashed last night: (backtrace included in case it is helpful)
If there's any hint on how to prevent the problem, elcap/tuolumne users are interested...
#0 0x00007ffff516ac30 in std::_Hash_bytes(void const*, unsigned long, unsigned long) () from /lib64/libstdc++.so.6
#1 0x00007ffff7fe6e45 in std::_Hash_impl::hash (__seed=3339675911,
__clength=<optimized out>, __ptr=<optimized out>)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/functional_hash.h:206
#2 std::hash<std::basic_string_view<char, std::char_traits<char> > >::operator() (__str=..., this=<optimized out>)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/string_view:698
#3 intern::detail::string_hash::operator() (str=..., this=0x948a10)
at /builddir/build/BUILD/flux-sched-0.38.0/src/common/libintern/interner.cpp:40
#4 std::__detail::_Hash_code_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, intern::detail::sparse_string_node>, std::__detail::_Select1st, intern::detail::string_hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, true>::_M_hash_code (
__k=..., this=0x948a10)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/hashtable_policy.h:1270
#5 std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, intern::detail::sparse_string_node>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, intern::detail::sparse_string_node> >, std::__detail::_Select1st, std::equal_to<void>, intern::detail::string_hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_erase (
__k=..., this=0x948a10)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/hashtable.h:2360
#6 std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, intern::detail::sparse_string_node>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, intern::detail::sparse_string_node> >, std::__detail::_Select1st, std::equal_to<void>, intern::detail::string_hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::erase (
__k=..., this=0x948a10)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/hashtable.h:973
#7 std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, intern::detail::sparse_string_node, intern::detail::string_hash, std::equal_to<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, intern::detail::sparse_string_node> > >::erase (__x=..., this=0x948a10)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/unordered_map.h:763
#8 intern::detail::remove_rc (storage=0x948a00, s=<optimized out>)
at /builddir/build/BUILD/flux-sched-0.38.0/src/common/libintern/interner.cpp:93
#9 0x00007ffff7fe8652 in std::_Sp_counted_deleter<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*), std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (
this=<optimized out>)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:526
#10 0x00007ffff7ee2b76 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fff7307c490)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:354
#11 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (
this=0x7fff7307c490)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:317
#12 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (
this=0x7fff73c4e760, __in_chrg=<optimized out>)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:1071
#13 std::__shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (
this=0x7fff73c4e758, __in_chrg=<optimized out>)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:1524
#14 std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>::~shared_ptr (this=0x7fff73c4e758,
__in_chrg=<optimized out>)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr.h:175
#15 intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >::~interned_string (this=0x7fff73c4e758, __in_chrg=<optimized out>)
at /builddir/build/BUILD/flux-sched-0.38.0/src/common/libintern/interner.hpp:162
#16 Flux::resource_model::detail::evals_t::~evals_t (this=0x7fff73c4e730,
__in_chrg=<optimized out>)
at /builddir/build/BUILD/flux-sched-0.38.0/resource/evaluators/edge_eval_api.cpp:95
#17 0x00007ffff7ede427 in std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t>::~pair (this=0x7fff73c4e720, __in_chrg=<optimized out>)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_pair.h:185
#18 std::destroy_at<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > (__location=0x7fff73c4e720)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_construct.h:88
#19 std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > >::destroy<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > (__p=0x7fff73c4e720, __a=...)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/alloc_traits.h:537
#20 std::_Rb_tree<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t>, std::_Select1st<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> >, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >::_M_destroy_node (__p=0x7fff73c4e700,
this=<optimized out>)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_tree.h:623
#21 std::_Rb_tree<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t>, std::_Select1st<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> >, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >::_M_drop_node (this=<optimized out>,
__p=0x7fff73c4e700)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_tree.h:631
#22 std::_Rb_tree<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t>, std::_Select1st<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> >, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >::_M_erase (this=<optimized out>,
__x=0x7fff73c4e700)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_tree.h:1937
#23 std::_Rb_tree<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t>, std::_Select1st<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> >, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >::~_Rb_tree (this=0x7fff93ffdad8,
__in_chrg=<optimized out>)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_tree.h:984
#24 std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >::~map (this=0x7fff93ffdad8, __in_chrg=<optimized out>)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_map.h:312
#25 boost::container::allocator_traits<boost::container::small_vector_allocator<boost::container::new_allocator<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > > >::priv_destroy<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > (p=0x7fff93ffdad8)
at /usr/include/boost/container/allocator_traits.hpp:394
#26 boost::container::allocator_traits<boost::container::small_vector_allocator<boost::container::new_allocator<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > > >::destroy<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > (p=0x7fff93ffdad8, a=...)
at /usr/include/boost/container/allocator_traits.hpp:322
#27 boost::container::destroy_alloc_n<boost::container::small_vector_allocator<boost::container::new_allocator<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > >, std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >*, unsigned long> (n=0, f=0x7fff93ffdad8, a=...)
at /usr/include/boost/container/detail/copy_move_algo.hpp:960
#28 boost::container::vector<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >, boost::container::small_vector_allocator<boost::container::new_allocator<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > > >::~vector (this=0x7fff93ffdac0, __in_chrg=<optimized out>)
at /usr/include/boost/container/vector.hpp:1018
#29 boost::container::small_vector_base<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >, boost::container::new_allocator<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > >::~small_vector_base (this=0x7fff93ffdac0, __in_chrg=<optimized out>)
at /usr/include/boost/container/small_vector.hpp:323
#30 boost::container::small_vector<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >, 2ul, boost::container::new_allocator<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > >::~small_vector (this=0x7fff93ffdac0, __in_chrg=<optimized out>)
at /usr/include/boost/container/small_vector.hpp:473
#31 intern::interned_key_vec<intern::interned_string<intern::dense_storage<Flux::resource_model::subsystem_tag, unsigned char> >, std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >, 2>::~interned_key_vec (
this=0x7fff93ffdac0, __in_chrg=<optimized out>)
at /builddir/build/BUILD/flux-sched-0.38.0/src/common/libintern/interner.hpp:323
#32 Flux::resource_model::scoring_api_t::~scoring_api_t (this=0x7fff93ffdac0,
__in_chrg=<optimized out>)
at /builddir/build/BUILD/flux-sched-0.38.0/resource/evaluators/scoring_api.cpp:35
#33 0x00007ffff7eb76fd in Flux::resource_model::detail::dfu_impl_t::dom_dfv (
this=<optimized out>, meta=..., u=<optimized out>, resources=...,
pristine=<optimized out>, excl=0x7fff93ffdbeb, to_parent=...)
at /builddir/build/BUILD/flux-sched-0.38.0/resource/traversers/dfu_impl.cpp:765
#34 0x00007ffff7eb8206 in Flux::resource_model::detail::dfu_impl_t::select (
this=this@entry=0x7fff72b4d290, j=..., root=root@entry=0, meta=...,
excl=excl@entry=false)
at /builddir/build/BUILD/flux-sched-0.38.0/resource/traversers/dfu_impl.cpp:1165
#35 0x00007ffff7eabb96 in Flux::resource_model::dfu_traverser_t::schedule (
this=0x7fff72b4d290, jobspec=..., meta=..., x=<optimized out>,
op=<optimized out>, root=0, dfv=std::unordered_map with 2 elements = {...})
at /builddir/build/BUILD/flux-sched-0.38.0/resource/traversers/dfu.cpp:195
#36 0x00007ffff7eac4be in Flux::resource_model::dfu_traverser_t::run (
this=this@entry=0x7fff72b4d290, jobspec=..., writers=
std::shared_ptr<Flux::resource_model::match_writers_t> (use count 1, weak count 0) = {...}, op=op@entry=MATCH_ALLOCATE_ORELSE_RESERVE,
jobid=jobid@entry=61557454135100416, at=at@entry=0x7fff93ffe278)
at /builddir/build/BUILD/flux-sched-0.38.0/resource/traversers/dfu.cpp:380
#37 0x00007ffff7e8165b in run (
ctx=std::shared_ptr<resource_ctx_t> (use count 2, weak count 0) = {...},
jobid=61557454135100416, cmd=0x7fff73c4f010 "allocate_orelse_reserve",
jstr=..., at=0x7fff93ffe278, errp=0x0)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:1665
#38 0x00007ffff7e83265 in run_match (
ctx=std::shared_ptr<resource_ctx_t> (use count 2, weak count 0) = {...},
jobid=61557454135100416, cmd=0x7fff73c4f010 "allocate_orelse_reserve",
jstr="{\"resources\":[{\"type\":\"node\",\"count\":32,\"exclusive\":true,\"with\":[{\"type\":\"slot\",\"count\":1,\"with\":[{\"type\":\"core\",\"count\":1}],\"label\":\"task\"}]}],\"tasks\":[{\"command\":[\"flux\",\"broker\",\"{{tmpdir}}/script\""..., now=now@entry=0x7fff93ffe280, at=at@entry=0x7fff93ffe278,
overhead=0x7fff93ffe288, o=..., errp=0x0)
at /builddir/build/BUILD/flux-sched-0.38.0/resource/modules/resource_match.cpp:1749
#39 0x00007ffff7e8c5e4 in match_multi_request_cb (h=0x7fff72b56d60,
w=<optimized out>, msg=0x7fff8038a4b0, arg=<optimized out>)
at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/new_allocator.h:80
#40 0x00007ffff7b6b61a in call_handler (mh=0x7fff72b70810,
msg=msg@entry=0x7fff8038a4b0) at msg_handler.c:344
#41 0x00007ffff7b6bc4b in dispatch_message (type=1, msg=0x7fff8038a4b0,
d=0x7fff72b9e640) at msg_handler.c:380
#42 handle_cb (r=0x7fff72b71240, hw=<optimized out>, revents=<optimized out>,
arg=0x7fff72b9e640) at msg_handler.c:481
#43 0x00007ffff7b94903 in ev_invoke_pending (loop=0x7fff65c53530) at ev.c:3770
#44 0x00007ffff7b989a8 in ev_run (flags=0, loop=0x7fff65c53530) at ev.c:4190
#45 ev_run (loop=0x7fff65c53530, flags=0) at ev.c:4021
#46 0x00007ffff7b6a9ef in flux_reactor_run (r=0x7fff72b71240,
flags=<optimized out>) at reactor.c:124
#47 0x00007ffff7e8d793 in mod_main (h=0x7fff72b56d60, argc=<optimized out>,
argv=<optimized out>)
at /builddir/build/BUILD/flux-sched-0.38.0/resource/modules/resource_match.cpp:3021
#48 0x000000000041170f in module_thread (arg=0x2ed75b0) at module.c:225
#49 0x00007ffff792f1ca in start_thread () from /lib64/libpthread.so.0
#50 0x00007ffff61c78d3 in clone () from /lib64/libc.so.6
We should get the PR out ASAP. There must be some resource type string that isn't held persistently, and some rare race that allows it to be reaped and then hashed. If we could figure out which resource type that is, we could create a jobspec that can't run, or something similar, and keep it around as a workaround. Without knowing that, getting the PR out is the best bet.
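To make the "held persistently" idea concrete, here is a minimal single-threaded sketch of a ref-counted interner whose deleter erases the table entry when the last handle goes away, which is the general shape frames #8 and #9 of the backtrace suggest. It is an assumption-laden illustration, not the libintern implementation: `rc_interner`, `get`, and `size` are hypothetical names, and the sketch does not reproduce the race itself. It only shows that an entry is reaped when nothing holds it and survives while one long-lived handle exists.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical, simplified ref-counted interner (NOT the flux-sched code).
// Handles must not outlive the interner that produced them.
class rc_interner {
public:
    using handle = std::shared_ptr<const std::string>;

    handle get(const std::string &s) {
        auto it = table_.find(s);
        if (it != table_.end()) {
            if (handle live = it->second.lock())
                return live;  // reuse the entry that is still alive
        }
        // The deleter runs when the last handle is dropped and removes the
        // table entry. Erasing hashes *p, so p must still be valid here.
        handle h(new std::string(s), [this](const std::string *p) {
            table_.erase(*p);
            delete p;
        });
        table_[s] = h;  // the table holds only a weak reference
        return h;
    }

    std::size_t size() const { return table_.size(); }

private:
    std::unordered_map<std::string, std::weak_ptr<const std::string>> table_;
};
```

In this sketch, holding one handle for a given string (the analogue of keeping an unrunnable jobspec around) means the erase path never runs for that string, no matter how many temporary handles come and go.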
On elcap a user is running a series of jobs that appear to cause a Fluxion crash roughly 10% of the time. We have one corefile (in my homedir under fluxion-crash). I believe this is still occurring for one or two users, so we'll want to determine how to get a patch for this specific issue to apply to the current flux-sched RPM ASAP. @trws took a look and reported:
Here's the backtrace: