flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

fluxion crash in 10% of user runs on elcap #1298

Closed (grondo closed 1 month ago)

grondo commented 2 months ago

On elcap a user is running a series of jobs that appear to cause a Fluxion crash roughly 10% of the time. We have one corefile (in my homedir under fluxion-crash). I believe this is still occurring for one or two users, so we'll want to determine how to get a patch for this specific issue to apply to the current flux-sched RPM asap.

@trws took a look and reported:

Ok, this is down in the resource_type_t string freeing cleanup logic. If nothing else, the in progress PR to make the dense storage backend work for resource type would make this particular failure mode impossible.

Here's the backtrace:

#0  0x00001555526c1c30 in std::_Hash_bytes(void const*, unsigned long, unsigned long) () from /lib64/libstdc++.so.6
#1  0x000015555553be45 in std::_Hash_impl::hash (__seed=3339675911, 
    __clength=<optimized out>, __ptr=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/functional_hash.h:206
#2  std::hash<std::basic_string_view<char, std::char_traits<char> > >::operator() (__str=..., this=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/string_view:698
#3  intern::detail::string_hash::operator() (str=..., this=0x754c50)
    at /builddir/build/BUILD/flux-sched-0.38.0/src/common/libintern/interner.cpp:40
#4  std::__detail::_Hash_code_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, intern::detail::sparse_string_node>, std::__detail::_Select1st, intern::detail::string_hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, true>::_M_hash_code (
    __k=..., this=0x754c50)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/hashtable_policy.h:1270
#5  std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, intern::detail::sparse_string_node>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, intern::detail::sparse_string_node> >, std::__detail::_Select1st, std::equal_to<void>, intern::detail::string_hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_erase (
    __k=..., this=0x754c50)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/hashtable.h:2360
#6  std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, intern::detail::sparse_string_node>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, intern::detail::sparse_string_node> >, std::__detail::_Select1st, std::equal_to<void>, intern::detail::string_hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::erase (__k=..., this=0x754c50) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/hashtable.h:973
#7  std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, intern::detail::sparse_string_node, intern::detail::string_hash, std::equal_to<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, intern::detail::sparse_string_node> > >::erase (__x=..., this=0x754c50) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/unordered_map.h:763
#8  intern::detail::remove_rc (storage=0x754c40, s=<optimized out>) at /builddir/build/BUILD/flux-sched-0.38.0/src/common/libintern/interner.cpp:93
#9  0x000015555553d652 in std::_Sp_counted_deleter<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*), std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:526
#10 0x0000155540121b76 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x1554f8945930) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:354
#11 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x1554f8945930) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:317
#12 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x1554f964ea40, __in_chrg=<optimized out>) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:1071
#13 std::__shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x1554f964ea38, __in_chrg=<optimized out>) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:1524
#14 std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>::~shared_ptr (this=0x1554f964ea38, __in_chrg=<optimized out>) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr.h:175
#15 intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >::~interned_string (this=0x1554f964ea38, __in_chrg=<optimized out>) at /builddir/build/BUILD/flux-sched-0.38.0/src/common/libintern/interner.hpp:162
#16 Flux::resource_model::detail::evals_t::~evals_t (this=0x1554f964ea10, __in_chrg=<optimized out>) at /builddir/build/BUILD/flux-sched-0.38.0/resource/evaluators/edge_eval_api.cpp:95
#17 0x000015554011d427 in std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t>::~pair (this=0x1554f964ea00, __in_chrg=<optimized out>) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_pair.h:185
#18 std::destroy_at<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > (__location=0x1554f964ea00) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_construct.h:88
#19 std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > >::destroy<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > (__p=0x1554f964ea00, __a=...) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/alloc_traits.h:537
#20 std::_Rb_tree<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t>, std::_Select1st<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> >, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >::_M_destroy_node (__p=0x1554f964e9e0, this=<optimized out>) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_tree.h:623
#21 std::_Rb_tree<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t>, std::_Select1st<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> >, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >::_M_drop_node (this=<optimized out>, __p=0x1554f964e9e0) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_tree.h:631
#22 std::_Rb_tree<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t>, std::_Select1st<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> >, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >::_M_erase (this=<optimized out>, __x=0x1554f964e9e0) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_tree.h:1937
#23 std::_Rb_tree<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t>, std::_Select1st<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> >, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >::~_Rb_tree (this=0x1555205d1ad8, __in_chrg=<optimized out>) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_tree.h:984
#24 std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >::~map (this=0x1555205d1ad8, __in_chrg=<optimized out>) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_map.h:312
#25 boost::container::allocator_traits<boost::container::small_vector_allocator<boost::container::new_allocator<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > > >::priv_destroy<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > (p=0x1555205d1ad8) at /usr/include/boost/container/allocator_traits.hpp:394
#26 boost::container::allocator_traits<boost::container::small_vector_allocator<boost::container::new_allocator<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > > >::destroy<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > (p=0x1555205d1ad8, a=...) at /usr/include/boost/container/allocator_traits.hpp:322
#27 boost::container::destroy_alloc_n<boost::container::small_vector_allocator<boost::container::new_allocator<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > >, std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >*, unsigned long> (n=0, f=0x1555205d1ad8, a=...) at /usr/include/boost/container/detail/copy_move_algo.hpp:960
#28 boost::container::vector<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >, boost::container::small_vector_allocator<boost::container::new_allocator<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > > >::~vector (this=0x1555205d1ac0, __in_chrg=<optimized out>) at /usr/include/boost/container/vector.hpp:1018
#29 boost::container::small_vector_base<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >, boost::container::new_allocator<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > >::~small_vector_base (this=0x1555205d1ac0, __in_chrg=<optimized out>) at /usr/include/boost/container/small_vector.hpp:323
#30 boost::container::small_vector<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >, 2ul, boost::container::new_allocator<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > >::~small_vector (this=0x1555205d1ac0, __in_chrg=<optimized out>) at /usr/include/boost/container/small_vector.hpp:473
#31 intern::interned_key_vec<intern::interned_string<intern::dense_storage<Flux::resource_model::subsystem_tag, unsigned char> >, std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >, 2>::~interned_key_vec (this=0x1555205d1ac0, __in_chrg=<optimized out>) at /builddir/build/BUILD/flux-sched-0.38.0/src/common/libintern/interner.hpp:323
#32 Flux::resource_model::scoring_api_t::~scoring_api_t (this=0x1555205d1ac0, __in_chrg=<optimized out>) at /builddir/build/BUILD/flux-sched-0.38.0/resource/evaluators/scoring_api.cpp:35
#33 0x00001555400f66fd in Flux::resource_model::detail::dfu_impl_t::dom_dfv (this=<optimized out>, meta=..., u=<optimized out>, resources=..., pristine=<optimized out>, excl=0x1555205d1beb, to_parent=...) at /builddir/build/BUILD/flux-sched-0.38.0/resource/traversers/dfu_impl.cpp:765
#34 0x00001555400f7206 in Flux::resource_model::detail::dfu_impl_t::select (this=this@entry=0x1554f8004e80, j=..., root=root@entry=0, meta=..., excl=excl@entry=false) at /builddir/build/BUILD/flux-sched-0.38.0/resource/traversers/dfu_impl.cpp:1165
#35 0x00001555400ea9ff in Flux::resource_model::dfu_traverser_t::schedule (this=0x1554f8004e80, jobspec=..., meta=..., x=<optimized out>, op=MATCH_ALLOCATE_W_SATISFIABILITY, root=0, dfv=std::unordered_map with 2 elements = {...}) at /builddir/build/BUILD/flux-sched-0.38.0/resource/traversers/dfu.cpp:157
#36 0x00001555400eb4be in Flux::resource_model::dfu_traverser_t::run (this=this@entry=0x1554f8004e80, jobspec=..., writers=std::shared_ptr<Flux::resource_model::match_writers_t> (use count 1, weak count 0) = {...}, op=op@entry=MATCH_ALLOCATE_W_SATISFIABILITY, jobid=jobid@entry=554956750848, at=at@entry=0x1555205d2278) at /builddir/build/BUILD/flux-sched-0.38.0/resource/traversers/dfu.cpp:380
#37 0x00001555400c055b in run (ctx=std::shared_ptr<resource_ctx_t> (use count 2, weak count 0) = {...}, jobid=554956750848, cmd=0x1554f801d8d0 "allocate_with_satisfiability", jstr=..., at=0x1555205d2278, errp=0x0) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:1665
#38 0x00001555400c2265 in run_match (ctx=std::shared_ptr<resource_ctx_t> (use count 2, weak count 0) = {...}, jobid=554956750848, cmd=0x1554f801d8d0 "allocate_with_satisfiability", jstr="{\"resources\":[{\"type\":\"node\",\"count\":4,\"exclusive\":true,\"with\":[{\"type\":\"slot\",\"count\":4,\"with\":[{\"type\":\"core\",\"count\":1},{\"type\":\"gpu\",\"count\":1}],\"label\":\"task\"}]}],\"tasks\":[{\"command\":[\"../../../."..., now=now@entry=0x1555205d2280, at=at@entry=0x1555205d2278, overhead=0x1555205d2288, o=..., errp=0x0) at /builddir/build/BUILD/flux-sched-0.38.0/resource/modules/resource_match.cpp:1749
#39 0x00001555400cb5e4 in match_multi_request_cb (h=0x1554f8000be0, w=<optimized out>, msg=0x1554f0009c80, arg=<optimized out>) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/new_allocator.h:80
#40 0x00001555550c261a in call_handler (mh=0x1554f80055f0, msg=msg@entry=0x1554f0009c80) at msg_handler.c:344
#41 0x00001555550c2c4b in dispatch_message (type=1, msg=0x1554f0009c80, d=0x1554f80031f0) at msg_handler.c:380
#42 handle_cb (r=0x1554f8002800, hw=<optimized out>, revents=<optimized out>, arg=0x1554f80031f0) at msg_handler.c:481
#43 0x00001555550eb903 in ev_invoke_pending (loop=0x1554f8002820) at ev.c:3770
#44 0x00001555550ef9a8 in ev_run (flags=0, loop=0x1554f8002820) at ev.c:4190
#45 ev_run (loop=0x1554f8002820, flags=0) at ev.c:4021
#46 0x00001555550c19ef in flux_reactor_run (r=0x1554f8002800, flags=<optimized out>) at reactor.c:124
#47 0x00001555400cc793 in mod_main (h=0x1554f8000be0, argc=<optimized out>, argv=<optimized out>) at /builddir/build/BUILD/flux-sched-0.38.0/resource/modules/resource_match.cpp:3021
#48 0x000000000041170f in module_thread (arg=0x755a20) at module.c:225
#49 0x0000155554e861ca in start_thread () from /lib64/libpthread.so.0
#50 0x000015555371e8d3 in clone () from /lib64/libc.so.6
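
For readers less familiar with this pattern: frames #3 through #15 show a handle destructor (`~interned_string`) dropping a `shared_ptr` whose custom deleter (`remove_rc`) erases an entry from an `unordered_map`, which in turn hashes the key. Below is a minimal, single-threaded sketch of that general technique. Every name in it (`rc_interner`, `get`, `size`) is hypothetical, and it is not the libintern code; it only illustrates why destroying the last handle reaches the hash function, and it makes no claim about the root cause of the crash.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical sketch of reference-counted string interning.
// NOT the flux-sched implementation; names and layout are assumptions.
class rc_interner {
public:
    using handle = std::shared_ptr<const std::string>;

    rc_interner() = default;
    rc_interner(const rc_interner &) = delete;             // deleters capture `this`
    rc_interner &operator=(const rc_interner &) = delete;

    // Return a shared handle for s, creating a table entry on first use.
    handle get(const std::string &s) {
        auto [it, inserted] = table_.try_emplace(s);
        if (!inserted) {
            if (handle live = it->second.lock())
                return live;  // reuse the existing interned string
        }
        // The handle points at the key stored in the map node. When the last
        // handle goes away, the deleter erases that node, which hashes the key.
        handle h(&it->first, [this](const std::string *p) {
            assert(p != nullptr);
            std::string key = *p;  // copy first: p points into the node being erased
            table_.erase(key);
        });
        it->second = h;
        return h;
    }

    std::size_t size() const { return table_.size(); }

private:
    // Not thread-safe, and the interner must outlive every handle it issues.
    std::unordered_map<std::string, std::weak_ptr<const std::string>> table_;
};
```

In this sketch, two calls to `get("core")` return handles to the same string, and the table entry disappears once both handles are destroyed. The cleanup step depends on the key and the table both being valid at the moment the last handle is released.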
garlick commented 2 months ago

@milroy - have you had a chance to work on this? Is @trws offline for a while?

It would be nice to have a fix for this in our next release.

Edit: er, I think I probably understated that one. See @grondo's urgency note above.

garlick commented 2 months ago

I'm not sure @milroy is around this week. @jameshcorbett would you be available to help run this down?

jameshcorbett commented 2 months ago

I can try. Alternatively, Tom said he should be back Friday.

trws commented 2 months ago

The PR I already have up to change the ref counted strings to dense should fix this. I need to change resource query to use that code path so we get enough test coverage, but that’s probably the best way to get this dealt with.

Apologies for not getting on this sooner, we're under the gun to get a bunch of stuff done before the end of the meetings out here in Perth because of the upcoming major spec release. Will try to get that PR shaped up in the coming few hours.
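
To illustrate why a dense backend would rule out this failure mode, here is a rough sketch of dense interning in general. It is an assumption-laden illustration, not the code from the PR: `dense_interner`, `intern`, and `str` are hypothetical names, and the 8-bit id is chosen only because frame #31 of the backtrace shows a `dense_storage<..., unsigned char>` instantiation for subsystems. The point is that ids are plain integers and entries are never erased, so dropping a handle runs no cleanup against the table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <limits>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical sketch of "dense" interning. NOT the flux-sched implementation.
// Each distinct string gets a small integer id that stays valid for the
// lifetime of the interner; nothing is removed when a handle goes away.
class dense_interner {
public:
    using id_type = std::uint8_t;  // assumption: only a few distinct strings

    // Return the id for s, assigning the next free id on first use.
    id_type intern(const std::string &s) {
        auto it = ids_.find(s);
        if (it != ids_.end())
            return it->second;
        if (strings_.size() > std::numeric_limits<id_type>::max())
            throw std::length_error("dense_interner: id space exhausted");
        auto id = static_cast<id_type>(strings_.size());
        strings_.push_back(s);
        ids_.emplace(s, id);
        return id;
    }

    // Look up the string for an id; throws std::out_of_range on a bad id.
    const std::string &str(id_type id) const { return strings_.at(id); }

    std::size_t size() const { return strings_.size(); }

private:
    std::deque<std::string> strings_;  // deque keeps references stable on push_back
    std::unordered_map<std::string, id_type> ids_;
};
```

The trade-off in this sketch is that the table only grows, which is usually acceptable when the set of distinct strings (for example resource type names) is small and bounded.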


grondo commented 1 month ago

I think the rank 0 broker on tuolumne hit this issue and crashed last night (backtrace included below in case it is helpful).

If there's any hint on how to prevent the problem, elcap/tuolumne users are interested...

#0  0x00007ffff516ac30 in std::_Hash_bytes(void const*, unsigned long, unsigned long) () from /lib64/libstdc++.so.6
#1  0x00007ffff7fe6e45 in std::_Hash_impl::hash (__seed=3339675911, 
    __clength=<optimized out>, __ptr=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/functional_hash.h:206
#2  std::hash<std::basic_string_view<char, std::char_traits<char> > >::operator() (__str=..., this=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/string_view:698
#3  intern::detail::string_hash::operator() (str=..., this=0x948a10)
    at /builddir/build/BUILD/flux-sched-0.38.0/src/common/libintern/interner.cpp:40
#4  std::__detail::_Hash_code_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, intern::detail::sparse_string_node>, std::__detail::_Select1st, intern::detail::string_hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, true>::_M_hash_code (
    __k=..., this=0x948a10)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/hashtable_policy.h:1270
#5  std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, intern::detail::sparse_string_node>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, intern::detail::sparse_string_node> >, std::__detail::_Select1st, std::equal_to<void>, intern::detail::string_hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_erase (
    __k=..., this=0x948a10)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/hashtable.h:2360
#6  std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, intern::detail::sparse_string_node>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, intern::detail::sparse_string_node> >, std::__detail::_Select1st, std::equal_to<void>, intern::detail::string_hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::erase (
    __k=..., this=0x948a10)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/hashtable.h:973
#7  std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, intern::detail::sparse_string_node, intern::detail::string_hash, std::equal_to<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, intern::detail::sparse_string_node> > >::erase (__x=..., this=0x948a10)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/unordered_map.h:763
#8  intern::detail::remove_rc (storage=0x948a00, s=<optimized out>)
    at /builddir/build/BUILD/flux-sched-0.38.0/src/common/libintern/interner.cpp:93
#9  0x00007ffff7fe8652 in std::_Sp_counted_deleter<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*), std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (
    this=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:526
#10 0x00007ffff7ee2b76 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7fff7307c490)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:354
#11 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (
    this=0x7fff7307c490)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:317
#12 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (
    this=0x7fff73c4e760, __in_chrg=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:1071
#13 std::__shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (
    this=0x7fff73c4e758, __in_chrg=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:1524
#14 std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>::~shared_ptr (this=0x7fff73c4e758, 
    __in_chrg=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr.h:175
#15 intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >::~interned_string (this=0x7fff73c4e758, __in_chrg=<optimized out>)
    at /builddir/build/BUILD/flux-sched-0.38.0/src/common/libintern/interner.hpp:162
#16 Flux::resource_model::detail::evals_t::~evals_t (this=0x7fff73c4e730, 
    __in_chrg=<optimized out>)
    at /builddir/build/BUILD/flux-sched-0.38.0/resource/evaluators/edge_eval_api.cpp:95
#17 0x00007ffff7ede427 in std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t>::~pair (this=0x7fff73c4e720, __in_chrg=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_pair.h:185
#18 std::destroy_at<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > (__location=0x7fff73c4e720)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_construct.h:88
#19 std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > >::destroy<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > (__p=0x7fff73c4e720, __a=...)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/alloc_traits.h:537
#20 std::_Rb_tree<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t>, std::_Select1st<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> >, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >::_M_destroy_node (__p=0x7fff73c4e700, 
    this=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_tree.h:623
#21 std::_Rb_tree<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t>, std::_Select1st<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> >, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >::_M_drop_node (this=<optimized out>, 
    __p=0x7fff73c4e700)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_tree.h:631
#22 std::_Rb_tree<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t>, std::_Select1st<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> >, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >::_M_erase (this=<optimized out>, 
    __x=0x7fff73c4e700)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_tree.h:1937
#23 std::_Rb_tree<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t>, std::_Select1st<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> >, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >::~_Rb_tree (this=0x7fff93ffdad8, 
    __in_chrg=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_tree.h:984
#24 std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >::~map (this=0x7fff93ffdad8, __in_chrg=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_map.h:312
#25 boost::container::allocator_traits<boost::container::small_vector_allocator<boost::container::new_allocator<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > > >::priv_destroy<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > (p=0x7fff93ffdad8)
    at /usr/include/boost/container/allocator_traits.hpp:394
#26 boost::container::allocator_traits<boost::container::small_vector_allocator<boost::container::new_allocator<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > > >::destroy<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > (p=0x7fff93ffdad8, a=...)
    at /usr/include/boost/container/allocator_traits.hpp:322
#27 boost::container::destroy_alloc_n<boost::container::small_vector_allocator<boost::container::new_allocator<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > >, std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >*, unsigned long> (n=0, f=0x7fff93ffdad8, a=...)
    at /usr/include/boost/container/detail/copy_move_algo.hpp:960
#28 boost::container::vector<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >, boost::container::small_vector_allocator<boost::container::new_allocator<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > > >::~vector (this=0x7fff93ffdac0, __in_chrg=<optimized out>)
    at /usr/include/boost/container/vector.hpp:1018
#29 boost::container::small_vector_base<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >, boost::container::new_allocator<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > >::~small_vector_base (this=0x7fff93ffdac0, __in_chrg=<optimized out>)
    at /usr/include/boost/container/small_vector.hpp:323
#30 boost::container::small_vector<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >, 2ul, boost::container::new_allocator<std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > > > >::~small_vector (this=0x7fff93ffdac0, __in_chrg=<optimized out>)
    at /usr/include/boost/container/small_vector.hpp:473
#31 intern::interned_key_vec<intern::interned_string<intern::dense_storage<Flux::resource_model::subsystem_tag, unsigned char> >, std::map<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> >, Flux::resource_model::detail::evals_t, std::less<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > >, std::allocator<std::pair<intern::interned_string<intern::rc_storage<Flux::resource_model::resource_type_tag> > const, Flux::resource_model::detail::evals_t> > >, 2>::~interned_key_vec (
    this=0x7fff93ffdac0, __in_chrg=<optimized out>)
    at /builddir/build/BUILD/flux-sched-0.38.0/src/common/libintern/interner.hpp:323
#32 Flux::resource_model::scoring_api_t::~scoring_api_t (this=0x7fff93ffdac0, 
    __in_chrg=<optimized out>)
    at /builddir/build/BUILD/flux-sched-0.38.0/resource/evaluators/scoring_api.cpp:35
#33 0x00007ffff7eb76fd in Flux::resource_model::detail::dfu_impl_t::dom_dfv (
    this=<optimized out>, meta=..., u=<optimized out>, resources=..., 
    pristine=<optimized out>, excl=0x7fff93ffdbeb, to_parent=...)
    at /builddir/build/BUILD/flux-sched-0.38.0/resource/traversers/dfu_impl.cpp:765

#34 0x00007ffff7eb8206 in Flux::resource_model::detail::dfu_impl_t::select (
    this=this@entry=0x7fff72b4d290, j=..., root=root@entry=0, meta=..., 
    excl=excl@entry=false)
    at /builddir/build/BUILD/flux-sched-0.38.0/resource/traversers/dfu_impl.cpp:1165
#35 0x00007ffff7eabb96 in Flux::resource_model::dfu_traverser_t::schedule (
    this=0x7fff72b4d290, jobspec=..., meta=..., x=<optimized out>, 
    op=<optimized out>, root=0, dfv=std::unordered_map with 2 elements = {...})
    at /builddir/build/BUILD/flux-sched-0.38.0/resource/traversers/dfu.cpp:195
#36 0x00007ffff7eac4be in Flux::resource_model::dfu_traverser_t::run (
    this=this@entry=0x7fff72b4d290, jobspec=..., writers=
    std::shared_ptr<Flux::resource_model::match_writers_t> (use count 1, weak count 0) = {...}, op=op@entry=MATCH_ALLOCATE_ORELSE_RESERVE, 
    jobid=jobid@entry=61557454135100416, at=at@entry=0x7fff93ffe278)
    at /builddir/build/BUILD/flux-sched-0.38.0/resource/traversers/dfu.cpp:380
#37 0x00007ffff7e8165b in run (
    ctx=std::shared_ptr<resource_ctx_t> (use count 2, weak count 0) = {...}, 
    jobid=61557454135100416, cmd=0x7fff73c4f010 "allocate_orelse_reserve", 
    jstr=..., at=0x7fff93ffe278, errp=0x0)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:1665
#38 0x00007ffff7e83265 in run_match (
    ctx=std::shared_ptr<resource_ctx_t> (use count 2, weak count 0) = {...}, 
    jobid=61557454135100416, cmd=0x7fff73c4f010 "allocate_orelse_reserve", 
    jstr="{\"resources\":[{\"type\":\"node\",\"count\":32,\"exclusive\":true,\"with\":[{\"type\":\"slot\",\"count\":1,\"with\":[{\"type\":\"core\",\"count\":1}],\"label\":\"task\"}]}],\"tasks\":[{\"command\":[\"flux\",\"broker\",\"{{tmpdir}}/script\""..., now=now@entry=0x7fff93ffe280, at=at@entry=0x7fff93ffe278, 
    overhead=0x7fff93ffe288, o=..., errp=0x0)
    at /builddir/build/BUILD/flux-sched-0.38.0/resource/modules/resource_match.cpp:1749
#39 0x00007ffff7e8c5e4 in match_multi_request_cb (h=0x7fff72b56d60, 
    w=<optimized out>, msg=0x7fff8038a4b0, arg=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/new_allocator.h:80
#40 0x00007ffff7b6b61a in call_handler (mh=0x7fff72b70810, 
    msg=msg@entry=0x7fff8038a4b0) at msg_handler.c:344
#41 0x00007ffff7b6bc4b in dispatch_message (type=1, msg=0x7fff8038a4b0, 
    d=0x7fff72b9e640) at msg_handler.c:380
#42 handle_cb (r=0x7fff72b71240, hw=<optimized out>, revents=<optimized out>, 
    arg=0x7fff72b9e640) at msg_handler.c:481
#43 0x00007ffff7b94903 in ev_invoke_pending (loop=0x7fff65c53530) at ev.c:3770
#44 0x00007ffff7b989a8 in ev_run (flags=0, loop=0x7fff65c53530) at ev.c:4190
#45 ev_run (loop=0x7fff65c53530, flags=0) at ev.c:4021
#46 0x00007ffff7b6a9ef in flux_reactor_run (r=0x7fff72b71240, 
    flags=<optimized out>) at reactor.c:124
#47 0x00007ffff7e8d793 in mod_main (h=0x7fff72b56d60, argc=<optimized out>, 
    argv=<optimized out>)
    at /builddir/build/BUILD/flux-sched-0.38.0/resource/modules/resource_match.cpp:3021
#48 0x000000000041170f in module_thread (arg=0x2ed75b0) at module.c:225
#49 0x00007ffff792f1ca in start_thread () from /lib64/libpthread.so.0
#50 0x00007ffff61c78d3 in clone () from /lib64/libc.so.6
trws commented 1 month ago

We should get the PR out ASAP. There must be some resource type string that isn’t held persistently, and some rare race that allows it to be reaped and then hashed. If we could figure out which resource type that is, we could create a jobspec that can’t run (or something similar) and just keep it around as a workaround, but without knowing that, getting the PR out is the best bet.
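
For readers following the backtrace, here is a minimal sketch of the mechanism being described, assuming a reference-counted interner whose last handle removes its entry from a shared hash table. Every name in it (`rc_interner`, `get`, `size`, `entries_after_release`) is hypothetical and this is not the libintern code. It only illustrates why dropping the last reference ends in a hash of the string, which is the step at the top of the backtrace.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical, simplified reference-counted interner (not the libintern code).
class rc_interner {
    // One entry per live string; the weak_ptr lets get() return the
    // existing handle without keeping the string alive by itself.
    std::unordered_map<std::string, std::weak_ptr<const std::string>> table_;

  public:
    // Return the shared handle for s, creating the entry on first use.
    std::shared_ptr<const std::string> get(const std::string &s)
    {
        auto it = table_.find(s);
        if (it != table_.end()) {
            if (auto live = it->second.lock())
                return live;
        }
        // The deleter runs when the last handle goes away. It erases the
        // table entry by value, which hashes *p, and only then frees p.
        std::shared_ptr<const std::string> handle(
            new std::string(s), [this](const std::string *p) {
                table_.erase(*p);
                delete p;
            });
        table_[s] = handle;
        return handle;
    }

    std::size_t size() const { return table_.size(); }
};

// Interns "core" twice, drops both handles, and reports what is left.
std::size_t entries_after_release()
{
    rc_interner pool;  // must outlive every handle it gives out
    {
        auto a = pool.get("core");
        auto b = pool.get("core");
        assert(a == b);            // same interned string
        assert(pool.size() == 1);  // one entry while handles are alive
    }  // last handle released here: deleter -> erase -> hash of the string
    return pool.size();
}
```

In this single-threaded sketch the string is always valid when the deleter hashes it. If something else had already reclaimed the string by that point, the same erase call would hash invalid memory, which would be consistent with a crash inside the hash function.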



From: Mark Grondona
Sent: Friday, September 27, 2024 11:55:43 PM
Subject: Re: [flux-framework/flux-sched] fluxion crash in 10% of user runs on elcap (Issue #1298)

I think the rank 0 broker on tuolumne hit this issue and crashed last night: (backtrace included in case it is helpful)

If there's any hint on how to prevent the problem, elcap/tuolumne users are interested...

