Open grondo opened 3 years ago
flux-module: sched-fluxion-resource.stats.get: Function not implemented
I wonder if the failure has something to do with this.
Ah... nevermind. I think this IS correct.
flux-module: flux_open: Connection reset by peer flux: flux_open: No such file or directory
@grondo: does it mean the test flux instance somehow exits, which then led to a series of failures?
Do you see any sign of flux instance SEGV etc (e.g., corefiles?)
Yes, that indicates the broker crashed. It is very difficult to get debug output from the buildfarm builds since the crash is occurring inside an rpmbuild running in a mock chroot. However, I can try to get further information.
I'm assuming you've tried make check on a local ppc64le machine and it passes? So these failures are environmental?
No I haven't tried this. Unfortunately, I won't be able to get to this because my ISCP review due date is COB today.
For the conf test failure:
not ok 15 - qmanager: load must fail on a bad value
FAIL: t1005-qmanager-conf.t 15 - qmanager: load must fail on a bad value
#
# conf_name="13-bad-value" &&
# outfile=${conf_name}.out &&
# test_must_fail start_qmanager ${conf_base}/${conf_name} >${outfile}
#
FWIW, the conf TOML file is:
#
# Configuration for the qmanager module
#
[sched-fluxion-qmanager]
queue-policy = "easy" # queueing policy type
# general queue parameters
# max queue depth (applied to all policies)
# queue-depth (applied to all policies)
queue-params = "max-queue-depth=foo,queue-depth=8192"
# queue policy parameters
# max depth for "conservative" and "hybrid"
# reservation depth for HYBRID
policy-params = "max-reservation-depth=100000,reservation-depth=64"
Hmmm I really don't understand why this would fail on this platform while working on other platforms, unless there is some sort of race condition...
Maybe the same "crash" problem....
If I can complete my review sooner, I will try a manual make check on CORAL2 today. But at the current rate I'm going, I doubt...
Ok I was able to get a core backtrace out of the buildfarm:
Thread 1 (Thread 0x3fff6f7eedc0 (LWP 56060)):
#0 0x00003fffa22a43d8 in raise () from /usr/lib64/libc.so.6
#1 0x00003fffa228448c in abort () from /usr/lib64/libc.so.6
#2 0x00003fffa1e6cfb0 in _Unwind_Resume () from /usr/lib64/libgcc_s.so.1
#3 0x00003fffa13abe1c in boost::property_tree::basic_ptree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::get_child(boost::property_tree::string_path<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::property_tree::id_translator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /usr/lib64/libboost_graph.so.1.66.0
#4 0x00003fffa13a348c in boost::read_graphml(std::istream&, boost::mutate_graph&, unsigned long) () from /usr/lib64/libboost_graph.so.1.66.0
#5 0x00003fffa15ccbbc in boost::read_graphml<boost::adjacency_list<boost::vecS, boost::vecS, boost::directedS, Flux::resource_model::resource_pool_gen_t, Flux::resource_model::relation_gen_t, boost::no_property, boost::listS> > (
desired_idx=0, dp=..., g=..., in=...)
at /usr/include/boost/graph/graphml.hpp:211
#6 Flux::resource_model::resource_gen_spec_t::read_graphml (
this=<optimized out>, in=...) at readers/resource_spec_grug.cpp:172
#7 0x00003fffa1602420 in Flux::resource_model::resource_reader_grug_t::unpack
(this=0x3fff640028f0, g=..., m=...,
str="<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE topology SYSTEM \"hwloc.dtd\">\n<topology>\n <object type=\"Machine\" os_index=\"0\" cpuset=\"0x0000000f\" complete_cpuset=\"0x0000000f\" online_cpuset=\"0x0000000"...,
rank=<optimized out>) at readers/resource_reader_grug.cpp:442
#8 0x00003fffa1632360 in Flux::resource_model::resource_graph_db_t::load (
this=<optimized out>, str=..., reader=..., rank=<optimized out>)
at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#9 0x00003fffa1580afc in populate_resource_db_file (
ctx=std::shared_ptr<resource_ctx_t> (use count 1, weak count 0) = {...})
at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#10 0x00003fffa1589d98 in populate_resource_db (
ctx=std::shared_ptr<resource_ctx_t> (use count 1, weak count 0) = {...})
at resource_match.cpp:1220
#11 init_resource_graph (
ctx=std::shared_ptr<resource_ctx_t> (use count 1, weak count 0) = {...})
at resource_match.cpp:1322
#12 0x00003fffa158a87c in mod_main (h=0x3fff6403fd70, argc=<optimized out>,
argv=<optimized out>) at resource_match.cpp:2364
#13 0x000000011f29fee4 in module_thread (arg=0x100229cce20) at module.c:214
#14 0x00003fffa26a8878 in start_thread () from /usr/lib64/libpthread.so.0
#15 0x00003fffa23932c8 in clone () from /usr/lib64/libc.so.6
And maybe a different one:
Thread 1 (Thread 0x3fff55b2edc0 (LWP 46250)):
#0 0x00003fff80c343d8 in raise () from /usr/lib64/libc.so.6
#1 0x00003fff80c1448c in abort () from /usr/lib64/libc.so.6
#2 0x00003fff807fcfb0 in _Unwind_Resume () from /usr/lib64/libgcc_s.so.1
#3 0x00003fff55b5677c in __gnu_cxx::__stoa<long, int, char, int>(long (*)(char const*, char**, int), char const*, char const*, unsigned long*, int)::_Save_errno::~_Save_errno() (this=<optimized out>, __in_chrg=<optimized out>)
at /usr/include/c++/8/ext/string_conversions.h:64
#4 __gnu_cxx::__stoa<long, int, char, int> (__convf=0x3fff80c3ac50 <strtoq>,
__name=0x3fff55b7f3e0 "stoi", __str=0x3fff4400eee8 "foo", __idx=0x0)
at /usr/include/c++/8/ext/string_conversions.h:66
#5 0x00003fff55b508c4 in std::__cxx11::stoi (__base=10, __idx=0x0,
__str="foo") at /usr/include/c++/8/bits/basic_string.h:6411
#6 Flux::queue_manager::queue_policy_base_t::apply_params (
this=0x3fff4400ebf0)
at ../../qmanager/policies/base/queue_policy_base_impl.hpp:114
#7 0x00003fff55b5aa28 in Flux::queue_manager::detail::queue_policy_easy_t<Flux::resource_model::detail::reapi_module_t>::apply_params (this=<optimized out>)
at ../../qmanager/policies/queue_policy_easy_impl.hpp:41
#8 0x00003fff55b53230 in enforce_params (prop=..., Python Exception <class 'OverflowError'> int too big to convert:
queue_name=,
ctx=std::shared_ptr<qmanager_ctx_t> (use count 1, weak count 0) = {...})
at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#9 enforce_queues (
ctx=std::shared_ptr<qmanager_ctx_t> (use count 1, weak count 0) = {...})
at qmanager.cpp:356
#10 enforce_options (
ctx=std::shared_ptr<qmanager_ctx_t> (use count 1, weak count 0) = {...})
at qmanager.cpp:380
#11 mod_start (h=0x3fff440060c0, argc=<optimized out>, argv=<optimized out>)
at qmanager.cpp:488
#12 0x00003fff55b545d8 in Flux::cplusplus_wrappers::eh_wrapper_t::operator()<int (*)(flux_handle_struct*, int, char**), flux_handle_struct*&, int&, char**&> (
f=0x3fff55b52b70 <mod_start(flux_handle_struct*, int, char**)>,
this=0x3fff55b2d247) at /usr/include/bits/string_fortified.h:71
#13 mod_main (h=0x3fff440060c0, argc=<optimized out>, argv=0x3fff44008b90)
at qmanager.cpp:521
#14 0x000000010164fee4 in module_thread (arg=0x100052f7800) at module.c:214
#15 0x00003fff81038878 in start_thread () from /usr/lib64/libpthread.so.0
#16 0x00003fff80d232c8 in clone () from /usr/lib64/libc.so.6
The first one dies inside the Boost Graph Library, starting at:
at /usr/include/boost/graph/graphml.hpp:211
It could be an ABI compatibility issue between the C++ compiler used to produce /usr/lib64/libboost_graph.so.1.66.0 and the one used to build fluxion. @grondo: do you know what compiler was used to compile fluxion on the build farm?
This needs a bit more thought.
The second one crashes at:
std::__cxx11::stoi
This means std::stoi in this compiler does not tolerate a non-digit string (foo). This is somewhat surprising in that it actually raises a signal instead of throwing an exception such as std::invalid_argument. Almost smells like a compiler bug. We will need to work around this by patching our code to check isdigit first. I can work on this once my review is done.
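For illustration, the isdigit-first idea could look roughly like the sketch below. This is only a sketch under assumptions: is_unsigned_decimal and parse_depth are hypothetical names, not functions from flux-sched, and this is not necessarily what the eventual patch does. The point is that the string is validated before std::stoi is called, so the code no longer depends on an exception being thrown and caught.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not from flux-sched): true only if `s` is a
// non-empty run of decimal digits.
bool is_unsigned_decimal (const std::string &s)
{
    if (s.empty ())
        return false;
    for (unsigned char c : s) {
        if (!std::isdigit (c))
            return false;
    }
    return true;
}

// Hypothetical wrapper: parse a queue-depth style value into `out`.
// Returns false (leaving `out` untouched) for input such as "foo".
bool parse_depth (const std::string &s, int &out)
{
    if (!is_unsigned_decimal (s))
        return false;
    // At most 9 digits always fits in a 32-bit int, so std::stoi cannot
    // throw std::out_of_range for anything that reaches the call below.
    if (s.size () > 9)
        return false;
    out = std::stoi (s);
    return true;
}
```

With a check like this, a bad value such as max-queue-depth=foo would be rejected with an ordinary error return, and the caller could fail the module load without any exception unwinding.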
Maybe some of the compiler flags are introducing these extra assertions?
We're using the default GCC in the buildfarm and I doubt there is more than one compiler version installed.
flux-sched version 0.15.0
Prefix...........: /usr
Debug Build......:
C Compiler.......: gcc
C++ Compiler.....: g++
CFLAGS...........: -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mcpu=power8 -mtune=power8 -funwind-tables -fstack-clash-protection
CPPFLAGS..........
CXXFLAGS.......... -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mcpu=power8 -mtune=power8 -funwind-tables -fstack-clash-protection
FLUX.............: /usr/bin/flux
FLUX_VERSION.....: 0.23.1
FLUX_CORE_CFLAGS.:
FLUX_CORE_LIBS...: -lflux-core
LIBFLUX_VERSION..: 0.23.1
FLUX_PREFIX......: /usr
LDFLAGS..........: -Wl,-z,relro -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld
LIBS.............:
Linker...........: /usr/bin/ld -m elf64ppc
Yeah good theory -- This probably is what it is.
It would be good to run make check under a more restricted environment though.
We can easily work around the second issue. Not a problem.
I will need to look more at the first problem to see whether we can work around it at our code level.
Before I forget. We may want to create a testing instance with these checks in our CI. BTW, did this show up in our x86_64 build farm?
No the x86_64 builder is succeeding
Probably, slight compiler/runtime/architecture differences are causing this, then.
I now believe the first issue has the same root cause as the second then. The first test that causes Flux to crash is t4000-match-params.t
test_expect_success 'loading resource module with incorrect reader fails' '
unload_resource &&
flux dmesg -C &&
load_resource load-file=${xml} load-format=grug \
prune-filters=ALL:core &&
test_must_fail flux module stats sched-fluxion-resource &&
flux dmesg > error3 &&
grep -i "Invalid argument" error3
'
Apparently, additional compiler debug checks in the build farm are causing boost::read_graphml to crash in the C/C++ standard library when an hwloc-generated XML string is passed instead of GraphML format.
Also, FYI, the valgrind test is failing on ppc64le as well.
I just realized I have make check disabled for flux-core RPMs in toss4 as well. I'm beginning to think there was a reason...
It should be relatively easy to add a simple "graphML" format pre-check in our GRUG reader code before calling boost::read_graphml to avoid this error (too bad BGL doesn't do this at its level).
I can work on it along with the other problem.
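A rough sketch of what such a pre-check might look like is below. It is an assumption for discussion, not the actual fix: looks_like_graphml is a hypothetical name, and the check is a cheap heuristic (it only looks for a <graphml ...> element) rather than a real XML parse.

```cpp
#include <cassert>
#include <string>

// Hypothetical pre-check (not from flux-sched): cheaply reject input that
// has no <graphml ...> element before handing it to a GraphML parser.
// This is a heuristic, not an XML validator.
bool looks_like_graphml (const std::string &xml)
{
    const std::string tag = "<graphml";
    std::string::size_type pos = xml.find (tag);
    if (pos == std::string::npos)
        return false;
    // The tag name must end right after "<graphml": the next character
    // has to be whitespace, '>' or '/'.
    std::string::size_type next = pos + tag.size ();
    if (next >= xml.size ())
        return false;
    char c = xml[next];
    return c == '>' || c == '/' || c == ' ' || c == '\t' || c == '\n'
           || c == '\r';
}
```

The GRUG reader could call something like this on the input string and return an error (for example with errno set to EINVAL, which would match the "Invalid argument" the test greps for) when it returns false, so that hwloc XML never reaches boost::read_graphml.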
If you propose something I can relatively easily add patches to the rpm as a stopgap
Sure.
BTW, I was able to get TOSS3 RPMs built without this issue, so this was only a problem on RHEL8/TOSS4.
Good to know. A generic TOSS4 platform issue then. Thanks.
Just to let you know I haven't forgotten this. I have a fix for the atoi crash but read_graphml needs some thought. PR and patch likely by today though.
PR #809 is likely to work around the std::stoi crash.
But @grondo and I agreed that the fix for the boost::read_graphml crash requires access to an actual TOSS4 machine. We don't have a system yet.
Putting a hold on this issue until then.
@grondo: was this a problem this time?
I left the ppc64le build disabled. I can give it a try today.
On Wed, May 19, 2021, 9:35 PM Dong H. Ahn @.***> wrote:
@grondo https://github.com/grondo: was this a problem this time?
Thanks @grondo!
Still failing on ppc64le:
ok 4 - loading resource module with a nonexistent GRUG fails
PASS: t4000-match-params.t 4 - loading resource module with a nonexistent GRUG fails
flux-start: 0 (pid 32540) Aborted
ok 5 - loading resource module with a nonexistent XML fails
PASS: t4000-match-params.t 5 - loading resource module with a nonexistent XML fails
134
not ok 6 - loading resource module with incorrect reader fails
FAIL: t4000-match-params.t 6 - loading resource module with incorrect reader fails
#
# unload_resource &&
# flux dmesg -C &&
# load_resource load-file=${xml} load-format=grug \
# prune-filters=ALL:core &&
# test_must_fail flux module stats sched-fluxion-resource &&
# flux dmesg > error3 &&
# grep -i "Invalid argument" error3
#
not ok 7 - loading resource module with known policies works
FAIL: t4000-match-params.t 7 - loading resource module with known policies works
#
# unload_resource &&
# load_resource policy=high &&
# remove_resource &&
# load_resource policy=low &&
# remove_resource &&
# load_resource policy=locality
#
not ok 8 - loading resource module with unknown policies is tolerated
FAIL: t4000-match-params.t 8 - loading resource module with unknown policies is tolerated
#
# unload_resource &&
# load_resource policy=foo &&
# remove_resource &&
# load_resource policy=bar
#
not ok 9 - removing resource works
FAIL: t4000-match-params.t 9 - removing resource works
#
# remove_resource
#
# failed 4 among 9 test(s)
automated gdb backtrace:
Core was generated by `/usr/libexec/flux/cmd/flux-broker --setattr=rundir=/tmp/flux-iZZoeh -Slog-filen'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00003fffafa344d8 in raise () from /usr/lib64/libc.so.6
[Current thread is 1 (Thread 0x3fff6fffedc0 (LWP 33010))]
Thread 17 (Thread 0x3fffa88eedc0 (LWP 32565)):
#0 0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00003fffb013a864 in epoll_poll (loop=0x3fff8c006f60,
timeout=<optimized out>) at ev_epoll.c:155
#2 0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff8c006f60) at ev.c:3742
#3 ev_run (loop=0x3fff8c006f60, flags=<optimized out>) at ev.c:3623
#4 0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff8c001d20,
flags=<optimized out>) at reactor.c:126
#5 0x00003fffa8906008 in mod_main (h=0x3fff8c0061c0, argc=<optimized out>,
argv=<optimized out>) at resource.c:427
#6 0x00000001284707cc in module_thread (arg=0x100357713e0) at module.c:205
#7 0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 16 (Thread 0x3fffaa44edc0 (LWP 32557)):
#0 0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00003fffb013a864 in epoll_poll (loop=0x3fffa0006f60,
timeout=<optimized out>) at ev_epoll.c:155
#2 0x00003fffb013df14 in ev_run (flags=0, loop=0x3fffa0006f60) at ev.c:3742
#3 ev_run (loop=0x3fffa0006f60, flags=<optimized out>) at ev.c:3623
#4 0x00003fffb00f1b94 in flux_reactor_run (r=0x3fffa0001d20,
flags=<optimized out>) at reactor.c:126
#5 0x00003fffaa634f08 in mod_main (h=0x3fffa00061c0, argc=<optimized out>,
argv=<optimized out>) at content-sqlite.c:632
#6 0x00000001284707cc in module_thread (arg=0x100357652b0) at module.c:205
#7 0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 15 (Thread 0x3fff92e9edc0 (LWP 32626)):
#0 0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00003fffb013a864 in epoll_poll (loop=0x3fff84006f60,
timeout=<optimized out>) at ev_epoll.c:155
#2 0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff84006f60) at ev.c:3742
#3 ev_run (loop=0x3fff84006f60, flags=<optimized out>) at ev.c:3623
#4 0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff84001d20,
flags=<optimized out>) at reactor.c:126
#5 0x00003fff92eb51f8 in mod_main (h=0x3fff840061c0, argc=<optimized out>,
argv=<optimized out>) at job-info.c:146
#6 0x00000001284707cc in module_thread (arg=0x100357815d0) at module.c:205
#7 0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 14 (Thread 0x3fffa9a7edc0 (LWP 32560)):
#0 0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00003fffb013a864 in epoll_poll (loop=0x3fff94006f60,
timeout=<optimized out>) at ev_epoll.c:155
#2 0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff94006f60) at ev.c:3742
#3 ev_run (loop=0x3fff94006f60, flags=<optimized out>) at ev.c:3623
#4 0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff94001d20,
flags=<optimized out>) at reactor.c:126
#5 0x00003fffa9a9bf74 in mod_main (h=0x3fff940061c0, argc=<optimized out>,
argv=<optimized out>) at kvs.c:2908
#6 0x00000001284707cc in module_thread (arg=0x10035765610) at module.c:205
#7 0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 13 (Thread 0x3fff925dedc0 (LWP 32643)):
#0 0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00003fffb013a864 in epoll_poll (loop=0x3fff78006f60,
timeout=<optimized out>) at ev_epoll.c:155
#2 0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff78006f60) at ev.c:3742
#3 ev_run (loop=0x3fff78006f60, flags=<optimized out>) at ev.c:3623
#4 0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff78001d20,
flags=<optimized out>) at reactor.c:126
#5 0x00003fff925f53b4 in mod_main (h=0x3fff780061c0, argc=<optimized out>,
argv=<optimized out>) at job-list.c:155
#6 0x00000001284707cc in module_thread (arg=0x100357846b0) at module.c:205
#7 0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 12 (Thread 0x3fffabffedc0 (LWP 32543)):
#0 0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00003fffafee2100 in zmq::epoll_t::loop() [clone .part.11] ()
from /usr/lib64/libzmq.so.5
#2 0x00003fffaff10ec8 in zmq::worker_poller_base_t::worker_routine(void*) ()
from /usr/lib64/libzmq.so.5
#3 0x00003fffaff3c3d0 in thread_routine () from /usr/lib64/libzmq.so.5
#4 0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#5 0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 11 (Thread 0x3fffa919edc0 (LWP 32562)):
#0 0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00003fffb013a864 in epoll_poll (loop=0x3fff98006f60,
timeout=<optimized out>) at ev_epoll.c:155
#2 0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff98006f60) at ev.c:3742
#3 ev_run (loop=0x3fff98006f60, flags=<optimized out>) at ev.c:3623
#4 0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff98001d20,
flags=<optimized out>) at reactor.c:126
#5 0x00003fffa91f67e8 in mod_main (h=0x3fff980061c0, argc=<optimized out>,
argv=<optimized out>) at kvs-watch.c:1165
#6 0x00000001284707cc in module_thread (arg=0x1003576ca40) at module.c:205
#7 0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 10 (Thread 0x3fffaaebedc0 (LWP 32553)):
#0 0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00003fffb013a864 in epoll_poll (loop=0x3fff9c006f60,
timeout=<optimized out>) at ev_epoll.c:155
#2 0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff9c006f60) at ev.c:3742
#3 ev_run (loop=0x3fff9c006f60, flags=<optimized out>) at ev.c:3623
#4 0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff9c001d20,
flags=<optimized out>) at reactor.c:126
#5 0x00003fffaaed4228 in mod_main (h=0x3fff9c0061c0, argc=<optimized out>,
argv=<optimized out>) at barrier.c:499
#6 0x00000001284707cc in module_thread (arg=0x100357642f0) at module.c:205
#7 0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 9 (Thread 0x3fff91d3edc0 (LWP 32657)):
#0 0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00003fffb013a864 in epoll_poll (loop=0x3fff7c006f60,
timeout=<optimized out>) at ev_epoll.c:155
#2 0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff7c006f60) at ev.c:3742
#3 ev_run (loop=0x3fff7c006f60, flags=<optimized out>) at ev.c:3623
#4 0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff7c001d20,
flags=<optimized out>) at reactor.c:126
#5 0x00003fff91d577e0 in mod_main (h=0x3fff7c0061c0, argc=<optimized out>,
argv=<optimized out>) at job-ingest.c:1036
#6 0x00000001284707cc in module_thread (arg=0x10035786a40) at module.c:205
#7 0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 8 (Thread 0x3fff93ffedc0 (LWP 32605)):
#0 0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00003fffb013a864 in epoll_poll (loop=0x3fff88006f60,
timeout=<optimized out>) at ev_epoll.c:155
#2 0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff88006f60) at ev.c:3742
#3 ev_run (loop=0x3fff88006f60, flags=<optimized out>) at ev.c:3623
#4 0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff88001d20,
flags=<optimized out>) at reactor.c:126
#5 0x00003fffa8068c38 in mod_main (h=0x3fff880061c0, ac=<optimized out>,
av=<optimized out>) at cron.c:906
#6 0x00000001284707cc in module_thread (arg=0x100357715e0) at module.c:205
#7 0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 7 (Thread 0x3fff9148edc0 (LWP 32664)):
#0 0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00003fffb013a864 in epoll_poll (loop=0x3fff70006f60,
timeout=<optimized out>) at ev_epoll.c:155
#2 0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff70006f60) at ev.c:3742
#3 ev_run (loop=0x3fff70006f60, flags=<optimized out>) at ev.c:3623
#4 0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff70001d20,
flags=<optimized out>) at reactor.c:126
#5 0x00003fff914a7518 in mod_main (h=0x3fff700061c0, argc=<optimized out>,
argv=<optimized out>) at job-exec.c:1128
#6 0x00000001284707cc in module_thread (arg=0x10035789100) at module.c:205
#7 0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 6 (Thread 0x3fff9374edc0 (LWP 32612)):
#0 0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00003fffb013a864 in epoll_poll (loop=0x3fff80006f60,
timeout=<optimized out>) at ev_epoll.c:155
#2 0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff80006f60) at ev.c:3742
#3 ev_run (loop=0x3fff80006f60, flags=<optimized out>) at ev.c:3623
#4 0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff80001d20,
flags=<optimized out>) at reactor.c:126
#5 0x00003fff93766784 in mod_main (h=0x3fff800061c0, argc=<optimized out>,
argv=<optimized out>) at job-manager.c:205
#6 0x00000001284707cc in module_thread (arg=0x10035779170) at module.c:205
#7 0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 5 (Thread 0x3fffac80edc0 (LWP 32542)):
#0 0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00003fffafee2100 in zmq::epoll_t::loop() [clone .part.11] ()
from /usr/lib64/libzmq.so.5
#2 0x00003fffaff10ec8 in zmq::worker_poller_base_t::worker_routine(void*) ()
from /usr/lib64/libzmq.so.5
#3 0x00003fffaff3c3d0 in thread_routine () from /usr/lib64/libzmq.so.5
#4 0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#5 0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x3fff90bfedc0 (LWP 32668)):
#0 0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00003fffb013a864 in epoll_poll (loop=0x3fff74006f60,
timeout=<optimized out>) at ev_epoll.c:155
#2 0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff74006f60) at ev.c:3742
#3 ev_run (loop=0x3fff74006f60, flags=<optimized out>) at ev.c:3623
#4 0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff74001d20,
flags=<optimized out>) at reactor.c:126
#5 0x00003fff90c132e0 in mod_main (h=0x3fff740061c0, argc=<optimized out>,
argv=<optimized out>) at heartbeat.c:159
#6 0x00000001284707cc in module_thread (arg=0x1003578b6e0) at module.c:205
#7 0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x3fffb021d820 (LWP 32540)):
#0 0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00003fffb013a864 in epoll_poll (
loop=0x3fffb0194f10 <default_loop_struct>, timeout=<optimized out>)
at ev_epoll.c:155
#2 0x00003fffb013df14 in ev_run (flags=0,
loop=0x3fffb0194f10 <default_loop_struct>) at ev.c:3742
#3 ev_run (loop=0x3fffb0194f10 <default_loop_struct>, flags=<optimized out>)
at ev.c:3623
#4 0x00003fffb00f1b94 in flux_reactor_run (r=0x100357359e0,
flags=<optimized out>) at reactor.c:126
#5 0x000000012846c2b4 in main (argc=<optimized out>, argv=<optimized out>)
at broker.c:505
Thread 2 (Thread 0x3fffab7eedc0 (LWP 32544)):
#0 0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00003fffb013a864 in epoll_poll (loop=0x3fffa4007720,
timeout=<optimized out>) at ev_epoll.c:155
#2 0x00003fffb013df14 in ev_run (flags=0, loop=0x3fffa4007720) at ev.c:3742
#3 ev_run (loop=0x3fffa4007720, flags=<optimized out>) at ev.c:3623
#4 0x00003fffb00f1b94 in flux_reactor_run (r=0x3fffa4007700,
flags=<optimized out>) at reactor.c:126
#5 0x00003fffac82393c in mod_main (h=0x3fffa40068b0, argc=<optimized out>,
argv=<optimized out>) at local.c:328
#6 0x00000001284707cc in module_thread (arg=0x10035745630) at module.c:205
#7 0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8 0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x3fff6fffedc0 (LWP 33010)):
#0 0x00003fffafa344d8 in raise () from /usr/lib64/libc.so.6
#1 0x00003fffafa1462c in abort () from /usr/lib64/libc.so.6
#2 0x00003fffaf5fcfb0 in _Unwind_Resume () from /usr/lib64/libgcc_s.so.1
#3 0x00003fffaeb2be1c in boost::property_tree::basic_ptree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::get_child(boost::property_tree::string_path<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::property_tree::id_translator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /usr/lib64/libboost_graph.so.1.66.0
#4 0x00003fffaeb2348c in boost::read_graphml(std::istream&, boost::mutate_graph&, unsigned long) () from /usr/lib64/libboost_graph.so.1.66.0
#5 0x00003fffaed515dc in boost::read_graphml<boost::adjacency_list<boost::vecS, boost::vecS, boost::directedS, Flux::resource_model::resource_pool_gen_t, Flux::resource_model::relation_gen_t, boost::no_property, boost::listS> > (
desired_idx=0, dp=..., g=..., in=...)
at /usr/include/boost/graph/graphml.hpp:211
#6 Flux::resource_model::resource_gen_spec_t::read_graphml (
this=<optimized out>, in=...) at readers/resource_spec_grug.cpp:172
#7 0x00003fffaed87580 in Flux::resource_model::resource_reader_grug_t::unpack
(this=0x3fff6000f8f0, g=..., m=..., str=..., rank=<optimized out>)
at readers/resource_reader_grug.cpp:471
#8 0x00003fffaedb7f20 in Flux::resource_model::resource_graph_db_t::load (
this=<optimized out>, str=..., reader=..., rank=<optimized out>)
at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#9 0x00003fffaed03848 in populate_resource_db_file (
ctx=std::shared_ptr<resource_ctx_t> (use count 1, weak count 0) = {...})
at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#10 0x00003fffaed0cf98 in populate_resource_db (
ctx=std::shared_ptr<resource_ctx_t> (use count 1, weak count 0) = {...})
at resource_match.cpp:1226
#11 init_resource_graph (
ctx=std::shared_ptr<resource_ctx_t> (use count 1, weak count 0) = {...})
at resource_match.cpp:1328
#12 0x00003fffaed0daa0 in mod_main (h=0x3fff60001750, argc=<optimized out>,
argv=<optimized out>) at resource_match.cpp:2449
#13 0x00000001284707cc in module_thread (arg=0x10035759150) at module.c:205
#14 0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#15 0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
I tested ppc64le build while building RPMs for v0.17.0 and this issue still occurs.
Still an issue with 0.26.0. Backtrace:
Thread 1 (Thread 0x3fff7571edc0 (LWP 71532)):
#0 0x00003fffa0eaa498 in raise () from /lib64/libc.so.6
#1 0x00003fffa0e84a54 in abort () from /lib64/libc.so.6
#2 0x00003fffa09ccfb0 in _Unwind_Resume () from /lib64/libgcc_s.so.1
#3 0x00003fff9fdbbe3c in boost::property_tree::basic_ptree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::get_child(boost::property_tree::string_path<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::property_tree::id_translator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /lib64/libboost_graph.so.1.66.0
#4 0x00003fff9fdb34ac in boost::read_graphml(std::istream&, boost::mutate_graph&, unsigned long) () from /lib64/libboost_graph.so.1.66.0
#5 0x00003fff9ff4a86c in boost::read_graphml<boost::adjacency_list<boost::vecS, boost::vecS, boost::directedS, Flux::resource_model::resource_pool_gen_t, Flux::resource_model::relation_gen_t, boost::no_property, boost::listS> > (
desired_idx=0, dp=..., g=..., in=...)
at /usr/include/boost/graph/graphml.hpp:211
#6 Flux::resource_model::resource_gen_spec_t::read_graphml (
this=<optimized out>, in=...) at readers/resource_spec_grug.cpp:158
#7 0x00003fff9ff808d0 in Flux::resource_model::resource_reader_grug_t::unpack
(this=0x3fff5c0073c0, g=..., m=..., str=..., rank=<optimized out>)
at readers/resource_reader_grug.cpp:458
#8 0x00003fff9ffbcfe0 in Flux::resource_model::resource_graph_db_t::load (
this=<optimized out>, str=..., reader=..., rank=<optimized out>)
at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#9 0x00003fff9fef0fe4 in populate_resource_db_file (
ctx=std::shared_ptr<resource_ctx_t> (use count 1, weak count 0) = {...})
at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#10 0x00003fff9fefc948 in populate_resource_db (
ctx=std::shared_ptr<resource_ctx_t> (use count 1, weak count 0) = {...})
at resource_match.cpp:1309
#11 init_resource_graph (
ctx=std::shared_ptr<resource_ctx_t> (use count 1, weak count 0) = {...})
at resource_match.cpp:1413
#12 0x00003fff9fefd114 in mod_main (h=0x3fff5c0093b0, argc=<optimized out>,
argv=<optimized out>) at resource_match.cpp:2661
#13 0x00000000100140f4 in module_thread (arg=0x10014d8ca70) at module.c:183
#14 0x00003fffa14c9718 in start_thread () from /lib64/libpthread.so.0
#15 0x00003fffa0f9b498 in clone () from /lib64/libc.so.6
The post-mortem here is that somehow we have been ending up with the ppc64le build linking two unwinding implementations at the same time, both libstdc++ and libunwind, and they're slicing each other. I haven't explicitly checked if this is continuing, but best as I can tell it's an issue caused by the way one of our dependencies is being built.
Oh, yeah, sorry I never updated this issue! I think we'd need to rebuild zeromq not to include libunwind in TOSS 4 to work around this...
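For anyone wanting to check whether a given process ends up with both unwinders loaded, here is a rough sketch using glibc's dl_iterate_phdr. It is only an assumption about how one might look for this: loaded_objects and has_both_unwinders are hypothetical names, and the check simply matches the substrings libgcc_s.so (where _Unwind_Resume lives in the backtraces above) and libunwind in the paths of loaded shared objects.

```cpp
#include <cassert>
#include <cstddef>
#include <link.h>
#include <string>
#include <vector>

// Callback for dl_iterate_phdr: append each loaded object's path to the
// std::vector<std::string> passed through `data`.
static int collect_name (struct dl_phdr_info *info, size_t, void *data)
{
    auto *names = static_cast<std::vector<std::string> *> (data);
    if (info->dlpi_name && info->dlpi_name[0] != '\0')
        names->push_back (info->dlpi_name);
    return 0; // returning 0 continues the iteration
}

// Paths of the shared objects currently loaded in this process
// (Linux/glibc only).
std::vector<std::string> loaded_objects ()
{
    std::vector<std::string> names;
    dl_iterate_phdr (collect_name, &names);
    return names;
}

// Hypothetical check: true if both a libgcc_s and a libunwind shared
// object appear in the list. Matching is by substring only.
bool has_both_unwinders (const std::vector<std::string> &names)
{
    bool gcc_s = false;
    bool unwind = false;
    for (const auto &n : names) {
        if (n.find ("libgcc_s.so") != std::string::npos)
            gcc_s = true;
        if (n.find ("libunwind") != std::string::npos)
            unwind = true;
    }
    return gcc_s && unwind;
}
```

Calling has_both_unwinders (loaded_objects ()) from inside the affected process would report whether both libraries are mapped. It does not prove they are conflicting, only that the combination described above is present.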
The following tests are failing on ppc64le when trying to build an RPM.