flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

make check fails in ppc64le buildfarm builder #808

Open grondo opened 3 years ago

grondo commented 3 years ago

The following tests are failing on ppc64le when trying to build an RPM.

expecting success: 
    unload_resource &&
    load_resource load-file=${grug} \
load-format=grug prune-filters=ALL:core
ok 1 - loading resource module with a tiny machine GRUG works
loading resource module with a tiny machine GRUG works
expecting success: 
    unload_resource &&
    load_resource load-file=${xml} \
load-format=hwloc prune-filters=ALL:core
ok 2 - loading resource module with an XML works
loading resource module with an XML works
expecting success: 
    unload_resource &&
    load_resource prune-filters=ALL:core
ok 3 - loading resource module with no option works
loading resource module with no option works
expecting success: 
    unload_resource &&
    flux dmesg -C &&
    load_resource load-file=${ne_grug} load-format=grug \
prune-filters=ALL:core &&
    test_must_fail flux module stats sched-fluxion-resource &&
    flux dmesg > error1 &&
    test_must_fail grep -i Success error1
flux-module: sched-fluxion-resource.stats.get: Function not implemented
ok 4 - loading resource module with a nonexistent GRUG fails
loading resource module with a nonexistent GRUG fails
expecting success: 
    unload_resource &&
    flux dmesg -C &&
    load_resource load-file=${ne_xml} load-format=hwloc \
prune-filters=ALL:core &&
    test_must_fail flux module stats sched-fluxion-resource &&
    flux dmesg > error2 &&
    test_must_fail grep -i Success error2
flux-module: sched-fluxion-resource.stats.get: Function not implemented
ok 5 - loading resource module with a nonexistent XML fails
loading resource module with a nonexistent XML fails
expecting success: 
    unload_resource &&
    flux dmesg -C &&
    load_resource load-file=${xml} load-format=grug \
prune-filters=ALL:core &&
    test_must_fail flux module stats sched-fluxion-resource &&
    flux dmesg > error3 &&
    grep -i "Invalid argument" error3
flux-module: flux_open: Connection reset by peer
flux: flux_open: No such file or directory
not ok 6 - loading resource module with incorrect reader fails
expecting success: 
    unload_resource &&
    load_resource policy=high &&
    remove_resource &&
    load_resource policy=low &&
    remove_resource &&
    load_resource policy=locality
flux-module: flux_open: No such file or directory
not ok 7 - loading resource module with known policies works
expecting success: 
    unload_resource &&
    load_resource policy=foo &&
    remove_resource &&
    load_resource policy=bar
flux-module: flux_open: No such file or directory
not ok 8 - loading resource module with unknown policies is tolerated
expecting success: 
    remove_resource
flux-module: flux_open: No such file or directory
not ok 9 - removing resource works
# failed 4 among 9 test(s)
1..9
not ok 15 - qmanager: load must fail on a bad value
FAIL: t1005-qmanager-conf.t 15 - qmanager: load must fail on a bad value
#       
#           conf_name="13-bad-value" &&
#           outfile=${conf_name}.out &&
#           test_must_fail start_qmanager ${conf_base}/${conf_name} >${outfile}
#       
dongahn commented 3 years ago

flux-module: sched-fluxion-resource.stats.get: Function not implemented

I wonder if the failure has something to do with this.

dongahn commented 3 years ago

Ah... nevermind. I think this IS correct.

dongahn commented 3 years ago

flux-module: flux_open: Connection reset by peer flux: flux_open: No such file or directory

@grondo: does it mean the test flux instance somehow exited, which then led to a series of failures?

dongahn commented 3 years ago

Do you see any sign of the flux instance crashing (SEGV, etc.), e.g. corefiles?

grondo commented 3 years ago

Yes, that indicates the broker crashed. It is very difficult to get debug output out of the buildfarm builds since they occur inside an rpmbuild running in a mock chroot. However, I can try to get further information.

I'm assuming you've tried make check on a local ppc64le machine and it passes? So these failures are environmental?

dongahn commented 3 years ago

I'm assuming you've tried make check on a local ppc64le machine and it passes? So these failures are environmental?

No I haven't tried this. Unfortunately, I won't be able to get to this because my ISCP review due date is COB today.

dongahn commented 3 years ago

For the conf test failure:

not ok 15 - qmanager: load must fail on a bad value
FAIL: t1005-qmanager-conf.t 15 - qmanager: load must fail on a bad value
#       
#           conf_name="13-bad-value" &&
#           outfile=${conf_name}.out &&
#           test_must_fail start_qmanager ${conf_base}/${conf_name} >${outfile}
#

FWIW, the conf TOML file is:

#
# Configuration for the qmanager module
#

[sched-fluxion-qmanager]

queue-policy = "easy"                # queueing policy type

# general queue parameters
    # max queue depth (applied to all policies)
    # queue-depth (applied to all policies)
queue-params = "max-queue-depth=foo,queue-depth=8192"

# queue policy parameters
    # max depth for "conservative" and "hybrid"
    # reservation depth for HYBRID
policy-params = "max-reservation-depth=100000,reservation-depth=64"
dongahn commented 3 years ago

Hmmm, I really don't understand why this would fail on this platform while working on other platforms, unless there is some sort of race condition...

Maybe the same "crash" problem....

dongahn commented 3 years ago

No I haven't tried this. Unfortunately, I won't be able to get to this because my ISCP review due date is COB today.

If I can complete my review sooner, I will try a manual make check on CORAL2 today. But at the current rate I'm going, I doubt it...

grondo commented 3 years ago

OK, I was able to get a core backtrace out of the buildfarm:

Thread 1 (Thread 0x3fff6f7eedc0 (LWP 56060)):
#0  0x00003fffa22a43d8 in raise () from /usr/lib64/libc.so.6
#1  0x00003fffa228448c in abort () from /usr/lib64/libc.so.6
#2  0x00003fffa1e6cfb0 in _Unwind_Resume () from /usr/lib64/libgcc_s.so.1
#3  0x00003fffa13abe1c in boost::property_tree::basic_ptree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::get_child(boost::property_tree::string_path<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::property_tree::id_translator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /usr/lib64/libboost_graph.so.1.66.0
#4  0x00003fffa13a348c in boost::read_graphml(std::istream&, boost::mutate_graph&, unsigned long) () from /usr/lib64/libboost_graph.so.1.66.0
#5  0x00003fffa15ccbbc in boost::read_graphml<boost::adjacency_list<boost::vecS, boost::vecS, boost::directedS, Flux::resource_model::resource_pool_gen_t, Flux::resource_model::relation_gen_t, boost::no_property, boost::listS> > (
    desired_idx=0, dp=..., g=..., in=...)
    at /usr/include/boost/graph/graphml.hpp:211
#6  Flux::resource_model::resource_gen_spec_t::read_graphml (
    this=<optimized out>, in=...) at readers/resource_spec_grug.cpp:172
#7  0x00003fffa1602420 in Flux::resource_model::resource_reader_grug_t::unpack
    (this=0x3fff640028f0, g=..., m=..., 
    str="<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE topology SYSTEM \"hwloc.dtd\">\n<topology>\n  <object type=\"Machine\" os_index=\"0\" cpuset=\"0x0000000f\" complete_cpuset=\"0x0000000f\" online_cpuset=\"0x0000000"..., 
    rank=<optimized out>) at readers/resource_reader_grug.cpp:442
#8  0x00003fffa1632360 in Flux::resource_model::resource_graph_db_t::load (
    this=<optimized out>, str=..., reader=..., rank=<optimized out>)
    at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#9  0x00003fffa1580afc in populate_resource_db_file (
    ctx=std::shared_ptr<resource_ctx_t> (use count 1, weak count 0) = {...})
    at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#10 0x00003fffa1589d98 in populate_resource_db (
    ctx=std::shared_ptr<resource_ctx_t> (use count 1, weak count 0) = {...})
    at resource_match.cpp:1220
#11 init_resource_graph (
    ctx=std::shared_ptr<resource_ctx_t> (use count 1, weak count 0) = {...})
    at resource_match.cpp:1322
#12 0x00003fffa158a87c in mod_main (h=0x3fff6403fd70, argc=<optimized out>, 
    argv=<optimized out>) at resource_match.cpp:2364
#13 0x000000011f29fee4 in module_thread (arg=0x100229cce20) at module.c:214
#14 0x00003fffa26a8878 in start_thread () from /usr/lib64/libpthread.so.0
#15 0x00003fffa23932c8 in clone () from /usr/lib64/libc.so.6
grondo commented 3 years ago

And maybe a different one:

Thread 1 (Thread 0x3fff55b2edc0 (LWP 46250)):
#0  0x00003fff80c343d8 in raise () from /usr/lib64/libc.so.6
#1  0x00003fff80c1448c in abort () from /usr/lib64/libc.so.6
#2  0x00003fff807fcfb0 in _Unwind_Resume () from /usr/lib64/libgcc_s.so.1
#3  0x00003fff55b5677c in __gnu_cxx::__stoa<long, int, char, int>(long (*)(char const*, char**, int), char const*, char const*, unsigned long*, int)::_Save_errno::~_Save_errno() (this=<optimized out>, __in_chrg=<optimized out>)
    at /usr/include/c++/8/ext/string_conversions.h:64
#4  __gnu_cxx::__stoa<long, int, char, int> (__convf=0x3fff80c3ac50 <strtoq>, 
    __name=0x3fff55b7f3e0 "stoi", __str=0x3fff4400eee8 "foo", __idx=0x0)
    at /usr/include/c++/8/ext/string_conversions.h:66
#5  0x00003fff55b508c4 in std::__cxx11::stoi (__base=10, __idx=0x0, 
    __str="foo") at /usr/include/c++/8/bits/basic_string.h:6411
#6  Flux::queue_manager::queue_policy_base_t::apply_params (
    this=0x3fff4400ebf0)
    at ../../qmanager/policies/base/queue_policy_base_impl.hpp:114
#7  0x00003fff55b5aa28 in Flux::queue_manager::detail::queue_policy_easy_t<Flux::resource_model::detail::reapi_module_t>::apply_params (this=<optimized out>)
    at ../../qmanager/policies/queue_policy_easy_impl.hpp:41
#8  0x00003fff55b53230 in enforce_params (prop=..., Python Exception <class 'OverflowError'> int too big to convert: 
queue_name=, 
    ctx=std::shared_ptr<qmanager_ctx_t> (use count 1, weak count 0) = {...})
    at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#9  enforce_queues (
    ctx=std::shared_ptr<qmanager_ctx_t> (use count 1, weak count 0) = {...})
    at qmanager.cpp:356
#10 enforce_options (
    ctx=std::shared_ptr<qmanager_ctx_t> (use count 1, weak count 0) = {...})
    at qmanager.cpp:380
#11 mod_start (h=0x3fff440060c0, argc=<optimized out>, argv=<optimized out>)
    at qmanager.cpp:488
#12 0x00003fff55b545d8 in Flux::cplusplus_wrappers::eh_wrapper_t::operator()<int (*)(flux_handle_struct*, int, char**), flux_handle_struct*&, int&, char**&> (
    f=0x3fff55b52b70 <mod_start(flux_handle_struct*, int, char**)>, 
    this=0x3fff55b2d247) at /usr/include/bits/string_fortified.h:71
#13 mod_main (h=0x3fff440060c0, argc=<optimized out>, argv=0x3fff44008b90)
    at qmanager.cpp:521
#14 0x000000010164fee4 in module_thread (arg=0x100052f7800) at module.c:214
#15 0x00003fff81038878 in start_thread () from /usr/lib64/libpthread.so.0
#16 0x00003fff80d232c8 in clone () from /usr/lib64/libc.so.6
dongahn commented 3 years ago

The first one dies within the Boost Graph Library, starting at:

at /usr/include/boost/graph/graphml.hpp:211

It could be an ABI compatibility issue between the C++ compiler used to produce /usr/lib64/libboost_graph.so.1.66.0 and the one used to build fluxion. @grondo: do you know what compiler was used to compile fluxion on the build farm?

This needs a bit more thought.

The second one crashes at:

std::__cxx11::stoi

This means std::stoi in this compiler does not tolerate a non-digit string ("foo"). This is somewhat surprising in that it actually raises a signal instead of throwing an exception (std::invalid_argument). It almost smells like a compiler bug. We will need to work around this by patching our code to check isdigit first. I can work on this once my review is done.
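
As a rough illustration of the isdigit pre-check idea (this is only a sketch, not the actual fluxion patch; parse_param_int is a hypothetical helper name), the guard could look something like:

// Hypothetical sketch only; not the actual fluxion patch.
#include <algorithm>
#include <cctype>
#include <string>

// Return true and set `out` only when `s` is a non-empty string of digits,
// so std::stoi is never handed something like "foo".
static bool parse_param_int (const std::string &s, int &out)
{
    if (s.empty () || !std::all_of (s.begin (), s.end (),
                                    [] (unsigned char c) { return std::isdigit (c) != 0; }))
        return false;
    try {
        out = std::stoi (s);
    } catch (const std::exception &) {    // e.g. out_of_range for huge values
        return false;
    }
    return true;
}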

grondo commented 3 years ago

Maybe some of the compiler flags are introducing these extra assertions?

We're using the default GCC in the buildfarm and I doubt there is more than one compiler version installed.

  flux-sched version 0.15.0
  Prefix...........: /usr
  Debug Build......: 
  C Compiler.......: gcc
  C++ Compiler.....: g++
  CFLAGS...........: -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mcpu=power8 -mtune=power8 -funwind-tables -fstack-clash-protection
  CPPFLAGS.......... 
  CXXFLAGS.......... -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mcpu=power8 -mtune=power8 -funwind-tables -fstack-clash-protection
  FLUX.............: /usr/bin/flux
  FLUX_VERSION.....: 0.23.1
  FLUX_CORE_CFLAGS.: 
  FLUX_CORE_LIBS...: -lflux-core 
  LIBFLUX_VERSION..: 0.23.1
  FLUX_PREFIX......: /usr
  LDFLAGS..........: -Wl,-z,relro  -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld
  LIBS.............: 
  Linker...........: /usr/bin/ld -m elf64ppc
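
As a side note, here is a hedged illustration (a made-up demo.cpp, not code from this build) of what the -Wp,-D_GLIBCXX_ASSERTIONS flag above can do: it turns certain libstdc++ contract violations into an immediate abort instead of silent undefined behavior, so a hardened build may die where an unchecked build limps along.

// demo.cpp -- hypothetical example; compile with: g++ -O2 -D_GLIBCXX_ASSERTIONS demo.cpp
#include <vector>

int main ()
{
    std::vector<int> v (4);
    // Out-of-range operator[] is undefined behavior in a normal build;
    // with _GLIBCXX_ASSERTIONS libstdc++ asserts here and the process aborts.
    return v[10];
}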
dongahn commented 3 years ago

Yeah, good theory -- this is probably what it is.

It would be good to run make check under a more restricted environment, though.

We can easily work around the second issue. Not a problem.

I will need to look more at the first problem to see whether we can work around it at our code level.

dongahn commented 3 years ago

Before I forget: we may want to create a testing instance with these checks in our CI. BTW, did this show up in our x86_64 build farm?

grondo commented 3 years ago

No, the x86_64 builder is succeeding.

dongahn commented 3 years ago

Slight compiler/runtime/architecture differences are probably causing this, then.

dongahn commented 3 years ago

I now believe the first issue has the same root cause as the second. The first test that causes Flux to crash is in t4000-match-params.t:

test_expect_success 'loading resource module with incorrect reader fails' '
    unload_resource &&
    flux dmesg -C &&
    load_resource load-file=${xml} load-format=grug \
prune-filters=ALL:core &&
    test_must_fail flux module stats sched-fluxion-resource &&
    flux dmesg > error3 &&
    grep -i "Invalid argument" error3
'

Apparently, the additional compiler debug checks in the build farm are causing boost::read_graphml to crash in the C/C++ standard library when an hwloc-generated XML string is passed instead of the GraphML format.

grondo commented 3 years ago

Also, FYI, the valgrind test is failing on ppc64le as well.

I just realized I have make check disabled for flux-core RPMs in TOSS4 as well. I'm beginning to think there was a reason...

dongahn commented 3 years ago

It should be relatively easy to add a simple GraphML format pre-check in our GRUG reader code before calling boost::read_graphml to avoid this error. (Too bad BGL doesn't do this at its level.)

I can work on it along with the other problem.
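
To sketch the pre-check idea (an illustration only, not the change that ultimately landed; looks_like_graphml is a hypothetical helper name): peek at the document and bail out unless the first real element is <graphml>, before handing the stream to boost::read_graphml.

// Hypothetical sketch; error reporting and stream rewinding are left to the caller.
#include <istream>
#include <iterator>
#include <string>

// Return true only if the first non-declaration/DOCTYPE/comment element in the
// document is <graphml ...>.  Consumes the stream, so the caller would need to
// re-create it (e.g. from the buffered string) before the real parse.
static bool looks_like_graphml (std::istream &in)
{
    std::string buf ((std::istreambuf_iterator<char> (in)),
                     std::istreambuf_iterator<char> ());
    std::size_t pos = buf.find ('<');
    while (pos != std::string::npos) {
        if (buf.compare (pos, 8, "<graphml") == 0)
            return true;
        if (buf.compare (pos, 2, "<?") == 0 || buf.compare (pos, 2, "<!") == 0) {
            // Skip the XML declaration, DOCTYPE, or a comment and try the next tag.
            pos = buf.find ('<', pos + 1);
            continue;
        }
        return false;   // first real element is something else, e.g. <topology>
    }
    return false;
}

In the failing test the hwloc XML's root element is <topology>, so a check like this would let the GRUG reader return EINVAL (matching the test's grep for "Invalid argument") instead of letting boost::read_graphml abort.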

grondo commented 3 years ago

If you propose something, I can relatively easily add patches to the RPM as a stopgap.

dongahn commented 3 years ago

Sure.

grondo commented 3 years ago

BTW, I was able to get TOSS3 RPMs built without this issue, so this was only a problem on RHEL8/TOSS4.

dongahn commented 3 years ago

Good to know. A generic TOSS4 platform issue then. Thanks.

dongahn commented 3 years ago

Just to let you know, I haven't forgotten this. I have a fix for the std::stoi crash, but read_graphml needs some thought. PR and patch likely by today, though.

dongahn commented 3 years ago

PR #809 is likely to work around the std::stoi crash.

But @grondo and I agreed that the fix for the boost::read_graphml crash requires access to an actual TOSS4 machine. We don't have a system yet.

Putting this issue on hold until then.

dongahn commented 3 years ago

@grondo: was this a problem this time?

grondo commented 3 years ago

I left the ppc64le build disabled. I can give it a try today.


dongahn commented 3 years ago

Thanks @grondo!

grondo commented 3 years ago

Still failing on ppc64le:

ok 4 - loading resource module with a nonexistent GRUG fails
PASS: t4000-match-params.t 4 - loading resource module with a nonexistent GRUG fails
flux-start: 0 (pid 32540) Aborted
ok 5 - loading resource module with a nonexistent XML fails
PASS: t4000-match-params.t 5 - loading resource module with a nonexistent XML fails
134
not ok 6 - loading resource module with incorrect reader fails
FAIL: t4000-match-params.t 6 - loading resource module with incorrect reader fails
#       
#           unload_resource &&
#           flux dmesg -C &&
#           load_resource load-file=${xml} load-format=grug \
#       prune-filters=ALL:core &&
#           test_must_fail flux module stats sched-fluxion-resource &&
#           flux dmesg > error3 &&
#           grep -i "Invalid argument" error3
#       
not ok 7 - loading resource module with known policies works
FAIL: t4000-match-params.t 7 - loading resource module with known policies works
#       
#           unload_resource &&
#           load_resource policy=high &&
#           remove_resource &&
#           load_resource policy=low &&
#           remove_resource &&
#           load_resource policy=locality
#       
not ok 8 - loading resource module with unknown policies is tolerated
FAIL: t4000-match-params.t 8 - loading resource module with unknown policies is tolerated
#       
#           unload_resource &&
#           load_resource policy=foo &&
#           remove_resource &&
#           load_resource policy=bar
#       
not ok 9 - removing resource works
FAIL: t4000-match-params.t 9 - removing resource works
#       
#           remove_resource
#       
# failed 4 among 9 test(s)

automated gdb backtrace:

Core was generated by `/usr/libexec/flux/cmd/flux-broker --setattr=rundir=/tmp/flux-iZZoeh -Slog-filen'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00003fffafa344d8 in raise () from /usr/lib64/libc.so.6
[Current thread is 1 (Thread 0x3fff6fffedc0 (LWP 33010))]
Thread 17 (Thread 0x3fffa88eedc0 (LWP 32565)):
#0  0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00003fffb013a864 in epoll_poll (loop=0x3fff8c006f60, 
    timeout=<optimized out>) at ev_epoll.c:155
#2  0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff8c006f60) at ev.c:3742
#3  ev_run (loop=0x3fff8c006f60, flags=<optimized out>) at ev.c:3623
#4  0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff8c001d20, 
    flags=<optimized out>) at reactor.c:126
#5  0x00003fffa8906008 in mod_main (h=0x3fff8c0061c0, argc=<optimized out>, 
    argv=<optimized out>) at resource.c:427
#6  0x00000001284707cc in module_thread (arg=0x100357713e0) at module.c:205
#7  0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 16 (Thread 0x3fffaa44edc0 (LWP 32557)):
#0  0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00003fffb013a864 in epoll_poll (loop=0x3fffa0006f60, 
    timeout=<optimized out>) at ev_epoll.c:155
#2  0x00003fffb013df14 in ev_run (flags=0, loop=0x3fffa0006f60) at ev.c:3742
#3  ev_run (loop=0x3fffa0006f60, flags=<optimized out>) at ev.c:3623
#4  0x00003fffb00f1b94 in flux_reactor_run (r=0x3fffa0001d20, 
    flags=<optimized out>) at reactor.c:126
#5  0x00003fffaa634f08 in mod_main (h=0x3fffa00061c0, argc=<optimized out>, 
    argv=<optimized out>) at content-sqlite.c:632
#6  0x00000001284707cc in module_thread (arg=0x100357652b0) at module.c:205
#7  0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 15 (Thread 0x3fff92e9edc0 (LWP 32626)):
#0  0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00003fffb013a864 in epoll_poll (loop=0x3fff84006f60, 
    timeout=<optimized out>) at ev_epoll.c:155
#2  0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff84006f60) at ev.c:3742
#3  ev_run (loop=0x3fff84006f60, flags=<optimized out>) at ev.c:3623
#4  0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff84001d20, 
    flags=<optimized out>) at reactor.c:126
#5  0x00003fff92eb51f8 in mod_main (h=0x3fff840061c0, argc=<optimized out>, 
    argv=<optimized out>) at job-info.c:146
#6  0x00000001284707cc in module_thread (arg=0x100357815d0) at module.c:205
#7  0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 14 (Thread 0x3fffa9a7edc0 (LWP 32560)):
#0  0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00003fffb013a864 in epoll_poll (loop=0x3fff94006f60, 
    timeout=<optimized out>) at ev_epoll.c:155
#2  0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff94006f60) at ev.c:3742
#3  ev_run (loop=0x3fff94006f60, flags=<optimized out>) at ev.c:3623
#4  0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff94001d20, 
    flags=<optimized out>) at reactor.c:126
#5  0x00003fffa9a9bf74 in mod_main (h=0x3fff940061c0, argc=<optimized out>, 
    argv=<optimized out>) at kvs.c:2908
#6  0x00000001284707cc in module_thread (arg=0x10035765610) at module.c:205
#7  0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 13 (Thread 0x3fff925dedc0 (LWP 32643)):
#0  0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00003fffb013a864 in epoll_poll (loop=0x3fff78006f60, 
    timeout=<optimized out>) at ev_epoll.c:155
#2  0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff78006f60) at ev.c:3742
#3  ev_run (loop=0x3fff78006f60, flags=<optimized out>) at ev.c:3623
#4  0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff78001d20, 
    flags=<optimized out>) at reactor.c:126
#5  0x00003fff925f53b4 in mod_main (h=0x3fff780061c0, argc=<optimized out>, 
    argv=<optimized out>) at job-list.c:155
#6  0x00000001284707cc in module_thread (arg=0x100357846b0) at module.c:205
#7  0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 12 (Thread 0x3fffabffedc0 (LWP 32543)):
#0  0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00003fffafee2100 in zmq::epoll_t::loop() [clone .part.11] ()
   from /usr/lib64/libzmq.so.5
#2  0x00003fffaff10ec8 in zmq::worker_poller_base_t::worker_routine(void*) ()
   from /usr/lib64/libzmq.so.5
#3  0x00003fffaff3c3d0 in thread_routine () from /usr/lib64/libzmq.so.5
#4  0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 11 (Thread 0x3fffa919edc0 (LWP 32562)):
#0  0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00003fffb013a864 in epoll_poll (loop=0x3fff98006f60, 
    timeout=<optimized out>) at ev_epoll.c:155
#2  0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff98006f60) at ev.c:3742
#3  ev_run (loop=0x3fff98006f60, flags=<optimized out>) at ev.c:3623
#4  0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff98001d20, 
    flags=<optimized out>) at reactor.c:126
#5  0x00003fffa91f67e8 in mod_main (h=0x3fff980061c0, argc=<optimized out>, 
    argv=<optimized out>) at kvs-watch.c:1165
#6  0x00000001284707cc in module_thread (arg=0x1003576ca40) at module.c:205
#7  0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 10 (Thread 0x3fffaaebedc0 (LWP 32553)):
#0  0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00003fffb013a864 in epoll_poll (loop=0x3fff9c006f60, 
    timeout=<optimized out>) at ev_epoll.c:155
#2  0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff9c006f60) at ev.c:3742
#3  ev_run (loop=0x3fff9c006f60, flags=<optimized out>) at ev.c:3623
#4  0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff9c001d20, 
    flags=<optimized out>) at reactor.c:126
#5  0x00003fffaaed4228 in mod_main (h=0x3fff9c0061c0, argc=<optimized out>, 
    argv=<optimized out>) at barrier.c:499
#6  0x00000001284707cc in module_thread (arg=0x100357642f0) at module.c:205
#7  0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 9 (Thread 0x3fff91d3edc0 (LWP 32657)):
#0  0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00003fffb013a864 in epoll_poll (loop=0x3fff7c006f60, 
    timeout=<optimized out>) at ev_epoll.c:155
#2  0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff7c006f60) at ev.c:3742
#3  ev_run (loop=0x3fff7c006f60, flags=<optimized out>) at ev.c:3623
#4  0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff7c001d20, 
    flags=<optimized out>) at reactor.c:126
#5  0x00003fff91d577e0 in mod_main (h=0x3fff7c0061c0, argc=<optimized out>, 
    argv=<optimized out>) at job-ingest.c:1036
#6  0x00000001284707cc in module_thread (arg=0x10035786a40) at module.c:205
#7  0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 8 (Thread 0x3fff93ffedc0 (LWP 32605)):
#0  0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00003fffb013a864 in epoll_poll (loop=0x3fff88006f60, 
    timeout=<optimized out>) at ev_epoll.c:155
#2  0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff88006f60) at ev.c:3742
#3  ev_run (loop=0x3fff88006f60, flags=<optimized out>) at ev.c:3623
#4  0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff88001d20, 
    flags=<optimized out>) at reactor.c:126
#5  0x00003fffa8068c38 in mod_main (h=0x3fff880061c0, ac=<optimized out>, 
    av=<optimized out>) at cron.c:906
#6  0x00000001284707cc in module_thread (arg=0x100357715e0) at module.c:205
#7  0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 7 (Thread 0x3fff9148edc0 (LWP 32664)):
#0  0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00003fffb013a864 in epoll_poll (loop=0x3fff70006f60, 
    timeout=<optimized out>) at ev_epoll.c:155
#2  0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff70006f60) at ev.c:3742
#3  ev_run (loop=0x3fff70006f60, flags=<optimized out>) at ev.c:3623
#4  0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff70001d20, 
    flags=<optimized out>) at reactor.c:126
#5  0x00003fff914a7518 in mod_main (h=0x3fff700061c0, argc=<optimized out>, 
    argv=<optimized out>) at job-exec.c:1128
#6  0x00000001284707cc in module_thread (arg=0x10035789100) at module.c:205
#7  0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 6 (Thread 0x3fff9374edc0 (LWP 32612)):
#0  0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00003fffb013a864 in epoll_poll (loop=0x3fff80006f60, 
    timeout=<optimized out>) at ev_epoll.c:155
#2  0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff80006f60) at ev.c:3742
#3  ev_run (loop=0x3fff80006f60, flags=<optimized out>) at ev.c:3623
#4  0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff80001d20, 
    flags=<optimized out>) at reactor.c:126
#5  0x00003fff93766784 in mod_main (h=0x3fff800061c0, argc=<optimized out>, 
    argv=<optimized out>) at job-manager.c:205
#6  0x00000001284707cc in module_thread (arg=0x10035779170) at module.c:205
#7  0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 5 (Thread 0x3fffac80edc0 (LWP 32542)):
#0  0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00003fffafee2100 in zmq::epoll_t::loop() [clone .part.11] ()
   from /usr/lib64/libzmq.so.5
#2  0x00003fffaff10ec8 in zmq::worker_poller_base_t::worker_routine(void*) ()
   from /usr/lib64/libzmq.so.5
#3  0x00003fffaff3c3d0 in thread_routine () from /usr/lib64/libzmq.so.5
#4  0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x3fff90bfedc0 (LWP 32668)):
#0  0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00003fffb013a864 in epoll_poll (loop=0x3fff74006f60, 
    timeout=<optimized out>) at ev_epoll.c:155
#2  0x00003fffb013df14 in ev_run (flags=0, loop=0x3fff74006f60) at ev.c:3742
#3  ev_run (loop=0x3fff74006f60, flags=<optimized out>) at ev.c:3623
#4  0x00003fffb00f1b94 in flux_reactor_run (r=0x3fff74001d20, 
    flags=<optimized out>) at reactor.c:126
#5  0x00003fff90c132e0 in mod_main (h=0x3fff740061c0, argc=<optimized out>, 
    argv=<optimized out>) at heartbeat.c:159
#6  0x00000001284707cc in module_thread (arg=0x1003578b6e0) at module.c:205
#7  0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x3fffb021d820 (LWP 32540)):
#0  0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00003fffb013a864 in epoll_poll (
    loop=0x3fffb0194f10 <default_loop_struct>, timeout=<optimized out>)
    at ev_epoll.c:155
#2  0x00003fffb013df14 in ev_run (flags=0, 
    loop=0x3fffb0194f10 <default_loop_struct>) at ev.c:3742
#3  ev_run (loop=0x3fffb0194f10 <default_loop_struct>, flags=<optimized out>)
    at ev.c:3623
#4  0x00003fffb00f1b94 in flux_reactor_run (r=0x100357359e0, 
    flags=<optimized out>) at reactor.c:126
#5  0x000000012846c2b4 in main (argc=<optimized out>, argv=<optimized out>)
    at broker.c:505
Thread 2 (Thread 0x3fffab7eedc0 (LWP 32544)):
#0  0x00003fffafb2343c in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00003fffb013a864 in epoll_poll (loop=0x3fffa4007720, 
    timeout=<optimized out>) at ev_epoll.c:155
#2  0x00003fffb013df14 in ev_run (flags=0, loop=0x3fffa4007720) at ev.c:3742
#3  ev_run (loop=0x3fffa4007720, flags=<optimized out>) at ev.c:3623
#4  0x00003fffb00f1b94 in flux_reactor_run (r=0x3fffa4007700, 
    flags=<optimized out>) at reactor.c:126
#5  0x00003fffac82393c in mod_main (h=0x3fffa40068b0, argc=<optimized out>, 
    argv=<optimized out>) at local.c:328
#6  0x00000001284707cc in module_thread (arg=0x10035745630) at module.c:205
#7  0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x3fff6fffedc0 (LWP 33010)):
#0  0x00003fffafa344d8 in raise () from /usr/lib64/libc.so.6
#1  0x00003fffafa1462c in abort () from /usr/lib64/libc.so.6
#2  0x00003fffaf5fcfb0 in _Unwind_Resume () from /usr/lib64/libgcc_s.so.1
#3  0x00003fffaeb2be1c in boost::property_tree::basic_ptree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::get_child(boost::property_tree::string_path<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::property_tree::id_translator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /usr/lib64/libboost_graph.so.1.66.0
#4  0x00003fffaeb2348c in boost::read_graphml(std::istream&, boost::mutate_graph&, unsigned long) () from /usr/lib64/libboost_graph.so.1.66.0
#5  0x00003fffaed515dc in boost::read_graphml<boost::adjacency_list<boost::vecS, boost::vecS, boost::directedS, Flux::resource_model::resource_pool_gen_t, Flux::resource_model::relation_gen_t, boost::no_property, boost::listS> > (
    desired_idx=0, dp=..., g=..., in=...)
    at /usr/include/boost/graph/graphml.hpp:211
#6  Flux::resource_model::resource_gen_spec_t::read_graphml (
    this=<optimized out>, in=...) at readers/resource_spec_grug.cpp:172
#7  0x00003fffaed87580 in Flux::resource_model::resource_reader_grug_t::unpack
    (this=0x3fff6000f8f0, g=..., m=..., str=..., rank=<optimized out>)
    at readers/resource_reader_grug.cpp:471
#8  0x00003fffaedb7f20 in Flux::resource_model::resource_graph_db_t::load (
    this=<optimized out>, str=..., reader=..., rank=<optimized out>)
    at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#9  0x00003fffaed03848 in populate_resource_db_file (
    ctx=std::shared_ptr<resource_ctx_t> (use count 1, weak count 0) = {...})
    at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#10 0x00003fffaed0cf98 in populate_resource_db (
    ctx=std::shared_ptr<resource_ctx_t> (use count 1, weak count 0) = {...})
    at resource_match.cpp:1226
#11 init_resource_graph (
    ctx=std::shared_ptr<resource_ctx_t> (use count 1, weak count 0) = {...})
    at resource_match.cpp:1328
#12 0x00003fffaed0daa0 in mod_main (h=0x3fff60001750, argc=<optimized out>, 
    argv=<optimized out>) at resource_match.cpp:2449
#13 0x00000001284707cc in module_thread (arg=0x10035759150) at module.c:205
#14 0x00003fffafe38878 in start_thread () from /usr/lib64/libpthread.so.0
#15 0x00003fffafb22f68 in clone () from /usr/lib64/libc.so.6
grondo commented 2 years ago

I tested the ppc64le build while building RPMs for v0.17.0, and this issue still occurs.

grondo commented 1 year ago

Still an issue with 0.26.0. Backtrace:

Thread 1 (Thread 0x3fff7571edc0 (LWP 71532)):
#0  0x00003fffa0eaa498 in raise () from /lib64/libc.so.6
#1  0x00003fffa0e84a54 in abort () from /lib64/libc.so.6
#2  0x00003fffa09ccfb0 in _Unwind_Resume () from /lib64/libgcc_s.so.1
#3  0x00003fff9fdbbe3c in boost::property_tree::basic_ptree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::get_child(boost::property_tree::string_path<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::property_tree::id_translator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /lib64/libboost_graph.so.1.66.0
#4  0x00003fff9fdb34ac in boost::read_graphml(std::istream&, boost::mutate_graph&, unsigned long) () from /lib64/libboost_graph.so.1.66.0
#5  0x00003fff9ff4a86c in boost::read_graphml<boost::adjacency_list<boost::vecS, boost::vecS, boost::directedS, Flux::resource_model::resource_pool_gen_t, Flux::resource_model::relation_gen_t, boost::no_property, boost::listS> > (
    desired_idx=0, dp=..., g=..., in=...)
    at /usr/include/boost/graph/graphml.hpp:211
#6  Flux::resource_model::resource_gen_spec_t::read_graphml (
    this=<optimized out>, in=...) at readers/resource_spec_grug.cpp:158
#7  0x00003fff9ff808d0 in Flux::resource_model::resource_reader_grug_t::unpack
    (this=0x3fff5c0073c0, g=..., m=..., str=..., rank=<optimized out>)
    at readers/resource_reader_grug.cpp:458
#8  0x00003fff9ffbcfe0 in Flux::resource_model::resource_graph_db_t::load (
    this=<optimized out>, str=..., reader=..., rank=<optimized out>)
    at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#9  0x00003fff9fef0fe4 in populate_resource_db_file (
    ctx=std::shared_ptr<resource_ctx_t> (use count 1, weak count 0) = {...})
    at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#10 0x00003fff9fefc948 in populate_resource_db (
    ctx=std::shared_ptr<resource_ctx_t> (use count 1, weak count 0) = {...})
    at resource_match.cpp:1309
#11 init_resource_graph (
    ctx=std::shared_ptr<resource_ctx_t> (use count 1, weak count 0) = {...})
    at resource_match.cpp:1413
#12 0x00003fff9fefd114 in mod_main (h=0x3fff5c0093b0, argc=<optimized out>, 
    argv=<optimized out>) at resource_match.cpp:2661
#13 0x00000000100140f4 in module_thread (arg=0x10014d8ca70) at module.c:183
#14 0x00003fffa14c9718 in start_thread () from /lib64/libpthread.so.0
#15 0x00003fffa0f9b498 in clone () from /lib64/libc.so.6
trws commented 11 months ago

The post-mortem here is that somehow we have been ending up with the ppc64le build linking two unwinding implementations at the same time, both libstdc++ and libunwind, and they're slicing each other. I haven't explicitly checked if this is continuing, but as best I can tell it's an issue caused by the way one of our dependencies is being built.

grondo commented 11 months ago

Oh, yeah, sorry I never updated this issue! I think we'd need to rebuild zeromq not to include libunwind in TOSS 4 to work around this...