STEllAR-GROUP / hpx

The C++ Standard Library for Parallelism and Concurrency
https://hpx.stellar-group.org
Boost Software License 1.0

--hpx:queuing=shared fails for distributed runs #6190

Open hkaiser opened 1 year ago

hkaiser commented 1 year ago

From IRC:

[16:37] beojan: I've noticed that if I use the `--hpx:queuing=shared` option to enable a shared queue across hardware threads, my program crashes when I run it through mpirun with -n >= 2.
[16:38] beojan: I originally noticed this with my Gaudi port, but it also happens with my toy demo: https://github.com/beojan/HPXDemo
[16:39] beojan: Here's the error:
[16:39] beojan: {os-thread}: locality#1/worker-thread#1
[16:39] beojan: {thread-description}: <unknown>
[16:39] beojan: {state}: not running
[16:39] beojan: {auxinfo}: 
[16:39] beojan: {file}: /home/beojan/Development/src/hpx/src/hpx-1.8.1/libs/core/schedulers/include/hpx/schedulers/thread_queue_mc.hpp
[16:39] beojan: {line}: 247
[16:39] beojan: {function}: thread_queue_mc::create_thread
[16:39] beojan: {what}: staged tasks must have 'pending' as their initial state: HPX(bad_parameter)

beojan commented 1 year ago

Here's the full stacktrace for that thread (now with hpx-1.9.0-rc1):

{stack-trace}: 13 frames:
0x7f06aeeb12bb  : /usr/lib/libhpx.so.1(+0x4b12bb) [0x7f06aeeb12bb] in /usr/lib/libhpx.so.1
0x7f06ae7387ec  : std::__exception_ptr::exception_ptr hpx::detail::get_exception<hpx::exception>(hpx::exception const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) [0xac] in /usr/lib/libhpx_core.so
0x7f06ae738906  : void hpx::detail::throw_exception<hpx::exception>(hpx::exception const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) [0x76] in /usr/lib/libhpx_core.so
0x7f06ae73e3c1  : hpx::detail::throw_exception(hpx::error, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) [0xd1] in /usr/lib/libhpx_core.so
0x7f06ae86a70b  : /usr/lib/libhpx_core.so(+0x26a70b) [0x7f06ae86a70b] in /usr/lib/libhpx_core.so
0x7f06ae804c54  : hpx::threads::detail::create_background_thread(hpx::threads::policies::scheduler_base&, unsigned long, hpx::threads::detail::scheduling_callbacks&, std::shared_ptr<bool>&, long&) [0x1a4] in /usr/lib/libhpx_core.so
0x7f06ae86bb0e  : /usr/lib/libhpx_core.so(+0x26bb0e) [0x7f06ae86bb0e] in /usr/lib/libhpx_core.so
0x7f06ae86c795  : hpx::threads::detail::scheduled_thread_pool<hpx::threads::policies::shared_priority_queue_scheduler<std::mutex, hpx::threads::policies::concurrentqueue_fifo, hpx::threads::policies::lockfree_lifo> >::thread_func(unsigned long, unsigned long, std::shared_ptr<hpx::util::barrier>) [0x4f5] in /usr/lib/libhpx_core.so
0x7f06ae816695  : std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (hpx::threads::detail::scheduled_thread_pool<hpx::threads::policies::shared_priority_queue_scheduler<std::mutex, hpx::threads::policies::concurrentqueue_fifo, hpx::threads::policies::lockfree_lifo> >::*)(unsigned long, unsigned long, std::shared_ptr<hpx::util::barrier>), hpx::threads::detail::scheduled_thread_pool<hpx::threads::policies::shared_priority_queue_scheduler<std::mutex, hpx::threads::policies::concurrentqueue_fifo, hpx::threads::policies::lockfree_lifo> >*, unsigned long, unsigned long, std::shared_ptr<hpx::util::barrier> > > >::_M_run() [0x55] in /usr/lib/libhpx_core.so
0x7f067cad72c3  : /usr/lib/libstdc++.so.6(+0xd72c3) [0x7f067cad72c3] in /usr/lib/libstdc++.so.6
0x7f067c89ebb5  : /usr/lib/libc.so.6(+0x85bb5) [0x7f067c89ebb5] in /usr/lib/libc.so.6
0x7f067c920d90  : /usr/lib/libc.so.6(+0x107d90) [0x7f067c920d90] in /usr/lib/libc.so.6
{locality-id}: 1
{hostname}: [ (mpi:1) (tcp:127.0.0.1:7911) ]
{process-id}: 68100
{os-thread}: locality#1/worker-thread#5
{thread-description}: <unknown>
{state}: state::pre_main
{auxinfo}: 
{file}: /home/beojan/Development/src/hpx/src/hpx-1.9.0-rc1/libs/core/schedulers/include/hpx/schedulers/thread_queue_mc.hpp
{line}: 249
{function}: thread_queue_mc::create_thread
{what}: staged tasks must have 'pending' as their initial state: HPX(bad_parameter)

hkaiser commented 1 year ago

@beojan I'm not able to reproduce this issue locally. What application did you run?

beojan commented 1 year ago

My demo app is at https://github.com/beojan/HPXDemo.

beojan commented 1 year ago

If I use the Intel mpirun executable (with the demo linked against OpenMPI), it doesn't crash, but this is clearly a faulty setup because the mpirun version doesn't match the libmpi version.

My Gaudi port understandably crashes during MPI initialization with such a setup.

hkaiser commented 1 year ago

@beojan would you have more information on how we could reproduce this issue? Are you using any specific environment?

beojan commented 1 year ago

With the demo, I'm running on my laptop (Arch Linux) with HPX 1.9.0-rc1 and OpenMPI 4.1.

You can comment out the TBB and CUDA demos in the CMake file, though you do need oneMKL available to build it.