STEllAR-GROUP / hpx

The C++ Standard Library for Parallelism and Concurrency
https://hpx.stellar-group.org
Boost Software License 1.0

Parcelport fails to initialize when multiple jobs run on the same cluster #6097

Open · antoniupop opened this issue 1 year ago

antoniupop commented 1 year ago

Expected Behavior

The expected behavior is that multiple independent jobs (e.g. a SLURM job array) can run concurrently on the same cluster, on disjoint sets of nodes (not co-scheduled).

Actual Behavior

Only one (or none) of the jobs is able to run, while all others crash at initialization with the following errors:

the bootstrap parcelport (tcp) has failed to initialize on locality 0:
<unknown>: HPX(network_error),
bailing out
terminate called without an active exception
srun: error: queue1-dy-m5a2xlarge-1: task 0: Exited with exit code 255
the bootstrap parcelport (tcp) has failed to initialize on locality 4294967295:
<unknown>: HPX(network_error),
bailing out
terminate called without an active exception
the bootstrap parcelport (tcp) has failed to initialize on locality 4294967295:
<unknown>: HPX(network_error),
bailing out

Steps to Reproduce the Problem

Schedule multiple jobs on a SLURM cluster without dependencies, each using only a subset of the nodes (so the SLURM scheduler can start multiple instances on separate partitions).

I tried using the MPI parcelport and disabling TCP, to no avail (the error changes, but initialization still fails).
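For reference, a minimal SLURM job-array script reproducing this setup might look as follows. The job name, node counts, and the `hpx_app` binary are placeholders, not taken from the report:

```shell
#!/bin/bash
# Hypothetical reproduction: four independent HPX jobs submitted as one array,
# each job using only a subset of the cluster's nodes, so SLURM can run
# several of them concurrently on disjoint node sets.
#SBATCH --job-name=hpx-parcelport-repro
#SBATCH --array=0-3
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

# hpx_app stands in for any distributed HPX program.
srun ./hpx_app
```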

Specifications

hkaiser commented 1 year ago

Disabling the TCP parcelport should help. How did you disable it?

antoniupop commented 1 year ago

Disabling the TCP parcelport should help. How did you disable it?

I've used -DHPX_WITH_PARCELPORT_MPI=ON -DHPX_WITH_PARCELPORT_TCP=OFF when building HPX.
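For completeness, HPX also exposes per-run configuration keys that can select parcelports at launch time without rebuilding; a sketch, assuming a build that still contains both parcelports (key names are taken from the HPX configuration reference and may vary by version):

```shell
# Disable the TCP parcelport and use MPI, including for bootstrap,
# via HPX ini settings passed on the command line.
srun ./hpx_app \
  --hpx:ini=hpx.parcel.tcp.enable=0 \
  --hpx:ini=hpx.parcel.mpi.enable=1 \
  --hpx:ini=hpx.parcel.bootstrap=mpi
```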

hkaiser commented 1 year ago

Disabling the TCP parcelport should help. How did you disable it?

I've used -DHPX_WITH_PARCELPORT_MPI=ON -DHPX_WITH_PARCELPORT_TCP=OFF when building HPX.

Could you give us the error message you see in this case, please?

antoniupop commented 1 year ago

I've used -DHPX_WITH_PARCELPORT_MPI=ON -DHPX_WITH_PARCELPORT_TCP=OFF when building HPX.

Could you give us the error message you see in this case, please?

I used to get an error message along the lines of "failed to initialize the parcelport" before, but now it crashes with the following:

0x7f5be6dfc3c0  : /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f5be6dfc3c0] in /lib/x86_64-linux-gnu/libpthread.so.0
0x7f5be6437513  : /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1(+0x6ec513) [0x7f5be6437513] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be5f8e5f7  : hpx::parcelset::detail::parcel_await_apply(hpx::parcelset::parcel&&, hpx::util::function<void (std::error_code const&, hpx::parcelset::parcel const&), false>&&, unsigned int, hpx::util::unique_function<void (hpx::parcelset::parcel&&, hpx::util::function<void (std::error_code const&, hpx::parcelset::parcel const&), false>&&), false>) [0xc7] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be643cbc2  : void hpx::agas::big_boot_barrier::apply<hpx::actions::direct_action<void (*)(hpx::agas::registration_header const&), &hpx::agas::register_worker, hpx::actions::detail::this_type>, hpx::agas::registration_header>(unsigned int, unsigned int, hpx::parcelset::locality, hpx::actions::direct_action<void (*)(hpx::agas::registration_header const&), &hpx::agas::register_worker, hpx::actions::detail::this_type>, hpx::agas::registration_header&&) [0x1a2] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be64362cc  : hpx::agas::big_boot_barrier::wait_hosted(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) [0x4fc] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be644c563  : hpx::runtime_distributed::initialize_agas() [0x283] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be644fd47  : hpx::runtime_distributed::runtime_distributed(hpx::util::runtime_configuration&, int (*)(hpx::runtime_mode)) [0xf17] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be62e176e  : hpx::detail::run_or_start(hpx::util::function<int (hpx::program_options::variables_map&), false> const&, int, char**, hpx::init_params const&, bool) [0xd8e] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1

I'm not quite sure what changed to make this crash instead. I'm still trying to reproduce the previous behaviour, but there is no difference between the code now running with the MPI parcelport and the original code using TCP.