Open antoniupop opened 1 year ago
Disabling the TCP parcelport should help. How did you disable it?
I've used -DHPX_WITH_PARCELPORT_MPI=ON -DHPX_WITH_PARCELPORT_TCP=OFF when building HPX.
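For readers following along, a configure step with those two flags might look like the sketch below. The source and build directory names are assumptions (loosely based on the `hpx-1.7.1/build` path visible in the backtrace further down), and the snippet only prints the command so it can be reviewed before running it.

```shell
# Hypothetical locations; adjust to your own checkout.
HPX_SRC="hpx-1.7.1"
HPX_BUILD="$HPX_SRC/build"

# Flags quoted from the comment above: MPI parcelport on, TCP parcelport off.
CONFIGURE_CMD="cmake -S $HPX_SRC -B $HPX_BUILD -DHPX_WITH_PARCELPORT_MPI=ON -DHPX_WITH_PARCELPORT_TCP=OFF"

# Print the command for review; run it with: eval "$CONFIGURE_CMD"
echo "$CONFIGURE_CMD"
```

After configuring, `grep PARCELPORT "$HPX_BUILD/CMakeCache.txt"` is one way to confirm which parcelport options the build actually recorded.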
Could you give us the error message you see in this case, please?
Previously I got an error message along the lines of a failure to initialise the parcelport, but now it crashes with the following backtrace:
0x7f5be6dfc3c0 : /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f5be6dfc3c0] in /lib/x86_64-linux-gnu/libpthread.so.0
0x7f5be6437513 : /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1(+0x6ec513) [0x7f5be6437513] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be5f8e5f7 : hpx::parcelset::detail::parcel_await_apply(hpx::parcelset::parcel&&, hpx::util::function<void (std::error_code const&, hpx::parcelset::parcel const&), false>&&, unsigned int, hpx::util::unique_function<void (hpx::parcelset::parcel&&, hpx::util::function<void (std::error_code const&, hpx::parcelset::parcel const&), false>&&), false>) [0xc7] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be643cbc2 : void hpx::agas::big_boot_barrier::apply<hpx::actions::direct_action<void (*)(hpx::agas::registration_header const&), &hpx::agas::register_worker, hpx::actions::detail::this_type>, hpx::agas::registration_header>(unsigned int, unsigned int, hpx::parcelset::locality, hpx::actions::direct_action<void (*)(hpx::agas::registration_header const&), &hpx::agas::register_worker, hpx::actions::detail::this_type>, hpx::agas::registration_header&&) [0x1a2] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be64362cc : hpx::agas::big_boot_barrier::wait_hosted(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) [0x4fc] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be644c563 : hpx::runtime_distributed::initialize_agas() [0x283] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be644fd47 : hpx::runtime_distributed::runtime_distributed(hpx::util::runtime_configuration&, int (*)(hpx::runtime_mode)) [0xf17] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
0x7f5be62e176e : hpx::detail::run_or_start(hpx::util::function<int (hpx::program_options::variables_map&), false> const&, int, char**, hpx::init_params const&, bool) [0xd8e] in /shared/concrete-compiler-internal/compiler/hpx-1.7.1/build/lib/libhpx.so.1
I'm not sure what changed to make it crash instead, and I'm still trying to reproduce the previous behaviour. The application code is identical between the current run with the MPI parcelport and the initial run using TCP.
Expected Behavior
Multiple independent jobs (e.g. a SLURM job array) should be able to run concurrently on the same cluster (on disjoint sets of nodes, not co-scheduled).
Actual Behavior
Only one (or none) of the jobs is able to run while all others crash at initialization with the following errors:
Steps to Reproduce the Problem
Schedule multiple jobs on a SLURM cluster with no dependencies between them, each using only a subset of the nodes (so that the SLURM scheduler can start multiple instances on separate partitions).
I tried using the MPI parcelport and disabling TCP, to no avail: the error changes, but initialization still fails.
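A minimal sketch of such a submission, for anyone trying to reproduce this: the snippet below writes a job-array script to a file. The array size, node count, file name `hpx_array.sbatch`, and the binary name `./my_hpx_app` are all hypothetical placeholders, not values taken from the original report.

```shell
# Write a hypothetical job-array script; every value here is a placeholder.
cat > hpx_array.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hpx-array
#SBATCH --array=0-3
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

# Each array task is an independent job on its own set of nodes.
echo "array task ${SLURM_ARRAY_TASK_ID} on nodes: ${SLURM_JOB_NODELIST}"
srun ./my_hpx_app
EOF

# Submit with: sbatch hpx_array.sbatch
```

With enough free nodes, the scheduler may start several array tasks at the same time, which is the situation described in the steps above.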
Specifications