LLNL / Aluminum

High-performance, GPU-aware communication library
https://aluminum.readthedocs.io/en/latest/

In-place MPI scatter segfaults on one processor #90

Closed ndryden closed 1 year ago

ndryden commented 3 years ago
$ jsrun --bind packed:8 --nrs 1 --rs_per_host 1 --tasks_per_rs 1 --launch_distribution packed --cpu_per_rs ALL_CPUS --gpu_per_rs ALL_GPUS ./test_ops.exe --backend mpi --op scatter --inplace
Aborting after hang in Al size=1

Scatterv also hangs.
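For reference, a minimal in-place MPI_Scatterv on a single rank looks roughly like the following; this is just an illustrative sketch, not the code the test harness actually runs:

#include <vector>
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  std::vector<float> buf(1, 0.0f);
  std::vector<int> counts(1, 1);  // one element for the single rank
  std::vector<int> displs(1, 0);
  // At the root, MPI_IN_PLACE is passed as the receive buffer and the
  // receive count/type arguments are ignored.
  MPI_Scatterv(buf.data(), counts.data(), displs.data(), MPI_FLOAT,
               MPI_IN_PLACE, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
  MPI_Finalize();
}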

Regular (out-of-place) scatter works fine, and in-place scatter also works with more than one process. Likewise, the other rooted collectives (bcast, gather, reduce) work in this configuration. This is a bit strange, since an in-place scatter on a single process should be a no-op.

Need to also verify this is not an SMPI bug.

ndryden commented 3 years ago

This is actually a segfault that appears to be a bug in SpectrumMPI. The apparent hang is due to our signal handler handling the fault poorly.

A simple reproducer results in a segfault on Lassen:

#include <vector>
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  std::vector<float> buf(1, 0.0f);
  // In-place scatter: the root passes MPI_IN_PLACE as the receive buffer,
  // so with a single process this should leave buf untouched.
  MPI_Scatter(buf.data(), 1, MPI_FLOAT, MPI_IN_PLACE, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
  MPI_Finalize();
}

This appears to occur only when using in-place MPI_Scatter on a single process with a buffer provided by an std::vector. It works fine if the buffer is, for example, allocated with new. It also works on Pascal with MVAPICH.
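For contrast, here is a variant of the reproducer with the buffer allocated via new, which per the observation above does not segfault (again just a sketch):

#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  // Same in-place scatter, but with a plain heap allocation instead of
  // an std::vector; as noted above, this case works.
  float* buf = new float[1]();
  MPI_Scatter(buf, 1, MPI_FLOAT, MPI_IN_PLACE, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
  delete[] buf;
  MPI_Finalize();
}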

ndryden commented 1 year ago

This has been fixed in the latest SpectrumMPI release.