ECP-copa / CabanaMD

Molecular dynamics proxy application based on Cabana

Intranode GPU communication crashes in MPI called from Cabana::Gather::apply() #106

Open patrickb314 opened 1 year ago

patrickb314 commented 1 year ago

CabanaMD with the standard in.lj test case crashes on both LLNL Lassen (Spectrum MPI or MVAPICH2) and LANL Chicoma (Cray MPICH) when communicating between GPUs on the same node. It works when communicating inter-node, though I expect that is because MPI is not as strict in error checking the ordinary send path as it is in the RMA routines it uses for intra-node communication. I've enabled GPU-aware communication in all cases.
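For reference, GPU-aware transport on these systems is opt-in per MPI implementation. The settings below are a site-specific sketch, not taken from the issue; the exact flags follow each MPI's documentation, and the launch command is hypothetical:

```shell
# Cray MPICH (Chicoma): GPU-aware transfers must be enabled explicitly
export MPICH_GPU_SUPPORT_ENABLED=1

# MVAPICH2 (Lassen): enable CUDA-aware paths
export MV2_USE_CUDA=1

# Spectrum MPI (Lassen): request GPU support at launch time, e.g.
#   jsrun --smpiargs="-gpu" -n 4 ./CabanaMD -il in.lj
```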

The MPI_Send call invoked by Cabana::Gather::apply() (line 335 of Cabana_Halo.cpp) appears to be what is crashing. Here's the Lassen lwcore traceback from Spectrum MPI:

__GI___assert_fail@assert.c:101
PAMI::Protocol::Get::GetRdma<PAMI::Device::Shmem::DmaModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u,@libpami.so.3
PAMI::Protocol::Get::CompositeRGet<PAMI::Protocol::Get::RGet,@libpami.so.3
PAMI::Context::rget_impl(pami_rget_simple_t*)@libpami.so.3
PAMI_Rget@libpami.so.3
process_rndv_msg@mca_pml_pami.so
pml_pami_recv_rndv_cb@mca_pml_pami.so
PAMI::Protocol::Send::EagerSimple<PAMI::Device::Shmem::PacketModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u,@libpami.so.3
PAMI_Context_advancev@libpami.so.3
mca_pml_pami_progress_wait@mca_pml_pami.so
mca_pml_pami_send@mca_pml_pami.so
PMPI_Send@libmpi_ibm.so.3
Cabana::Gather<Cabana::Halo<Kokkos::Device<Kokkos::Cuda,@()
void@()
Comm<System<Kokkos::Device<Kokkos::Cuda,@()
CbnMD<System<Kokkos::Device<Kokkos::Cuda,@()
main@()
---STACK
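Since the failing call is an ordinary MPI_Send on a device buffer, a minimal CUDA-aware MPI program along these lines might exercise the same intranode shared-memory RMA path outside of Cabana. This is a hypothetical reproducer sketch, not code from the issue; it assumes a CUDA-aware MPI build and at least one GPU visible per rank:

```cuda
// Minimal sketch: small device-to-device MPI_Send/MPI_Recv between two
// ranks on the same node. Hypothetical reproducer, not part of CabanaMD.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 0; }

    // One GPU per rank, matching a typical per-node mapping.
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);

    const int n = 64;  // small message, similar in size to a halo gather
    double* buf = nullptr;
    cudaMalloc(&buf, n * sizeof(double));
    cudaMemset(buf, 0, n * sizeof(double));

    // Pass the device pointer directly to MPI, as GPU-aware codes do.
    if (rank == 0)
        MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(buf);
    if (rank == 0) printf("send/recv completed\n");
    MPI_Finalize();
    return 0;
}
```

Launched with both ranks placed on the same node, a crash here would indicate the problem is below Cabana, in the MPI/GPU transport layer.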
streeve commented 1 year ago

Thanks for the details - I'll test this out when I'm back from travel next week. It looks like I also need to manually restart the CI periodically to try to catch this type of bug.

patrickb314 commented 1 year ago

On further exploration, it's unclear that this is a CabanaMD problem. I'm seeing multiple cases where small intranode GPU-to-GPU sends crash on these systems, but I haven't yet been able to isolate the cause. I'll update as I find out more.