cornelisnetworks / opa-psm2

Fix random crash on non-GPU messages with PSM2-CUDA. #62

Closed RemiLacroix-IDRIS closed 2 years ago

RemiLacroix-IDRIS commented 2 years ago

Hello,

Since we updated to the latest version of PSM2 last week, we have had multiple reports of users getting random crashes on nodes with PSM2-CUDA installed:

r10i1n2.3262490TrioCFD_opt: CUDA failure: cuIpcOpenMemHandle() (at /home/scm/gitrepo/ifs-all/components/psm/temp.build-cuda/BUILD/libpsm2-11.2.204/ptl_am/am_cuda_memhandle_cache.c:467) returned 709
r10i1n2.3262490Error returned from CUDA function.

[r10i1n2:3262490] *** Process received signal ***
[r10i1n2:3262490] Signal: Aborted (6)
[r10i1n2:3262490] Signal code:  (-6)
[r10i1n2:3262490] [ 0] /lib64/libpthread.so.0(+0x12dd0)[0x1545f75efdd0]
[r10i1n2:3262490] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x1545f600770f]
[r10i1n2:3262490] [ 2] /lib64/libc.so.6(abort+0x127)[0x1545f5ff1b25]
[r10i1n2:3262490] [ 3] /lib64/libpsm2.so.2(+0x1d3f0)[0x1545f48eb3f0]
[r10i1n2:3262490] [ 4] /lib64/libpsm2.so.2(+0x13d9f)[0x1545f48e1d9f]
[r10i1n2:3262490] [ 5] /lib64/libpsm2.so.2(+0x11778)[0x1545f48df778]
[r10i1n2:3262490] [ 6] /lib64/libpsm2.so.2(+0x23f91)[0x1545f48f1f91]
[r10i1n2:3262490] [ 7] /lib64/libpsm2.so.2(psm2_mq_irecv2+0x6bf)[0x1545f48f287f]
[r10i1n2:3262490] [ 8] /.../spack_soft/openmpi/4.0.5/gcc-8.3.1-bqhhozyx6t5mwgqbouag3faxchj6u4y2/lib/libmpi.so.40(ompi_mtl_psm2_irecv+0xaa)[0x1545f6ce732a]
[r10i1n2:3262490] [ 9] /.../spack_soft/openmpi/4.0.5/gcc-8.3.1-bqhhozyx6t5mwgqbouag3faxchj6u4y2/lib/libmpi.so.40(+0x1e7b3b)[0x1545f6d5eb3b]
[r10i1n2:3262490] [10] /.../spack_soft/openmpi/4.0.5/gcc-8.3.1-bqhhozyx6t5mwgqbouag3faxchj6u4y2/lib/libmpi.so.40(PMPI_Recv+0x145)[0x1545f6c25955]
[r10i1n2:3262490] [11] /.../Version_beta_jean-zay-gpu/Composants/triocfd/TrioCFD_opt(_ZNK14Comm_Group_MPI4recvEiPvii+0xc3)[0x1501533]
[r10i1n2:3262490] [12] /.../Version_beta_jean-zay-gpu/Composants/triocfd/TrioCFD_opt(_Z8recevoirR7Objet_Uiii+0x99)[0x15287a9]
[r10i1n2:3262490] [13] /.../Version_beta_jean-zay-gpu/Composants/triocfd/TrioCFD_opt(_ZN5Sonde11initialiserEv+0xe28)[0x142b148]
[r10i1n2:3262490] [14] /.../Version_beta_jean-zay-gpu/Composants/triocfd/TrioCFD_opt(_ZN5Sonde9completerEv+0x541)[0x142cb61]

This seemed odd at first, because the crashes happen during MPI communications that do not involve any GPU buffers, and in some cases the PSM2_CUDA environment variable was even set to 0.

Looking more closely at the code, I think the cuda_ipc_handle_attached variable is not properly initialized, which could cause a GPU-only code path to be mistakenly taken for CPU messages: https://github.com/RemiLacroix-IDRIS/opa-psm2/blob/50d46f7fed637557c2f8ade9eea971766536c84b/ptl_am/ptl.c#L86-L92
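To illustrate the suspected failure mode, here is a minimal standalone sketch. It is not the PSM2 code: the struct `fake_req` and the helpers `setup_req_buggy`, `setup_req_fixed` and `takes_gpu_path` are hypothetical names made up for this example, and only the field name `cuda_ipc_handle_attached` comes from the real source. The stale contents of a recycled request are simulated with `memset` so that the example has defined behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a receive request object.
 * Only the field name cuda_ipc_handle_attached is taken from PSM2. */
struct fake_req {
    int    cuda_ipc_handle_attached; /* nonzero means "a CUDA IPC handle came with this message" */
    size_t len;
};

/* Simulate a recycled request slot that still holds stale bytes. */
static void make_stale(struct fake_req *req)
{
    memset(req, 0xAB, sizeof *req);
}

/* Assumed buggy setup: fills in the length but never touches the flag,
 * so the flag keeps whatever value the slot held before. */
static void setup_req_buggy(struct fake_req *req, size_t len)
{
    req->len = len;
}

/* Assumed fix: explicitly clear the flag for every new request. */
static void setup_req_fixed(struct fake_req *req, size_t len)
{
    req->len = len;
    req->cuda_ipc_handle_attached = 0;
}

/* Returns 1 if the GPU-only path (where something like
 * cuIpcOpenMemHandle() would be called) would be taken, else 0. */
static int takes_gpu_path(const struct fake_req *req)
{
    assert(req != NULL);
    return req->cuda_ipc_handle_attached != 0;
}
```

With a stale slot, `setup_req_buggy` leaves the flag nonzero, so `takes_gpu_path` returns 1 even for a plain CPU message. With `setup_req_fixed`, it returns 0. This would also be consistent with the crash showing up even when PSM2_CUDA is set to 0, if the flag is tested without first checking whether CUDA support is enabled, but that part is an assumption on my side.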

This PR seems to fix the issue according to our tests.

BrendanCunningham commented 2 years ago

Thanks. We'll test this and review as soon as possible.