cornelisnetworks / opa-psm2


remote /dev/shm failures with PSM2 #42

Closed. adrianjhpc closed this issue 5 years ago.

adrianjhpc commented 5 years ago

We're using PSM2 with Intel MPI on an Omni-Path network, and on some of our compute nodes we get failures like this:

opening remote shared memory object in shm_open: No such file or directory (err=9)
PSM could not set up shared memory segment (err=9)

MPIR_Init_thread(649).......:
MPID_Init(863)..............:
MPIDI_NM_mpi_init_hook(1202): OFI get address vector map failed
PSM could not set up shared memory segment (err=9)
Abort(1094543) on node 5 (rank 5 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:

However, it doesn't happen on all nodes.

All our nodes seem to have the same /dev/shm setup, and PSM2 is creating files in /dev/shm even on the nodes that fail.
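For comparing nodes, a hypothetical standalone check like the following (not part of PSM2 or Intel MPI) reports the size and free space of the /dev/shm tmpfs mount that PSM2's shared memory objects live on:

```c
/* Hypothetical standalone check (not part of PSM2 or Intel MPI): report the
 * size and free space of the /dev/shm tmpfs mount, which is where PSM2
 * places its shared memory objects.
 *
 * Build: gcc shm_space.c -o shm_space
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/statvfs.h>

int main(void)
{
    struct statvfs vfs;

    if (statvfs("/dev/shm", &vfs) != 0) {
        perror("statvfs(/dev/shm)");
        return EXIT_FAILURE;
    }

    unsigned long long total = (unsigned long long)vfs.f_blocks * vfs.f_frsize;
    unsigned long long avail = (unsigned long long)vfs.f_bavail * vfs.f_frsize;

    printf("/dev/shm: total %llu MiB, available %llu MiB\n",
           total >> 20, avail >> 20);
    return EXIT_SUCCESS;
}
```

Running it on a failing node and a working node and comparing the numbers would at least rule out an exhausted or undersized /dev/shm.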

Any idea what is going wrong?

mwheinz commented 5 years ago

Adrian,

Nothing occurs to me right away, but I'm going to open an internal issue and ask my team for input.

mwheinz commented 5 years ago

Could you let us know what MPI you're using, and what releases of IFS and PSM2 are installed?

adrianjhpc commented 5 years ago

libfabric version: 1.8.0a1-impi
Intel(R) MPI Library 2019 Update 3 for Linux
PSM2 11.2.77 (although that's not the system version; I'm using a newer version than is installed by default on the compute nodes).

mwheinz commented 5 years ago

So, just so you know: we haven't evaluated libfabric 1.8 yet. The last version we officially built against was 1.6.2 (although we are evaluating the newer releases now).

What distro are your compute nodes using?

adrianjhpc commented 5 years ago

CentOS Linux release 7.5.1804

mwheinz commented 5 years ago

Adrian,

Reviewing the source for PSM2, I feel like there's more diagnostic output that you haven't included. Basically, that error message appears when a process using PSM2 tries to open a shared memory file that was previously created by another process using PSM2. Both processes would be on the same compute node.

The implication is that the other process should have logged an error of its own, because it failed to create the shared memory object that this process is now failing to open. I would suggest using I_MPI_DEBUG and I_MPI_DEBUG_OUTPUT to get more information about what's happening in your app.
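For reference, here is a minimal sketch of the general POSIX create-then-attach pattern described above. It is not PSM2's actual code, and the segment name and size are made up for illustration; the point is that the attaching side's shm_open reports "No such file or directory" whenever the object was never created, or was already removed from /dev/shm, by the time it tries to open it:

```c
/* Minimal sketch of the create-then-attach pattern (not PSM2's actual code;
 * the segment name and size are hypothetical).
 *
 * Build: gcc shm_pair.c -o shm_pair -lrt
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define SEG_NAME "/example-psm2-style-segment"  /* appears under /dev/shm */
#define SEG_SIZE (1 << 20)

/* Creator side: the first local process sets up the segment. If this step
 * fails, it should log an error of its own. */
static int create_segment(void)
{
    int fd = shm_open(SEG_NAME, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0) {
        fprintf(stderr, "creating shared memory segment: %s\n", strerror(errno));
        return -1;
    }
    if (ftruncate(fd, SEG_SIZE) != 0) {
        fprintf(stderr, "sizing shared memory segment: %s\n", strerror(errno));
        close(fd);
        shm_unlink(SEG_NAME);
        return -1;
    }
    return fd;
}

/* Attacher side: a peer process on the same node opens the existing object
 * by name. If the creator never made it, or it was removed, shm_open fails
 * with ENOENT, "No such file or directory", as in the reported message. */
static int attach_segment(void)
{
    int fd = shm_open(SEG_NAME, O_RDWR, 0600);
    if (fd < 0) {
        fprintf(stderr, "opening remote shared memory object: %s\n",
                strerror(errno));
        return -1;
    }
    return fd;
}

int main(void)
{
    int cfd = create_segment();
    int afd = attach_segment();

    if (afd >= 0) {
        void *p = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
                       afd, 0);
        if (p != MAP_FAILED)
            munmap(p, SEG_SIZE);
        close(afd);
    }
    if (cfd >= 0)
        close(cfd);
    shm_unlink(SEG_NAME);  /* remove the /dev/shm entry */
    return 0;
}
```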