cornelisnetworks / opa-psm2

Shm failure with PSM2 #48

Open adrianjhpc opened 4 years ago

adrianjhpc commented 4 years ago

Running using Intel MPI and PSM2 on a dual rail Omnipath network we're getting these errors with some applications:

    Error opening remote shared memory object in shm_open: No such file or directory (err=9)
    PSM could not set up shared memory segment (err=9)

When we look in /dev/shm we see psm2_shm.295510000000000020e02 type files, but it is still failing. We've tried cleaning up /dev/shm but it does not seem to help.

We've seen this for PSM2 10.3.46, 11.2.23, 11.2.77, and 11.2.78.

Any idea what's going wrong?

mwheinz commented 4 years ago

Adrian,

Your problem doesn't ring any bells, but I've opened an internal bug report for it. Could you give me a little more info? What version of IFS are you using, and on what distro?

mwheinz commented 4 years ago

Also, which MPI are you using? If you could provide the mpirun command line, it would help us understand what might be going on.

adrianjhpc commented 4 years ago

Thanks.

CentOS Linux release 7.5.1804
Intel(R) MPI Library, Version 2019 Update 3 Build 20190214 (id: b645a4a54)

For MPI run stuff:

    export FI_PROVIDER=psm2
    export PSM2_MULTIRAIL=1
    export PSM2_MULTIRAIL_MAP=0:1,1:1
    export PSM2_MULTI_EP=1
    export PSM2_DEVICES=self,shm,hfi
    export OMP_NUM_THREADS=2
    mpirun -genvall -n 960 -ppn 48 ...

Can you "remind me" how to get the IFS version?

mwheinz commented 4 years ago

opaconfig -V should do it.

adrianjhpc commented 4 years ago

opaconfig -V reports:

10.8.0.0.204

mwheinz commented 4 years ago

Thanks. Are you using the version of PSM2 that comes packaged with Intel MPI or the upstream version?

adrianjhpc commented 4 years ago

For this it was the PSM2 with Intel MPI but we do have other versions installed on the system.

mwheinz commented 4 years ago

okay.

mwheinz commented 4 years ago

Adrian, thinking about it, we've never tested PSM2 in conjunction with OMP, and we don't provide strong protections for using PSM2 in a multi-threaded environment. Does the problem still exist if you set export OMP_NUM_THREADS=1?

adrianjhpc commented 4 years ago

I can check. I should say that we're not doing any MPI from within OpenMP regions, but I'll check nevertheless.

adrianjhpc commented 4 years ago

Using a single OpenMP thread doesn't help, I'm afraid. How would you suggest I debug the issue? Can I build PSM2 from source myself and modify the part that is failing to see what's going wrong?

Here's a stack trace of the current failure (or at least the relevant part):

    #0  pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:24
    #1  0x00002acf2dd7ab00 in psmi_amsh_short_request () from /lib64/libpsm2.so.2
    #2  0x00002acf2dd79bdf in amsh_ep_connreq_poll () from /lib64/libpsm2.so.2
    #3  0x00002acf2dd7bd4b in amsh_ep_connect () from /lib64/libpsm2.so.2
    #4  0x00002acf2dd8aed6 in psm2_ep_connect () from /lib64/libpsm2.so.2

adrianjhpc commented 4 years ago

Debugging PSM2 a bit, the error is happening in this function:

psm2_error_t psmi_shm_create(ptl_t *ptl_gen)

It's this bit of code that's failing:

    for (iterator = 0; iterator <= INT_MAX; iterator++) {
            snprintf(shmbuf,
                     sizeof(shmbuf),
                     "/psm2_shm.%ld%016lx%d",
                     (long int) getuid(),
                     epid,
                     iterator);
            dest_shmfd = shm_open(shmbuf, O_RDWR, S_IRWXU);
            if (dest_shmfd < 0) {
                    if (errno == EACCES && iterator < INT_MAX)
                            continue;
                    else {
                            err = psmi_handle_error(NULL,
                                                    PSM2_SHMEM_SEGMENT_ERR,
                                                    "Error opening remote "
                                                    "shared memory object "
                                                    "in shm_open: %s",
                                                    strerror(errno));
                            goto fail;
                    }
            }
            shmfd =
                shm_open(amsh_keyname, O_RDWR, S_IRUSR | S_IWUSR);

Where it is looking for a specific psm2 file in /dev/shm that isn't on the current host but is on a remote host (I can find it by searching all the /dev/shm on the hosts that have been used for the run). For instance, this failed on node 22 but the file it failed on (psm2_shm.2955100000000001624020) was on node 17.
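
To make the lookup described above concrete, here is a minimal C sketch, separate from the PSM2 source. It formats a segment name with the same format string as the quoted snprintf() call and then tries to open it without O_CREAT. The names build_shm_name and probe_shm_segment are hypothetical and are not PSM2 functions. POSIX shared memory names are resolved on the local node only (on Linux, under /dev/shm), so a segment that exists only on another node would be expected to fail with ENOENT, which matches the "No such file or directory" text in the reported error.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: format a segment name the way the quoted snprintf()
 * call does: "/psm2_shm." + uid + epid as 16 hex digits + iterator.
 * Returns the number of characters written, or -1 if buf is too small. */
int build_shm_name(char *buf, size_t len, uid_t uid,
                   uint64_t epid, int iterator)
{
    assert(buf != NULL);
    int n = snprintf(buf, len, "/psm2_shm.%ld%016" PRIx64 "%d",
                     (long)uid, epid, iterator);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Hypothetical helper: try to open an existing segment without O_CREAT,
 * as the quoted code does. Returns 0 on success, otherwise the errno
 * value set by shm_open(). */
int probe_shm_segment(const char *name)
{
    int fd = shm_open(name, O_RDWR, S_IRWXU);
    if (fd < 0)
        return errno;   /* ENOENT if no such object exists on this node */
    close(fd);
    return 0;
}
```

Calling probe_shm_segment() on each node with the name taken from the failing run would show directly which nodes can see the segment. On older glibc versions the program may need to be linked with -lrt.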

mwheinz commented 4 years ago

Okay, try a workaround. In your mpirun line, add

-X PSM2_DEVICES=self,hfi

That will disable the shm device. I have no explanation for why a machine would be trying to open a shared memory handle on a different machine.

adrianjhpc commented 4 years ago

It definitely works if we disable the shm device, we're just trying to get shm to work for better performance.

mwheinz commented 4 years ago

> It definitely works if we disable the shm device, we're just trying to get shm to work for better performance.

Adrian, I know it's been 10 days, I just wanted to let you know we are looking at this.

mwheinz commented 4 years ago

We have some ideas, but we were wondering if you could try adding the following to a test run:

-x PSM2_TRACEMASK=0x40 -x HFI_DEBUG_FILENAME="/tmp/%h.%p.out"

This will generate a ton of output in the .out files, but the contents should tell us if different machines are really trying to communicate over SHM.

adrianjhpc commented 4 years ago

Thanks for looking into this. I'll try that out and let you know what it produces.

adrianjhpc commented 3 years ago

I appreciate some time has passed, but I have had some time to get back and play with PSM2 to find out where the problem is occurring.

I've isolated it (with an OpenMPI application using PSM2) to the function psmi_shm_map_remote in the file ptl_am/am_reqrep_shmem.c. (note this was playing with PSM2 11.2.78).

The shm file opening completes correctly, i.e. this works without any error:

            dest_shmfd = shm_open(shmbuf, O_RDWR|O_CREAT|O_TRUNC, S_IRWXU);

The mmap also works, i.e. this works without any error:

    dest_mapptr = mmap(NULL, segsz,  PROT_READ | PROT_WRITE, MAP_SHARED, dest_shmfd, 0);
    dest_nodeinfo = (struct am_ctl_nodeinfo *)dest_mapptr;

However, any attempt to dereference elements of dest_nodeinfo throws the error, i.e. this is the first place in the function where this happens and the program crashes:

    volatile uint16_t *is_init = &dest_nodeinfo->is_init;

Does this provide any pointers (apologies for the pun) on what's going wrong?
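
One thing that may be worth ruling out, offered as an assumption rather than a diagnosis confirmed in this thread: a shm_open() call with O_CREAT|O_TRUNC creates an empty object if none exists and truncates an existing one to zero length. mmap() can still succeed on such an object, but touching pages beyond the end of the object raises SIGBUS, which would look like a crash on the first dereference. The sketch below checks the object size with fstat() before the mapping is used. The names shm_object_covers and demo_size_check are hypothetical and do not come from PSM2.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the object behind fd is at least
 * `needed` bytes long, 0 if it is shorter, -1 if fstat() fails. */
int shm_object_covers(int fd, size_t needed)
{
    struct stat st;

    assert(needed > 0);   /* a zero-byte requirement would check nothing */
    if (fstat(fd, &st) != 0)
        return -1;
    if (st.st_size < 0)
        return 0;
    return ((uintmax_t)st.st_size >= (uintmax_t)needed) ? 1 : 0;
}

/* Usage sketch on an ordinary temporary file, which fstat() reports on in
 * the same way: an empty object does not cover one 4096-byte page, and
 * after ftruncate() it does. Returns 0 if both checks behave as expected. */
int demo_size_check(void)
{
    FILE *f = tmpfile();
    if (f == NULL)
        return -1;
    int fd = fileno(f);
    int before = shm_object_covers(fd, 4096);
    int grown = ftruncate(fd, 4096);
    int after = shm_object_covers(fd, 4096);
    fclose(f);
    return (before == 0 && grown == 0 && after == 1) ? 0 : -1;
}
```

In the failing function, the equivalent check would be shm_object_covers(dest_shmfd, segsz) placed between the mmap() call and the first access to dest_nodeinfo.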

BrendanCunningham commented 3 years ago

Thanks for the update. Two ideas come to mind:

  1. PSM2_MULTIRAIL=1 is somehow causing this bug.
  2. Shared memory region is being removed out from under remote mapping process after mmap() succeeds but before dest_nodeinfo dereference.

We have some follow-up questions/requests:

  1. How reliably can you reproduce this issue?
  2. Can you share or point us to a reproducer?
  3. Can you run your job with '-x PSM2_TRACEMASK=0x40 -x HFI_DEBUG_FILENAME="/tmp/%h.%p.out"' and provide that output from a failing run?
  4. Can you run your job with '-x PSM2_MULTIRAIL=0' and report if it fails with the same/similar failure?

adrianjhpc commented 3 years ago

Thanks for the response.

  1. I can reproduce it reliably
  2. I can package up a reproducer if that's useful.
  3. I've attached the output of this
  4. Setting multi rail to 0 doesn't fix the issue.

psm2_debug.txt

jtfrey commented 2 years ago

I'm getting this same error, but ONLY when using MPI_Comm_spawn().

Open MPI 4.1.2
libpsm2-10.3.35-1.x86_64