cornelisnetworks / opa-psm2

intermittent data alignment failure on PSM2 for MPI_Recv #64

Closed bartoldeman closed 1 year ago

bartoldeman commented 2 years ago

The following program, also attached:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
   int i, p, rank, numprocs, iter;
   int ncoord = 86599;
   int maxiter = 100000;
   unsigned char *sendbuf, *recvbuf;
   MPI_Status status;

   MPI_Init(NULL, NULL);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

   sendbuf = malloc(ncoord);
   recvbuf = malloc(numprocs*ncoord);

   /* monotonically increasing modulo 256 */
   for (i = 0; i < ncoord; i++)
     sendbuf[i] = i%256;

   /* gather sendbuf array from each process into recvbuf array */
   for (iter = 1; iter <= maxiter; iter++) {
     if (rank > 0)
       MPI_Send(sendbuf, ncoord, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
     else {
       for (p = 1; p < numprocs; p++) {
         MPI_Recv(&recvbuf[p*ncoord], ncoord, MPI_BYTE, p, 0, MPI_COMM_WORLD, &status);
         for (i = 1; i < ncoord; i++) {
           int pos = p*ncoord+i;
           if (recvbuf[pos] != i%256) {
             fprintf(stderr, "unexpected byte received at address %p, index %d:\n"
                     "expected %d but received %d (prev %d) from process %d iter %d:\n",
                     &recvbuf[pos], i, i%256, recvbuf[pos], recvbuf[pos-1], p, iter);
             MPI_Abort(MPI_COMM_WORLD, -1);
           }
         }
       }
     }
   }
   MPI_Finalize();
}

occasionally fails on a large Omni-Path cluster (Cedar) with all Open MPI versions we tested (2.1.1, 3.1.2, 4.0.3, 4.1.1) and with Intel MPI 2021.2.0. The only configuration in which it never seems to trigger is when libfabric 1.12.1 is used as an intermediary via Open MPI 4.1.1's OFI MTL.

If it fails, we get output such as:

unexpected byte received at address 0x2b06a717effd, index 32764:
expected 252 but received 255 (prev 251) from process 7 iter 7:

so index 32764 has received what should have been received at index 32767 (and the following bytes are shifted similarly), suggesting that an offset was rounded down (32764 is 32767 rounded down to a multiple of 4).

This issue doesn't trigger with shm, only with hfi, and only for larger messages (i.e. the rendezvous protocol). I should translate this into a program that uses the PSM2 API directly, but if anyone has a hint from this program already, please let me know!

Here's a slurm submission script which triggers it (but not always!):

#!/bin/sh
#SBATCH --nodes=4
#SBATCH --ntasks=16
#SBATCH --mem=4G
#SBATCH --time=00-00:10

mpirun ../unalignedrecv

unalignedrecv.c.txt

This issue relates to https://github.com/mpi4py/mpi4py/issues/186 (the C program simulates MPI_Gather).
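
For reference, the same exchange expressed directly with MPI_Gather would look roughly like the sketch below. This is illustrative only; the reproducer above deliberately keeps the explicit send/recv loop so the failing MPI_Recv is visible.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(void) {
   int rank, numprocs;
   int ncoord = 86599, maxiter = 100000;

   MPI_Init(NULL, NULL);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

   unsigned char *sendbuf = malloc(ncoord);
   unsigned char *recvbuf = malloc((size_t)numprocs * ncoord);

   for (int i = 0; i < ncoord; i++)
      sendbuf[i] = i % 256;

   for (int iter = 1; iter <= maxiter; iter++) {
      /* every rank contributes ncoord bytes; rank 0 collects them all */
      MPI_Gather(sendbuf, ncoord, MPI_BYTE,
                 recvbuf, ncoord, MPI_BYTE, 0, MPI_COMM_WORLD);
      if (rank == 0) {
         for (int p = 1; p < numprocs; p++)
            for (int i = 1; i < ncoord; i++)
               if (recvbuf[p * ncoord + i] != i % 256) {
                  fprintf(stderr, "corruption at rank %d, index %d, iter %d\n",
                          p, i, iter);
                  MPI_Abort(MPI_COMM_WORLD, -1);
               }
      }
   }

   free(sendbuf);
   free(recvbuf);
   MPI_Finalize();
   return 0;
}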

ddalessa commented 2 years ago

Thanks for the problem report. Someone will look into this.

bartoldeman commented 2 years ago

Note that I managed to narrow it down to direct use of PSM2, eliminating Open MPI, using just two cores on two different nodes. Example job script:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --mem=4G
#SBATCH --time=00-00:10
#SBATCH --job-name=test
mpirun -n 2 ./psm2-demo

This is a slightly modified version of the demo program from the documentation, where the receive buffer is unaligned and the exchange is retried many times. psm2-demo.c.txt

mv psm2-demo.c.txt psm2-demo.c
gcc -lpsm2 psm2-demo.c -o psm2-demo

If it fails, we get messages such as:

...
PSM2 MQ init done.
PSM2 MQ send() done.
unexpected byte received at address 0x7ffd512757ff, index 42844:
expected 92 but received 93 (prev 91) from process 1 iter 22966:
unexpected byte received at address 0x7ffd51275800, index 42845:
expected 93 but received 94 (prev 93) from process 1 iter 22966:
unexpected byte received at address 0x7ffd51275801, index 42846:
expected 94 but received 95 (prev 94) from process 1 iter 22966:

Sometimes the received results are shifted by 1 byte, sometimes by 3.

bartoldeman commented 2 years ago

Update: I found the reason why the failure never seems to trigger when libfabric 1.12.1 is used as an intermediary via Open MPI 4.1.1's OFI MTL: that libfabric build includes the psm3 provider, which exports all psm2 symbols internally inside libfabric.so (from prov/psm3/psm3 in the libfabric source tree).

So we ended up using the Ethernet network (as psm3 uses RoCE), which, while it avoided the failure, wasn't quite what we wanted on an Omni-Path cluster!

BrendanCunningham commented 2 years ago

Update: I found the reason why the failure never seems to trigger when libfabric 1.12.1 is used as an intermediary via Open MPI 4.1.1's OFI MTL: that libfabric build includes the psm3 provider, which exports all psm2 symbols internally inside libfabric.so (from prov/psm3/psm3 in the libfabric source tree).

Where did you get libfabric 1.12.1? Is it the libfabric that comes bundled with Cornelis' IFS/OPXS package?

When you ran with (OMPI 4.1.1, libfabric 1.12.1), did you set the FI_PROVIDER environment variable? You shouldn't have to, but I'm curious whether it still uses PSM3/RoCE if you set FI_PROVIDER=psm2 in your job environment.

bartoldeman commented 2 years ago

We used upstream libfabric. There was a bug, introduced in 1.12.0 and fixed in 1.15.0, that caused this symbol problem: even if we set FI_PROVIDER=psm2 it would use psm3. 1.15.0 renamed all internal symbols to psm3*, so it is OK again.

With 1.15.1 it now defaults to the new opx provider, which surprised me since the documentation says it's beta; FI_PROVIDER=psm2 does the right thing there, though.

See https://github.com/ofiwg/libfabric/issues/7796 for more details.

BrendanCunningham commented 2 years ago

Note that I managed to narrow it down to direct use of PSM2, eliminating Open MPI, using just two cores on two different nodes. [...] This is a slightly modified version of the demo program from the documentation, where the receive buffer is unaligned and the exchange is retried many times. psm2-demo.c.txt

Is the psm2-demo.c.txt reproducer code correct? It looks like your reproducer does the following in each test iteration:

  1. Client initializes every element of msgbuf to msgbuf[i] = i % 256 and sends BUFFER_LENGTH bytes of msgbuf to the server.
  2. Server receives BUFFER_LENGTH bytes from client into msgbuf.
  3. Server verifies contents of msgbuf like so:
for (int pos = 1; pos < BUFFER_LENGTH; pos++) {
        if (msgbuf[pos + 3] != pos % 256) {
                fprintf(stderr,
                        "unexpected byte received at address %p, index %d:\n"
                        "expected %d but received %d (prev %d) iter %d:\n",
                        &msgbuf[pos + 3], pos,
                        pos % 256, msgbuf[pos + 3],
                        msgbuf[pos + 2], iter);
        }
}

Where BUFFER_LENGTH is defined as:

#define BUFFER_LENGTH 86599 + 3

And msgbuf is defined in main() as:

unsigned char msgbuf[BUFFER_LENGTH];

Two things about this strike me as off:

  1. How can msgbuf[pos + 3] != pos % 256 ever be false?
  2. pos goes from 1 to BUFFER_LENGTH - 1 but code accesses msgbuf[pos + 3], so that'll read past the end of msgbuf.
bartoldeman commented 2 years ago

My apologies, I introduced a last-minute bug while cleaning up the code. I'm attaching a fixed version with these changes, which have the effect of making msgbuf an unaligned pointer:

--- psm2-demo.c~        2022-05-19 17:03:59.000000000 -0700
+++ psm2-demo.c 2022-06-02 07:28:41.000000000 -0700
@@ -16,7 +16,7 @@
 #include <string.h>
 #include <errno.h>
 #include <fcntl.h>
-#define BUFFER_LENGTH 86599+3
+#define BUFFER_LENGTH 86599
 #define CONNECT_ARRAY_SIZE 8

 void die(char *msg, int rc){
@@ -67,7 +67,8 @@
        int rc;
        int ver_major = PSM2_VERNO_MAJOR;
        int ver_minor = PSM2_VERNO_MINOR;
-       unsigned char msgbuf[BUFFER_LENGTH];
+       unsigned char msgbufbase[BUFFER_LENGTH+3];
+       unsigned char *msgbuf = &msgbufbase[3];
        psm2_mq_t q;
        psm2_mq_req_t req_mq;
        int is_server = 0;
@@ -176,10 +177,10 @@
                  die("couldn't wait for the irecv", rc);
                }
                for (int pos = 1; pos < BUFFER_LENGTH; pos++) {
-                 if (msgbuf[pos+3] != pos%256) {
+                 if (msgbuf[pos] != pos%256) {
                    fprintf(stderr, "unexpected byte received at address %p, index %d:\n"
                            "expected %d but received %d (prev %d) iter %d:\n",
-                           &msgbuf[pos+3], pos, pos%256, msgbuf[pos+3], msgbuf[pos+2], iter);
+                           &msgbuf[pos], pos, pos%256, msgbuf[pos], msgbuf[pos-1], iter);
                  }
                }
          } else {

psm2-demo.c.txt
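
For reference, a rough sketch of what the fixed receive-and-verify core does, assuming an MQ and a connection already set up as in the documentation's demo; RECV_TAG and the recv_and_verify helper are illustrative names only, and the actual working code is in the attached psm2-demo.c.txt.

#include <stdint.h>
#include <stdio.h>
#include <psm2.h>
#include <psm2_mq.h>

#define BUFFER_LENGTH 86599
#define RECV_TAG 0xABCD  /* placeholder; must match the tag used by the sender */

/* hypothetical helper: mq and the peer connection are assumed to be
   established already, exactly as in the documentation's demo program */
static void recv_and_verify(psm2_mq_t mq, int iter) {
        /* the receive buffer deliberately starts 3 bytes past the base,
           so it is not 4-byte aligned */
        static unsigned char msgbufbase[BUFFER_LENGTH + 3];
        unsigned char *msgbuf = &msgbufbase[3];
        psm2_mq_req_t req_mq;
        int rc;

        /* post the receive with an exact tag match, then wait for completion */
        rc = psm2_mq_irecv(mq, RECV_TAG, (uint64_t)-1, 0,
                           msgbuf, BUFFER_LENGTH, NULL, &req_mq);
        if (rc != PSM2_OK)
                fprintf(stderr, "irecv failed: %d\n", rc);  /* die() in the real program */
        rc = psm2_mq_wait(&req_mq, NULL);
        if (rc != PSM2_OK)
                fprintf(stderr, "wait failed: %d\n", rc);   /* die() in the real program */

        /* the sender fills byte i with i % 256 */
        for (int pos = 1; pos < BUFFER_LENGTH; pos++) {
                if (msgbuf[pos] != pos % 256)
                        fprintf(stderr, "unexpected byte received at address %p, index %d:\n"
                                "expected %d but received %d (prev %d) iter %d:\n",
                                (void *)&msgbuf[pos], pos, pos % 256,
                                msgbuf[pos], msgbuf[pos - 1], iter);
        }
}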

I ran it again and got messages such as:

unexpected byte received at address 0x7ffd64093fff, index 54748:
expected 220 but received 221 (prev 219) iter 644539:
unexpected byte received at address 0x7ffd64094000, index 54749:
expected 221 but received 222 (prev 221) iter 644539:

and so on, which means that in this run more than 600,000 messages were received correctly before the corruption first appeared, at iteration 644539.

BrendanCunningham commented 2 years ago

@bartoldeman thanks for the updated psm2-demo reproducer.

I was not able to reproduce the error in 50 runs of psm2-demo on 2 nodes.

I have some questions I hope will help debug this issue:

  1. Which version of PSM2 are you using?
  2. How did you get PSM2 (IFS/OPXS install, distro, source build)?
    1. If you built PSM2 from source, how did you build it (compiler version, custom CFLAGS)?
  3. Which distro, kernel are you using?
  4. CPU/system board/server model?
  5. Are the OPA HFIs discrete HFIs or integrated (Xeon Phi, Skylake-F)?

Thanks.

bartoldeman commented 2 years ago

@bartoldeman thanks for the updated psm2-demo reproducer.

I was not able to reproduce the error in 50 runs of psm2-demo on 2 nodes.

I believe it may only happen when there are multiple runs (i.e. multiple processes using the same OPA card) at the same time; at least, it seems to happen here only on a busy cluster. I'll try on empty nodes to figure out how to reproduce it there. That said, I'll answer your questions.

I have some questions I hope will help debug this issue:

1. Which version of PSM2 are you using? 
2. How did you get PSM2 (IFS/OPXS install, distro, source build)?

I've tried several different versions, compiled with various versions of GCC. To level the playing field, however, I downloaded https://downloads.linux.hpe.com/SDR/repo/intel_opa/ifs/redhat/7.8/x86_64/10.11.1.3.1/libpsm2-11.2.228-1.x86_64.rpm, used the libpsm2.so from there, and could still reproduce the issue.

3. Which distro, kernel are you using?
CentOS Linux release 7.9.2009 (Core)
Linux 3.10.0-1160.53.1.el7.x86_64 #1 SMP Fri Jan 14 13:59:45 UTC 2022 x86_64 GNU/Linux

some relevant info from modinfo hfi1:

filename:       /lib/modules/3.10.0-1160.53.1.el7.x86_64/extra/ifs-kernel-updates/hfi1.ko.xz
version:        10.11.0.1
description:    Intel Omni-Path Architecture driver
license:        Dual BSD/GPL
firmware:       hfi1_pcie.fw
firmware:       hfi1_sbus.fw
firmware:       hfi1_fabric.fw
firmware:       hfi1_dc8051.fw
retpoline:      Y
rhelversion:    7.9
srcversion:     0B7253581F7A7372A6FD8F1
alias:          pci:v00008086d000024F1sv*sd*bc*sc*i*
alias:          pci:v00008086d000024F0sv*sd*bc*sc*i*
depends:        rdmavt,ib_core,i2c-algo-bit
vermagic:       3.10.0-1160.53.1.el7.x86_64 SMP mod_unload modversions 
4. CPU/system board/server model?

Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz (Broadwell) Dell PowerEdge C6320: https://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/Dell-PowerEdge-C6320-Spec-Sheet.pdf

5. Are the OPA HFIs discrete HFIs or integrated (Xeon Phi, Skylake-F)?

Discrete (consistent with the spec sheet above): Intel Corporation Omni-Path HFI Silicon 100 Series.

bartoldeman commented 2 years ago

I was able to reproduce it on two otherwise idle nodes with 32 cores each using this script, where CLIENTNODE obviously needs to be adjusted to the name of the other node.

#!/bin/sh
CLIENTNODE=cdr808
for i in $(seq 16); do
  mkdir -p $i
  cd $i
  rm -f *
  ../psm2-demo -s &
  ssh $CLIENTNODE "cd $PWD && ../psm2-demo" &
  cd ..
done
wait

Using fewer than 16 (in the seq) concurrent processes hasn't triggered the issue so far; 16 or more did.

BrendanCunningham commented 2 years ago

Thanks. How many sockets do these servers have?

And the hfi1 module:

filename:       /lib/modules/3.10.0-1160.53.1.el7.x86_64/extra/ifs-kernel-updates/hfi1.ko.xz
version:        10.11.0.1

This came from an IFS/OPXS 10.11.0.1 install?

bartoldeman commented 2 years ago

Thanks. How many sockets do these servers have?

2 sockets (2x16 cores on the ones I tested)

And the hfi1 module:

filename:       /lib/modules/3.10.0-1160.53.1.el7.x86_64/extra/ifs-kernel-updates/hfi1.ko.xz
version:        10.11.0.1

This came from an IFS/OPXS 10.11.0.1 install?

Yes; according to the system administrator, the only alterations were to xz-compress the driver module and strip the debug symbols to save base image space.

BrendanCunningham commented 2 years ago

@bartoldeman Thank you for all of the information so far.

Unfortunately I haven’t been able to reproduce the issue despite configuring systems with identical versions and running the reproducer hundreds of times, both as described and with other settings.

I think we need to have a debug session or call on your systems to make progress on this problem. To proceed I need to have Cornelis Customer Support engaged. The best way to do this is to send an email referring to this issue to support@cornelisnetworks.com, including your contact information, and we will set up a call.

We should continue to communicate through this GitHub issue as well, though (e.g. the questions below).

I have some questions to try to narrow down the paths in our software stack this problem occurs in.

  1. How easily can you reproduce the issue, i.e. how many runs does it take to see one occurrence of the issue?
  2. Does this issue occur with?:
    1. (PIO send, eager receive)
      1. Can test by setting PSM2_MQ_EAGER_SDMA_SZ=1048576 PSM2_MQ_RNDV_HFI_THRESH=1048576 in process/job environment.
    2. (SDMA send, eager receive)
      1. Can test by setting PSM2_MQ_RNDV_HFI_THRESH=1048576 in process/job environment.
  3. Does the issue occur with?:
    1. (aligned send buffer, aligned receive buffer)
    2. (aligned send buffer, unaligned receive buffer)
    3. (unaligned send buffer, aligned receive buffer)
BrendanCunningham commented 1 year ago

@bartoldeman is this still a problem?

bartoldeman commented 1 year ago

Hi @BrendanCunningham

Thanks for the heads-up. Yes, the problem still occurs, but due to holidays I haven't been able to spend much time on it recently. I'll answer your questions here this week but will coordinate with Cedar's site lead (Martin Siegert) to communicate with Cornelis Customer Support.

bartoldeman commented 1 year ago

Answers to your questions, @BrendanCunningham:

  • How easily can you reproduce the issue, i.e. how many runs does it take to see one occurrence of the issue? Strangely, it seems to depend on the general state of the node. Earlier today I reserved two whole nodes and it triggered on the first run, but now with the same server (receiver) node and a different client (sender) it doesn't happen. I'll get back to you if I can figure out why.

Edit: I triggered it on the new set of nodes after 5 runs.

  • Does this issue occur with?:

    1. (PIO send, eager receive)
      1. Can test by setting PSM2_MQ_EAGER_SDMA_SZ=1048576 PSM2_MQ_RNDV_HFI_THRESH=1048576 in process/job environment.
    2. (SDMA send, eager receive)

      1. Can test by setting PSM2_MQ_RNDV_HFI_THRESH=1048576 in process/job environment.

There is no issue with either of those settings; it needs to be a rendezvous receive to trigger.

  • Does the issue occur with?:

    1. (aligned send buffer, aligned receive buffer) no
    2. (aligned send buffer, unaligned receive buffer) yes
    3. (unaligned send buffer, aligned receive buffer) no
BrendanCunningham commented 1 year ago

@bartoldeman Thanks. That seems to narrow the problem down to the PSM2 expected-receive path.

I'm analyzing the code now to debug the problem further (and fix it!). I'll post an update when I have more information and questions.

bartoldeman commented 1 year ago

Hi @BrendanCunningham

When I spent some time trying to follow the code myself, I did see some code that may be related here:

https://github.com/cornelisnetworks/opa-psm2/blob/7cdb4e96ac2429f48f0091ddd1bba1c13925a75c/ptl_ips/ips_proto_expected.c#L1934

i.e. it plays some games with alignment, and my suspicion is that when something unexpected happens, tsess_unaligned_start isn't taken into account properly somewhere (but I have no idea where!).
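
To illustrate the general technique being referred to (this is only a sketch of the idea, not the actual PSM2 code, and copy_with_alignment_split is an illustrative name): a transfer to a destination that is not 4-byte aligned is typically split into an unaligned head, an aligned bulk, and a tail, and the head length, which appears to be what tsess_unaligned_start tracks, has to be accounted for consistently on both sides. An off-by-a-few-bytes error there would shift the data exactly as observed.

#include <stdint.h>
#include <string.h>

/* illustrative only: split a copy into an unaligned head (0-3 bytes),
   a bulk of whole 4-byte words, and a trailing remainder */
static void copy_with_alignment_split(unsigned char *dst,
                                      const unsigned char *src, size_t len) {
        size_t head = (4 - ((uintptr_t)dst & 3)) & 3;   /* bytes to reach a 4B boundary */
        if (head > len)
                head = len;
        size_t body = (len - head) & ~(size_t)3;        /* whole 4-byte words */
        size_t tail = len - head - body;

        memcpy(dst, src, head);                              /* unaligned start */
        memcpy(dst + head, src + head, body);                /* aligned bulk */
        memcpy(dst + head + body, src + head + body, tail);  /* remainder */
        /* if the receiver failed to add `head` to its offsets, the payload
           would land 1-3 bytes too early, matching the symptom in this issue */
}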

Do you still want access to the system? I've been in touch with the system administrators and I can sponsor you for an account, then the admins can give you a special reservation once it's ready. If so, I'll send the email to support.

BrendanCunningham commented 1 year ago

Hi @BrendanCunningham

When I spent some time trying to follow the code myself, I did see some code that may be related here:

https://github.com/cornelisnetworks/opa-psm2/blob/7cdb4e96ac2429f48f0091ddd1bba1c13925a75c/ptl_ips/ips_proto_expected.c#L1934

i.e. it plays some games with alignment, and my suspicion is that when something unexpected happens, tsess_unaligned_start isn't taken into account properly somewhere (but I have no idea where!).

Yes, the expected receive works on 4B or 64B offsets with paths to handle unaligned start/end. That is my suspicion as well and what I'm looking into.

Do you still want access to the system? I've been in touch with the system administrators and I can sponsor you for an account, then the admins can give you a special reservation once it's ready. If so, I'll send the email to support.

Yes, please get me access so I can try/debug on your system. Thanks.

BrendanCunningham commented 1 year ago

@bartoldeman I have identified the root cause and developed a fix.

Having identified the root cause on your systems, I am now able to reproduce the problem on our systems.

I have validated this fix against the reproducer that you provided on both your systems and ours.

Please pull and build the issue-64-Psm2UnalignedRecvFix branch, try it, and report whether it works for you. If possible, please try it with your original mpi4py application as well.

If the fix also works for you, we'll merge it into opa-psm2/master.

bartoldeman commented 1 year ago

Thanks! I will test it today and tomorrow and let you know.

bartoldeman commented 1 year ago

I can confirm that this fixes the psm2-demo test case as well as the original mpi4py test case with pickled MPI_Allgather on two and four nodes.

Thanks again for fixing this tough bug!