Closed bartoldeman closed 1 year ago
Thanks for the problem report. Someone will look into this.
Note that I managed to narrow it down to direct use of PSM2, eliminating Open MPI on just two cores on two different nodes, example job script:
```
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --mem=4G
#SBATCH --time=00-00:10
#SBATCH --job-name=test
mpirun -n 2 ./psm2-demo
```
This is a slightly modified version of the demo program from the documentation, where the receive buffer is unaligned and it is retried many times: psm2-demo.c.txt
```
mv psm2-demo.c.txt psm2-demo.c
gcc -lpsm2 psm2-demo.c -o psm2-demo
```
If it fails, we get messages such as:

```
...
PSM2 MQ init done.
PSM2 MQ send() done.
unexpected byte received at address 0x7ffd512757ff, index 42844:
expected 92 but received 93 (prev 91) from process 1 iter 22966:
unexpected byte received at address 0x7ffd51275800, index 42845:
expected 93 but received 94 (prev 93) from process 1 iter 22966:
unexpected byte received at address 0x7ffd51275801, index 42846:
expected 94 but received 95 (prev 94) from process 1 iter 22966:
```
Sometimes the received results are shifted by 1 byte, sometimes by 3.
Update: I found out the reason for "Only if libfabric 1.12.1 is used as an intermediate via Open MPI's 4.1.1 OFI mtl it never seems to trigger.": this libfabric included the psm3 provider which provides all psm2 symbols internally inside libfabric.so (from prov/psm3/psm3 in the libfabric source code).
So we ended up using the Ethernet network (as psm3 uses RoCE), which, while fixing the failure, wasn't quite what we wanted on an Omni-Path cluster!
> Update: I found out the reason for "Only if libfabric 1.12.1 is used as an intermediate via Open MPI's 4.1.1 OFI mtl it never seems to trigger.": this libfabric included the psm3 provider which provides all psm2 symbols internally inside libfabric.so (from prov/psm3/psm3 in the libfabric source code).
Where did you get libfabric 1.12.1? Is it the libfabric that comes bundled with Cornelis' IFS/OPXS package?
When you ran with (OMPI 4.1.1, libfabric 1.12.1), did you set the FI_PROVIDER environment variable? You shouldn't have to, but I'm curious whether it still uses PSM3/RoCE if you set FI_PROVIDER=psm2 in your job environment.
We used upstream libfabric. There was a bug, introduced in 1.12.0 and fixed in 1.15.0, that caused this symbol problem: even if we set FI_PROVIDER=psm2 it would use psm3. 1.15.0 renamed all internal symbols to psm3* so it is OK again.
With 1.15.1 it now defaults to the new opx provider, which was surprising to me as the documentation says it's beta; FI_PROVIDER=psm2 does the right thing there, though.
See https://github.com/ofiwg/libfabric/issues/7796 for more details.
> Note that I managed to narrow it down to direct use of PSM2, eliminating Open MPI on just two cores on two different nodes, example job script:
>
> ```
> #!/bin/bash
> #SBATCH --nodes=2
> #SBATCH --tasks-per-node=1
> #SBATCH --mem=4G
> #SBATCH --time=00-00:10
> #SBATCH --job-name=test
> mpirun -n 2 ./psm2-demo
> ```
>
> This is a slightly modified version of the demo program from the documentation, where the receive buffer is unaligned and it is retried many times: psm2-demo.c.txt
>
> ```
> mv psm2-demo.c.txt psm2-demo.c
> gcc -lpsm2 psm2-demo.c -o psm2-demo
> ```
>
> If it fails, we get messages such as:
>
> ```
> ...
> PSM2 MQ init done.
> PSM2 MQ send() done.
> unexpected byte received at address 0x7ffd512757ff, index 42844:
> expected 92 but received 93 (prev 91) from process 1 iter 22966:
> unexpected byte received at address 0x7ffd51275800, index 42845:
> expected 93 but received 94 (prev 93) from process 1 iter 22966:
> unexpected byte received at address 0x7ffd51275801, index 42846:
> expected 94 but received 95 (prev 94) from process 1 iter 22966:
> ```
>
> Sometimes the received results are shifted by 1 byte, sometimes by 3.
Is the psm2-demo.c.txt reproducer code correct? It looks like your reproducer does the following in each test iteration:

The client sets `msgbuf[i] = i % 256` and sends `BUFFER_LENGTH` bytes of `msgbuf` to the server. The server receives `BUFFER_LENGTH` bytes from the client into `msgbuf`, then checks `msgbuf` like so:

```c
182   for (int pos = 1; pos < BUFFER_LENGTH; pos++) {
183     if (msgbuf[pos + 3] != pos % 256) {
184       fprintf(stderr,
185               "unexpected byte received at address %p, index %d:\n"
186               "expected %d but received %d (prev %d) iter %d:\n",
187               &msgbuf[pos + 3], pos,
188               pos % 256, msgbuf[pos + 3],
189               msgbuf[pos + 2], iter);
190     }
191   }
```

Where `BUFFER_LENGTH` is defined as:

```c
19 #define BUFFER_LENGTH 86599 + 3
```

And `msgbuf` is defined in `main()` as:

```c
74 unsigned char msgbuf[BUFFER_LENGTH];
```

Two things about this strike me as off:

1. Can `msgbuf[pos + 3] != pos % 256` ever be false?
2. `pos` goes from 1 to `BUFFER_LENGTH - 1`, but the code accesses `msgbuf[pos + 3]`, so that'll read past the end of `msgbuf`.
My apologies, I introduced a last-minute bug while cleaning up the code. I'm attaching a fixed version with these changes, which have the effect of making `msgbuf` an unaligned pointer:
```diff
--- psm2-demo.c~	2022-05-19 17:03:59.000000000 -0700
+++ psm2-demo.c	2022-06-02 07:28:41.000000000 -0700
@@ -16,7 +16,7 @@
 #include <string.h>
 #include <errno.h>
 #include <fcntl.h>
-#define BUFFER_LENGTH 86599+3
+#define BUFFER_LENGTH 86599
 #define CONNECT_ARRAY_SIZE 8
 void die(char *msg, int rc){
@@ -67,7 +67,8 @@
   int rc;
   int ver_major = PSM2_VERNO_MAJOR;
   int ver_minor = PSM2_VERNO_MINOR;
-  unsigned char msgbuf[BUFFER_LENGTH];
+  unsigned char msgbufbase[BUFFER_LENGTH+3];
+  unsigned char *msgbuf = &msgbufbase[3];
   psm2_mq_t q;
   psm2_mq_req_t req_mq;
   int is_server = 0;
@@ -176,10 +177,10 @@
       die("couldn't wait for the irecv", rc);
     }
     for (int pos = 1; pos < BUFFER_LENGTH; pos++) {
-      if (msgbuf[pos+3] != pos%256) {
+      if (msgbuf[pos] != pos%256) {
         fprintf(stderr, "unexpected byte received at address %p, index %d:\n"
                         "expected %d but received %d (prev %d) iter %d:\n",
-                &msgbuf[pos+3], pos, pos%256, msgbuf[pos+3], msgbuf[pos+2], iter);
+                &msgbuf[pos], pos, pos%256, msgbuf[pos], msgbuf[pos-1], iter);
       }
     }
   } else {
```
I ran it again, and get messages such as

```
unexpected byte received at address 0x7ffd64093fff, index 54748:
expected 220 but received 221 (prev 219) iter 644539:
unexpected byte received at address 0x7ffd64094000, index 54749:
expected 221 but received 222 (prev 221) iter 644539:
```

and so on, which means that in this run the first 644539 messages were received correctly and only then did corruption appear.
@bartoldeman thanks for the updated psm2-demo reproducer.
I was not able to reproduce the error in 50 runs of psm2-demo on 2 nodes.
I have some questions I hope will help debug this issue:

1. Which version of PSM2 are you using?
2. How did you get PSM2 (IFS/OPXS install, distro, source build)?
3. Which distro, kernel are you using?
4. CPU/system board/server model?
5. Are the OPA HFIs discrete HFIs or integrated (Xeon Phi, Skylake-F)?

Thanks.
> @bartoldeman thanks for the updated psm2-demo reproducer.
> I was not able to reproduce the error in 50 runs of psm2-demo on 2 nodes.
I believe it may only happen if there are multiple runs (i.e. multiple processes using the same OPA card) at the same time; at least it seems to happen here only on a busy cluster. I'll try on empty nodes to figure out how to reproduce it there. That said, I'll answer your questions.
> I have some questions I hope will help debug this issue:
>
> 1. Which version of PSM2 are you using?
> 2. How did you get PSM2 (IFS/OPXS install, distro, source build)?
I've tried with some different versions, compiled with various versions of GCC. To level the playing field however I've downloaded https://downloads.linux.hpe.com/SDR/repo/intel_opa/ifs/redhat/7.8/x86_64/10.11.1.3.1/libpsm2-11.2.228-1.x86_64.rpm and used the libpsm2.so from there, and could reproduce the issue.
> 3. Which distro, kernel are you using?
```
CentOS Linux release 7.9.2009 (Core)
Linux 3.10.0-1160.53.1.el7.x86_64 #1 SMP Fri Jan 14 13:59:45 UTC 2022 x86_64 GNU/Linux
```
Some relevant info from `modinfo hfi1`:

```
filename:       /lib/modules/3.10.0-1160.53.1.el7.x86_64/extra/ifs-kernel-updates/hfi1.ko.xz
version:        10.11.0.1
description:    Intel Omni-Path Architecture driver
license:        Dual BSD/GPL
firmware:       hfi1_pcie.fw
firmware:       hfi1_sbus.fw
firmware:       hfi1_fabric.fw
firmware:       hfi1_dc8051.fw
retpoline:      Y
rhelversion:    7.9
srcversion:     0B7253581F7A7372A6FD8F1
alias:          pci:v00008086d000024F1sv*sd*bc*sc*i*
alias:          pci:v00008086d000024F0sv*sd*bc*sc*i*
depends:        rdmavt,ib_core,i2c-algo-bit
vermagic:       3.10.0-1160.53.1.el7.x86_64 SMP mod_unload modversions
```
> 4. CPU/system board/server model?
Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz (Broadwell)
Dell PowerEdge C6320: https://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/Dell-PowerEdge-C6320-Spec-Sheet.pdf
> 5. Are the OPA HFIs discrete HFIs or integrated (Xeon Phi, Skylake-F)?
Discrete (consistent with the spec sheet above): Intel Corporation Omni-Path HFI Silicon 100 Series.
I was able to reproduce on two otherwise idle nodes with 32 cores each with this script, where CLIENTNODE obviously needs to be adjusted to the name of the other node:

```
#!/bin/sh
CLIENTNODE=cdr808
for i in $(seq 16); do
  mkdir -p $i
  cd $i
  rm -f *
  ../psm2-demo -s &
  ssh $CLIENTNODE "cd $PWD && ../psm2-demo" &
  cd ..
done
wait
```
Using fewer than 16 (in the `seq`) concurrent processes didn't trigger the issue so far; 16 or more did.
Thanks. How many sockets do these servers have?
And the hfi1 module:

```
filename:       /lib/modules/3.10.0-1160.53.1.el7.x86_64/extra/ifs-kernel-updates/hfi1.ko.xz
version:        10.11.0.1
```
This came from an IFS/OPXS 10.11.0.1 install?
> Thanks. How many sockets do these servers have?
2 sockets (2x16 cores on the ones I tested)
> And the hfi1 module:
>
> ```
> filename:       /lib/modules/3.10.0-1160.53.1.el7.x86_64/extra/ifs-kernel-updates/hfi1.ko.xz
> version:        10.11.0.1
> ```
>
> This came from an IFS/OPXS 10.11.0.1 install?
Yes, according to the system administrator; the only alteration he made was to xz-compress the driver and strip debug symbols to save base-image space.
@bartoldeman Thank you for all of the information so far.
Unfortunately I haven’t been able to reproduce the issue despite configuring systems with identical versions and running the reproducer hundreds of times, both as described and with other settings.
I think we need to have a debug session or call on your systems to make progress on this problem. To proceed I need to have Cornelis Customer Support engaged. The best way to do this is to send an email referring to this issue to support@cornelisnetworks.com, including your contact information, and we will set up a call.
We should continue to communicate (e.g. like the questions below) through this GitHub issue as well though.
I have some questions to try to narrow down the paths in our software stack this problem occurs in.
Does this issue occur with?:

- (PIO send, eager receive)
  - Can test by setting `PSM2_MQ_EAGER_SDMA_SZ=1048576 PSM2_MQ_RNDV_HFI_THRESH=1048576` in the process/job environment.
- (SDMA send, eager receive)
  - Can test by setting `PSM2_MQ_RNDV_HFI_THRESH=1048576` in the process/job environment.

@bartoldeman is this still a problem?
Hi @BrendanCunningham
thanks for the heads up. Yes the problem still occurs, but due to holidays I haven't been able to spend much time on it recently. I'll answer your questions here this week, but will coordinate with Cedar's site lead (Martin Siegert) to communicate with Cornelis Customer Support.
Answers to questions @BrendanCunningham
> How easily can you reproduce the issue, i.e. how many runs does it take to see one occurrence of the issue?

Strangely, it seems to depend on the general state of the node. Earlier today I reserved two whole nodes and it triggered on the first run, but now, with the same server (receiver) node but a different client (sender), it doesn't happen. I'll get back to you if I can figure out why.

Edit: I triggered it on the new set of nodes after 5 runs.
> Does this issue occur with?:
>
> - (PIO send, eager receive)
>   - Can test by setting `PSM2_MQ_EAGER_SDMA_SZ=1048576 PSM2_MQ_RNDV_HFI_THRESH=1048576` in the process/job environment.
> - (SDMA send, eager receive)
>   - Can test by setting `PSM2_MQ_RNDV_HFI_THRESH=1048576` in the process/job environment.
There is no issue with either of those settings; it needs to be a rendezvous receive to trigger.
Does the issue occur with?:
- (aligned send buffer, aligned receive buffer) no
- (aligned send buffer, unaligned receive buffer) yes
- (unaligned send buffer, aligned receive buffer) no
@bartoldeman Thanks. That seems to narrow down the problem to the PSM2 expected-receive path.
I'm analyzing the code now to debug the problem further (and fix it!). I'll post an update when I have more information and questions.
Hi @BrendanCunningham
when I spent some time trying to follow the code myself I did see some code that may be related here:

i.e. it does play some games with alignment, and my suspicion is that if something unexpected happens, `tsess_unaligned_start` isn't taken into account properly somewhere (but no idea where!).
Do you still want access to the system? I've been in touch with the system administrators and I can sponsor you for an account, then the admins can give you a special reservation once it's ready. If so, I'll send the email to support.
> Hi @BrendanCunningham
>
> when I spent some time trying to follow the code myself I did see some code that may be related here:
>
> i.e. it does play some games with alignment, and my suspicion is that if something unexpected happens, `tsess_unaligned_start` isn't taken into account properly somewhere (but no idea where!).
Yes, the expected receive works on 4B or 64B offsets with paths to handle unaligned start/end. That is my suspicion as well and what I'm looking into.
> Do you still want access to the system? I've been in touch with the system administrators and I can sponsor you for an account, then the admins can give you a special reservation once it's ready. If so, I'll send the email to support.
Yes, please get me access so I can try/debug on your system. Thanks.
@bartoldeman I have identified the root cause and developed a fix.
Having identified the root cause on your systems, I am now able to reproduce the problem on our systems.
I have validated this fix against the reproducer that you provided on both your systems and ours.
Please pull and build the issue-64-Psm2UnalignedRecvFix branch, try it, and report whether it works for you. If possible, please try it with your original mpi4py application as well.
If the fix also works for you, we'll merge it into opa-psm2/master.
Thanks! I will test it today and tomorrow and let you know.
I can confirm that this fixes the `psm2-demo` test case as well as the original `mpi4py` test case with pickled `MPI_Allgather` on two and four nodes.
Thanks again for fixing this tough bug!
The following program, also attached:
occasionally fails on a large Omnipath cluster (Cedar), for all Open MPI versions we tested (2.1.1, 3.1.2, 4.0.3, 4.1.1), and Intel MPI 2021.2.0. Only if libfabric 1.12.1 is used as an intermediate via Open MPI's 4.1.1 OFI mtl it never seems to trigger.
if it fails we get output such as:
so index 32764 has received what should have been received at index 32767 (and the following bytes are shifted similarly), suggesting that something was rounded down.
This issue doesn't trigger with shm, only hfi, and only for larger messages (i.e. the rendezvous protocol). I should distill this down to a program that uses the PSM2 API directly, but if anyone has a hint from this program already, please let me know!
Here's a slurm submission script which triggers it (but not always!):
unalignedrecv.c.txt
This issue relates to https://github.com/mpi4py/mpi4py/issues/186 (the C program simulates MPI_Gather).