cornelisnetworks / opa-psm2


Collective message failure with PSM2 #60

Open patrick-legi opened 3 years ago

patrick-legi commented 3 years ago

Hi, for several weeks I have been trying to understand a problem (wrong behavior) with Fortran MPI_ALLTOALLW calls. The problem only occurs on a Debian supercomputer that uses this opa-psm2 library for its Omni-Path architecture. I, together with two OpenMPI developers, have tested many other architectures (Intel or AMD CPUs, with Ethernet, Omni-Path or InfiniBand networks, running Red Hat or SUSE OS). The problem does not occur in any of these tests. Moreover, if I build OpenMPI on the Debian computer with the --without-psm2 flag, the problem does not occur, but Omni-Path performance is not reached. I'm building OpenMPI 4.0.5 with gcc 6.3 or gcc 10.2 (same behavior).

Please find attached a really small test case showing the problem. If all runs fine it prints "Test pass!", otherwise it shows the wrong values and calls mpi_abort(). To run this test:

  1. make
  2. mpirun -np 4 ./test_layout_array
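
For readers less familiar with the call, here is a minimal, hypothetical sketch (not the attached test case; the names and the use of plain MPI_INTEGER are illustrative only) of the MPI_ALLTOALLW call shape the test exercises, with per-peer counts, byte displacements and one datatype per peer:

! Minimal sketch of an MPI_ALLTOALLW exchange: one element to/from every rank.
program alltoallw_shape
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, isize, i
  integer, allocatable :: scounts(:), rcounts(:), sdispls(:), rdispls(:)
  integer, allocatable :: stypes(:), rtypes(:), sendbuf(:), recvbuf(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  call MPI_Type_size(MPI_INTEGER, isize, ierr)

  allocate(scounts(nprocs), rcounts(nprocs), sdispls(nprocs), rdispls(nprocs), &
           stypes(nprocs), rtypes(nprocs), sendbuf(nprocs), recvbuf(nprocs))

  scounts = 1; rcounts = 1                 ! one element to/from every rank
  stypes  = MPI_INTEGER; rtypes = MPI_INTEGER
  do i = 1, nprocs
     sdispls(i) = (i - 1) * isize          ! alltoallw displacements are in bytes
     rdispls(i) = (i - 1) * isize
     sendbuf(i) = rank * 1000 + i
  end do

  call MPI_Alltoallw(sendbuf, scounts, sdispls, stypes, &
                     recvbuf, rcounts, rdispls, rtypes, MPI_COMM_WORLD, ierr)

  call MPI_Finalize(ierr)
end program alltoallw_shape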

Patrick

DEBUG.tar.gz

mwheinz commented 3 years ago

Patrick, you never replied to my last email. Are you using CUDA?

Also - please post a sample command line showing exactly how you are running this test case, along with successful and unsuccessful outputs and exactly what version of PSM2 you are using.

Intel never directly supported Debian and neither does Cornelis Networks, but I will take another look when I get a chance.

patrick-legi commented 3 years ago

Hi Michael,

Sorry for not answering that question; I have run so many tests these last days, some of them with the admins of this cluster, that I forgot to include these details. They also asked me to open this issue. But no, I do not use CUDA; these nodes do not have any GPU or GPGPU.

I have also spent some time reworking the CFD code to use point-to-point communications instead of mpi_alltoallw collective communications for this purpose. It is less efficient, but the code is usable again on this supercomputer.


patrick-legi commented 3 years ago

About this test:

    make
    mpirun -np 4 ./test_layout_array

If all runs fine it prints "Test pass!", otherwise it shows the wrong values detected and calls mpi_abort(). The error also occurs with all processes on the same node. There is a README file in the archive for more details.

No GPU on the nodes, no multithreaded implementation (just MPI).

About PSM2 versions:

I installed it from GitHub. It seems to be the same as the OS version, as there are no recent commits. The COMMIT file contains:

30c52a0fd155774e18cc06328a1ba83c2a6a8104

For the OS-provided libraries (also tested):

dpkg -l |grep -i psm2
ii  libfabric-psm2                         1.10.0-2-1ifs+deb9                amd64        Dynamic PSM2 provider for user-space Open Fabric Interfaces
ii  libpsm2-2                              11.2.185-1-1ifs+deb9              amd64        Intel PSM2 Libraries
ii  libpsm2-2-compat                       11.2.185-1-1ifs+deb9              amd64        Compat library for Intel PSM2
ii  libpsm2-dev                            11.2.185-1-1ifs+deb9              amd64        Development files for Intel PSM2
ii  openmpi-gcc-hfi                        4.0.3-8-1ifs+deb9                 amd64        Powerful implementation of MPI/SHMEM with PSM2

Note that I also see the bug with the OS-deployed openmpi-gcc-hfi.

Patrick

mwheinz commented 3 years ago

Okay. I'll try to look at this today. You're running 4 ranks on only one host?

patrick-legi commented 3 years ago

Yes, it is enough to show the problem. And the problem size is also very small: in the main program it is set to 5x7 points to make it easy to track the problem with a debugger. Larger dimensions also show the problem. But of course the real CFD code runs in 3D with high resolutions on many nodes; this is just a test case showing the problem.

nsccap commented 3 years ago

I can reproduce your issue on our production platform with all of the OpenMPIs I tried (2.1.2, 3.1.2 and 4.0.5). Our system is CentOS7 based with libpsm2-11.2.78-1.el7. However, we run ~5 million jobs through this per year (most of them IntelMPI though) and we've had no issues reported by our quite wide user community. Are you sure your example is valid MPI?

patrick-legi commented 3 years ago

Hi Peter, I have also run this code successfully on many architectures, even though a bug in my CFD code or in this simple 2D test case remains possible. When the use of PSM2 is disabled on the Debian/Omni-Path cluster the problem does not occur, but I do not reach OPA's high performance. When setting up this minimal test case (4 processes, really small resolution, 2D instead of 3D), the goal was to check all subarray type parameters with gdb, and I did not find anything going wrong. Patrick

nsccap commented 3 years ago

Note that I said I COULD reproduce it. In fact I could not make it run successfully with OpenMPI and PSM2 in any way. It did run ok without PSM2 or with IntelMPI on PSM2. This however does not guarantee that the code is correct (I've not had time to analyze it myself).

patrick-legi commented 3 years ago

I agree, even such a small code may have a bug inside... even with my deep checks using gdb. Does Intel MPI use the same PSM2 implementation as OpenMPI? How can I help?

nsccap commented 3 years ago

Well, I'm just a systems expert who read your thread on openmpi-users and thought I'd help you out by contributing my testing results. Also, it was not until now that I realized you actually meant alltoallw (not a typo for alltoallv). I can imagine that being bugged without people noticing. In fact, this is the first time I've heard of an application that uses it (I'm sure there are examples I've missed, though). Anyway, alltoallw is not very common and probably sees very limited testing...

edit: yes, my IntelMPI test was using the same PSM2. One can do "export PSM2_IDENTIFY=1" before mpirun to get runtime info on what is used.
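
For reference, a minimal way to do that (assuming a bash-like shell and the binary name from the attached archive):

    export PSM2_IDENTIFY=1
    mpirun -np 4 ./test_layout_array

Each rank then reports at startup which PSM2 library it loaded and its version.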

mwheinz commented 3 years ago

Up until our spin-off from Intel, yes, they did. I'm not sure what their plans are now.

mwheinz commented 3 years ago

Patrick,

I just tried your DEBUG package on my machines and I did get your error when I used PSM2 - but I got the same error when I used verbs, so I still don't know that this is a PSM2 issue. Here's what I did:

[cn-priv-01:~/work/STL-61275/DEBUG](N/A)$ mpirun --mca mtl_base_verbose 9 --mca mtl ofi --mca mtl_ofi_provider_exclude psm2 --mca FI_LOG_LEVEL info -np 4 ./test_layout_array
[cn-priv-01:1211573] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[cn-priv-01:1211573] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "psm2"
[cn-priv-01:1211573] mtl_ofi_component.c:336: mtl:ofi: "psm2" in exclude list
[cn-priv-01:1211573] mtl_ofi_component.c:336: mtl:ofi: "psm2" in exclude list
[cn-priv-01:1211573] mtl_ofi_component.c:347: mtl:ofi:prov: verbs;ofi_rxm
[cn-priv-01:1211575] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[cn-priv-01:1211575] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "psm2"
[cn-priv-01:1211575] mtl_ofi_component.c:336: mtl:ofi: "psm2" in exclude list
[cn-priv-01:1211575] mtl_ofi_component.c:336: mtl:ofi: "psm2" in exclude list
[cn-priv-01:1211575] mtl_ofi_component.c:347: mtl:ofi:prov: verbs;ofi_rxm
[cn-priv-01:1211574] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[cn-priv-01:1211574] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "psm2"
[cn-priv-01:1211574] mtl_ofi_component.c:336: mtl:ofi: "psm2" in exclude list
[cn-priv-01:1211574] mtl_ofi_component.c:336: mtl:ofi: "psm2" in exclude list
[cn-priv-01:1211574] mtl_ofi_component.c:347: mtl:ofi:prov: verbs;ofi_rxm
[cn-priv-01:1211576] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[cn-priv-01:1211576] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "psm2"
[cn-priv-01:1211576] mtl_ofi_component.c:336: mtl:ofi: "psm2" in exclude list
[cn-priv-01:1211576] mtl_ofi_component.c:336: mtl:ofi: "psm2" in exclude list
[cn-priv-01:1211576] mtl_ofi_component.c:347: mtl:ofi:prov: verbs;ofi_rxm
On 1 found 1007 but expect 3007
Test fails on process rank 1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[cn-priv-01:1211569] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2193
On 2 found 1007 but expect 4007
Test fails on process rank 2
[cn-priv-01:1211576] mtl_ofi.h:511: fi_tsendddata failed: No route to host(-113)
[cn-priv-01:1211569] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[cn-priv-01:1211569] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
mwheinz commented 3 years ago

I get the same failure when using OFI sockets:

...

[cn-priv-01:~/work/STL-61275/DEBUG](N/A)$ mpirun --mca mtl_base_verbose 9 --mca mtl ofi --mca mtl_ofi_provider_include sockets -x FI_LOG_LEVEL=trace -x glibc.malloc.check=1 -np 4 ./test_layout_array
libfabric:1213328:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213328:core:core:fi_getinfo_():955<warn> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:1213328:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_rxm returned -61 (No data available)
libfabric:1213328:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_rxd returned -61 (No data available)
libfabric:1213328:ofi_mrail:fabric:mrail_get_core_info():289<warn> OFI_MRAIL_ADDR_STRC env variable not set!
libfabric:1213328:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_mrail returned -61 (No data available)
libfabric:1213329:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213329:core:core:fi_getinfo_():955<warn> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:1213329:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_rxm returned -61 (No data available)
libfabric:1213329:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_rxd returned -61 (No data available)
libfabric:1213329:ofi_mrail:fabric:mrail_get_core_info():289<warn> OFI_MRAIL_ADDR_STRC env variable not set!
libfabric:1213329:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_mrail returned -61 (No data available)
libfabric:1213327:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213330:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213327:core:core:fi_getinfo_():955<warn> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:1213327:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_rxm returned -61 (No data available)
libfabric:1213327:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_rxd returned -61 (No data available)
libfabric:1213327:ofi_mrail:fabric:mrail_get_core_info():289<warn> OFI_MRAIL_ADDR_STRC env variable not set!
libfabric:1213327:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_mrail returned -61 (No data available)
libfabric:1213330:core:core:fi_getinfo_():955<warn> fi_getinfo: provider usnic returned -61 (No data available)
libfabric:1213330:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_rxm returned -61 (No data available)
libfabric:1213330:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_rxd returned -61 (No data available)
libfabric:1213330:ofi_mrail:fabric:mrail_get_core_info():289<warn> OFI_MRAIL_ADDR_STRC env variable not set!
libfabric:1213330:core:core:fi_getinfo_():955<warn> fi_getinfo: provider ofi_mrail returned -61 (No data available)
libfabric:1213328:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213328:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213330:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213330:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213329:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213329:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213327:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213327:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213328:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213330:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213327:usnic:fabric:usdf_getinfo():763<trace> 
libfabric:1213329:usnic:fabric:usdf_getinfo():763<trace> 
[cn-priv-01:1213328] mtl_ofi_component.c:315: mtl:ofi:provider_include = "sockets"
[cn-priv-01:1213328] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "psm2" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "psm2" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxm" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxm" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "psm2;ofi_rxd" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "psm2;ofi_rxd" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxd" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:326: mtl:ofi: "shm" not in include list
[cn-priv-01:1213328] mtl_ofi_component.c:347: mtl:ofi:prov: sockets
[cn-priv-01:1213330] mtl_ofi_component.c:315: mtl:ofi:provider_include = "sockets"
[cn-priv-01:1213330] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "psm2" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "psm2" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxm" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxm" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "psm2;ofi_rxd" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "psm2;ofi_rxd" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxd" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:326: mtl:ofi: "shm" not in include list
[cn-priv-01:1213330] mtl_ofi_component.c:347: mtl:ofi:prov: sockets
[cn-priv-01:1213327] mtl_ofi_component.c:315: mtl:ofi:provider_include = "sockets"
[cn-priv-01:1213327] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "psm2" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "psm2" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxm" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxm" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "psm2;ofi_rxd" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "psm2;ofi_rxd" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxd" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:326: mtl:ofi: "shm" not in include list
[cn-priv-01:1213327] mtl_ofi_component.c:347: mtl:ofi:prov: sockets
[cn-priv-01:1213329] mtl_ofi_component.c:315: mtl:ofi:provider_include = "sockets"
[cn-priv-01:1213329] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "psm2" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "psm2" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxm" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxm" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "tcp;ofi_rxm" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "psm2;ofi_rxd" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "psm2;ofi_rxd" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "verbs;ofi_rxd" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "UDP;ofi_rxd" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:326: mtl:ofi: "shm" not in include list
[cn-priv-01:1213329] mtl_ofi_component.c:347: mtl:ofi:prov: sockets
On 1 found 1007 but expect 3007
Test fails on process rank 1
On 2 found 1007 but expect 4007
Test fails on process rank 2
libfabric:1213328:sockets:ep_data:sock_rx_new_buffered_entry():109<warn> Exceeded buffered recv limit
libfabric:1213328:sockets:ep_data:sock_rx_new_buffered_entry():109<warn> Exceeded buffered recv limit
libfabric:1213329:sockets:ep_data:sock_rx_new_buffered_entry():109<warn> Exceeded buffered recv limit
libfabric:1213329:sockets:ep_data:sock_rx_new_buffered_entry():109<warn> Exceeded buffered recv limit
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[cn-priv-01:1213323] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[cn-priv-01:1213323] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cn-priv-01:~/work/STL-61275/DEBUG](N/A)$ 

...

mwheinz commented 3 years ago

I haven't used Fortran in ~20 years so I'm having trouble reading your sample app. What is the largest chunk of data that you send at one time?

patrick-legi commented 3 years ago

The largest chunk contains 4 elements, the smallest 1 element. The structure (a variable called val) contains 2 arrays describing the organization of chunks:

  1. Y4layoutOnX(ncpus,2): when data are stored along the X axis, it holds the Ymin, Ymax on each rank.
  2. X4layoutOnY(ncpus,2): when data are stored along the Y axis, it holds the Xmin, Xmax on each rank.

In Fortran, arrays are allocated with their real indices in the global array, e.g. (1:nx, ymin:ymax). The test case switches from one organization to the other and back; see the sketch below. (Attached figures: alongy, alongY.)
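
For illustration, here is a hypothetical sketch (nx, ymin, ymax and the use of MPI_DOUBLE_PRECISION are assumptions, not the actual variables of the attached test) of how per-rank subarray datatypes for such a layout switch are typically built before being handed to MPI_ALLTOALLW:

! Hypothetical sketch: build one send datatype per destination rank when the
! local data are stored along X (full rows 1:nx for the slab ymin:ymax) and
! must be redistributed into the along-Y layout described by X4layoutOnY.
subroutine build_send_types(nx, ymin, ymax, ncpus, X4layoutOnY, stype)
  use mpi
  implicit none
  integer, intent(in)  :: nx, ymin, ymax, ncpus
  integer, intent(in)  :: X4layoutOnY(ncpus,2)   ! global Xmin,Xmax owned by each rank
  integer, intent(out) :: stype(ncpus)
  integer :: sizes(2), subsizes(2), starts(2), ierr, p

  do p = 1, ncpus
     sizes    = (/ nx, ymax - ymin + 1 /)                    ! shape of the local array
     subsizes = (/ X4layoutOnY(p,2) - X4layoutOnY(p,1) + 1, &
                   ymax - ymin + 1 /)                        ! block destined for rank p
     starts   = (/ X4layoutOnY(p,1) - 1, 0 /)                ! MPI starts are 0-based
     call MPI_Type_create_subarray(2, sizes, subsizes, starts, &
          MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, stype(p), ierr)
     call MPI_Type_commit(stype(p), ierr)
  end do
end subroutine build_send_types
! These types then go into the sendtypes argument of MPI_ALLTOALLW, with send
! counts of 1 and byte displacements of 0 (the offsets are encoded in the types).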

mwheinz commented 3 years ago

Well, that ruins that idea. Many transports have a maximum message size but don't enforce it, leading to data corruption - but you'd have to be sending 2 gigabytes or more in a single message for this to become a factor for PSM2.

mwheinz commented 3 years ago

Patrick, I'm going to continue to look at this when I can - but since I get the same error with verbs and with sockets, I really think you should move this to the OMPI repo.

patrick-legi commented 3 years ago

Thanks, Michael, for your help. I'll open an issue on OMPI soon; this week I have a lot of teaching hours to do, so maybe at the end of the week. I will also point to this discussion.