EESSI / software-layer

Software layer of the EESSI project
https://eessi.github.io/docs/software_layer
GNU General Public License v2.0

OpenMPI complaining about OFI call #136

Open hmeiland opened 3 years ago

hmeiland commented 3 years ago

While running the OSU benchmarks on a single system (CentOS Linux release 7.9.2009 (Core)), OpenMPI gives the following error:

[EESSI pilot 2021.06] $ mpirun -n 2 osu_bw
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: headnode
  Location: mtl_ofi_component.c:610
  Error: No data available (61)
--------------------------------------------------------------------------

This can be prevented by telling OpenMPI to use UCX (which is already loaded):

[EESSI pilot 2021.06] $ ml list

Currently Loaded Modules:
  1) GCCcore/9.3.0                  4) libxml2/2.9.10-GCCcore-9.3.0      7) libevent/2.1.11-GCCcore-9.3.0   10) PMIx/3.1.5-GCCcore-9.3.0  13) OSU-Micro-Benchmarks/5.6.3-gompi-2020a
  2) GCC/9.3.0                      5) libpciaccess/0.16-GCCcore-9.3.0   8) UCX/1.8.0-GCCcore-9.3.0         11) OpenMPI/4.0.3-GCC-9.3.0
  3) numactl/2.0.13-GCCcore-9.3.0   6) hwloc/2.2.0-GCCcore-9.3.0         9) libfabric/1.11.0-GCCcore-9.3.0  12) gompi/2020a

Adding export OMPI_MCA_pml=ucx prevents this:

[EESSI pilot 2021.06] $ export OMPI_MCA_pml=ucx
[EESSI pilot 2021.06] $ mpirun -n 2 osu_bw
# OSU MPI Bandwidth Test v5.6.3
# Size      Bandwidth (MB/s)
1                      10.54
2                      22.05
4                      44.18
8                      87.96
16                    176.31
32                    323.08
64                    687.36
128                   834.63
256                  1550.03
512                  2412.94
1024                 3708.65
2048                 5959.67
4096                 6954.11
8192                 9147.57
16384                7750.81
32768               10947.38
65536               13826.51
131072              16181.14
262144              17932.76
524288              16478.82
1048576             13073.89
2097152             10085.20
4194304              9312.80

Can this variable be set in the OpenMPI module?

ocaisa commented 3 years ago

Setting export OMPI_MCA_pml=ucx assumes that UCX is the only game in town, but it's not: we also have support in there for the cm PML, through which you can use Omni-Path and libfabric. The EFA fabric on AWS, for example, requires libfabric.
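For illustration, selecting the cm/libfabric path instead would look roughly like the sketch below; the exact provider depends on the fabric, and FI_PROVIDER=efa is only an example for the AWS EFA case:

$ export OMPI_MCA_pml=cm      # use the cm PML instead of UCX
$ export OMPI_MCA_mtl=ofi     # go through libfabric via the OFI MTL
$ export FI_PROVIDER=efa      # example only: restrict libfabric to the EFA provider
$ mpirun -n 2 osu_bw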

The (relatively recent) EasyBuild tech talks on OpenMPI do a good job of covering this: https://github.com/easybuilders/easybuild/wiki/EasyBuild-tech-talks-I:-Open-MPI

I'm not sure we can get away from people having to use additional settings to get something like OpenMPI to work as expected for their particular environment.

hmeiland commented 3 years ago

ok, so it looks like we need an archspec.interconnect

hmeiland commented 3 years ago

Would an `if test -d /sys/class/infiniband; then export OMPI_MCA_pml=ucx; fi` be enough? In Lua, something like `path.exists("/sys/class/infiniband")`? It looks like this path alone is not enough, but checking for /sys/class/infiniband/mlx5_ib0 could work...
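A minimal shell sketch of that check (the mlx* glob is an assumption based on this system; other HCAs show up under different names):

# prefer UCX only when a Mellanox IB device is visible to the kernel
if ls /sys/class/infiniband/mlx* >/dev/null 2>&1; then
    export OMPI_MCA_pml=ucx
fi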

wpoely86 commented 3 years ago

A similar one: to use srun with EESSI, you need to make sure that SLURM_MPI_TYPE is set to the correct value (normally pmix).
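For reference, a minimal sketch of that, assuming Slurm was built with PMIx support and re-using the osu_bw binary from above:

$ export SLURM_MPI_TYPE=pmix
$ srun -n 2 osu_bw

Passing --mpi=pmix to srun on each invocation achieves the same thing.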

A mechanism to inject these variables into EESSI would be useful?

ocaisa commented 3 years ago

We discussed this recently. One way is to rebuild the module tree using your preferred module naming scheme (MNS) and have a hook that injects the envvars needed to use MPI effectively/easily.

Another would be to have a conditionally loaded module, called something like mpi-settings, which could be symlinked into EESSI somewhere under the existing host_injections. That could house these kinds of envvars.

wpoely86 commented 3 years ago

I'm a big fan of the second option. It gives a lot of freedom to tweak per site.

boegel commented 3 years ago

Just had a quick call with @hmeiland on this; he's working on interconnect detection support in archspec, which could help avoid the originally reported problem: if archspec reports that the system is using IB, we could automatically set $OMPI_MCA_pml to ucx and avoid the need for site-specific tweaking.

Maybe something similar can be done for Slurm too. In both cases, I think it makes sense to have a way of opting out (just in case our detection does something silly or wrong).
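A rough sketch of how that could look in the init script; EESSI_IGNORE_MPI_TWEAKS is a hypothetical opt-out variable (the name is not decided), and the plain /sys check stands in for the future archspec-based detection:

# hypothetical opt-out knob; name to be decided
if [ -z "${EESSI_IGNORE_MPI_TWEAKS}" ]; then
    # stand-in for archspec-based interconnect detection
    if ls /sys/class/infiniband/mlx* >/dev/null 2>&1; then
        export OMPI_MCA_pml=ucx
    fi
fi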

One thing that puzzles me is why the libfabric issue happens at all: is it because Open MPI sees that libfabric is available, tries to use or query it (first, before UCX), fails utterly, and then "falls back" to UCX? I thought/expected Open MPI to be smarter than that: first check which interconnect is available, and then select either libfabric or UCX. I wonder if @jsquyres can provide some more insight into that? (happy to clarify the context here if needed)

Providing an easy way to let sites do site-specific tweaks to the EESSI environment through a conditionally loaded module makes a lot of sense too, of course, since that indeed gives a lot of freedom. @ocaisa Can you open an issue on that if we don't have one yet? That seems like something we can easily support in the next pilot version (or even in the existing 2021.06, since it's just a small tweak to the init scripts).

jsquyres commented 3 years ago

Do you know which interface the OFI component was complaining about?

hmeiland commented 3 years ago

Do you know which interface the OFI component was complaining about?

I'm expecting ib0 to be used:

[EESSI pilot 2021.06] $ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 60:45:bd:87:94:e7 brd ff:ff:ff:ff:ff:ff
    inet 10.0.16.9/20 brd 10.0.31.255 scope global noprefixroute eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::6245:bdff:fe87:94e7/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:15:5d:33:ff:1a brd ff:ff:ff:ff:ff:ff
4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:09:28:fe:80:00:00:00:00:00:00:00:15:5d:ff:fd:33:ff:1a brd 00:ff:ff:ff:ff:12:40:1b:80:14:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.16.1.17/16 brd 172.16.255.255 scope global ib0
       valid_lft forever preferred_lft forever
    inet6 fe80::215:5dff:fd33:ff1a/64 scope link
       valid_lft forever preferred_lft forever

31d8:00:02.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
    Subsystem: Mellanox Technologies Device 0003
    Physical Slot: 2
    Flags: bus master, fast devsel, latency 0, NUMA node 0
    Memory at fe0000000 (64-bit, prefetchable) [size=32M]
    Capabilities:
    Kernel driver in use: mlx5_core
    Kernel modules: mlx5_core

jsquyres commented 3 years ago

ib0 is an IPoIB interface; it is not suitable for native IB traffic. ib0 is really only intended for applications that do not use Verbs / UCX (or OFI) directly; it's intended for apps that only know how to speak TCP/UDP via the POSIX API calls. Specifically: ib0 is an emulation layer on top of native IB, and therefore adds latency and loses bandwidth compared to the native IB stack. You'll need to talk to NVIDIA for details.

I can't tell what version of Open MPI you're using from this github issue. Is it Open MPI v4.1.x?

What I was trying to ask in my prior comment was: the OFI component is complaining about a specific interface (I would doubt that it is ib0 -- it is likely some other native IB interface). Do you know which one it is complaining about? Try running with mpirun --mca mtl_base_verbose 100 ... and see if there's any useful output in there.

I agree that disabling the CM PML (and therefore the OFI MTL) will avoid your problem. But I also agree with @boegel that Open MPI should behave better than that, such that you wouldn't need to manually disable CM/OFI.

That being said, what version of libfabric do you have? Perhaps it's an old libfabric that is incorrectly querying IB interfaces...?

hmeiland commented 3 years ago

[EESSI pilot 2021.06] $ mpirun --version
mpirun (Open MPI) 4.0.3

libfabric/1.11.0

The Mellanox ConnectX5 is the physical device, and is available through the kernel as /sys/class/infiniband/mlx5_ib0

I'm not seeing any specific interface being referenced below.....

[EESSI pilot 2021.06] $ mpirun --mca mtl_base_verbose 100 -n 2 osu_bw
[ip-0A001009:17523] mca: base: components_register: registering framework mtl components
[ip-0A001009:17523] mca: base: components_register: found loaded component ofi
[ip-0A001009:17524] mca: base: components_register: registering framework mtl components
[ip-0A001009:17524] mca: base: components_register: found loaded component ofi
[ip-0A001009:17524] mca: base: components_register: component ofi register function successful
[ip-0A001009:17524] mca: base: components_register: found loaded component psm2
[ip-0A001009:17523] mca: base: components_register: component ofi register function successful
[ip-0A001009:17523] mca: base: components_register: found loaded component psm2
[ip-0A001009:17524] mca: base: components_register: component psm2 register function successful
[ip-0A001009:17523] mca: base: components_register: component psm2 register function successful
[ip-0A001009:17524] mca: base: components_open: opening mtl components
[ip-0A001009:17524] mca: base: components_open: found loaded component ofi
[ip-0A001009:17523] mca: base: components_open: opening mtl components
[ip-0A001009:17523] mca: base: components_open: found loaded component ofi
[ip-0A001009:17523] mca: base: components_open: component ofi open function successful
[ip-0A001009:17524] mca: base: components_open: component ofi open function successful
[ip-0A001009:17524] mca: base: components_open: found loaded component psm2
[ip-0A001009:17523] mca: base: components_open: found loaded component psm2
[ip-0A001009:17523] mca: base: close: component psm2 closed
[ip-0A001009:17523] mca: base: close: unloading component psm2
[ip-0A001009:17524] mca: base: close: component psm2 closed
[ip-0A001009:17524] mca: base: close: unloading component psm2
[ip-0A001009:17523] mca:base:select: Auto-selecting mtl components
[ip-0A001009:17523] mca:base:select:(  mtl) Querying component [ofi]
[ip-0A001009:17523] mca:base:select:(  mtl) Query of component [ofi] set priority to 25
[ip-0A001009:17523] mca:base:select:(  mtl) Selected component [ofi]
[ip-0A001009:17523] select: initializing mtl component ofi
[ip-0A001009:17524] mca:base:select: Auto-selecting mtl components
[ip-0A001009:17524] mca:base:select:(  mtl) Querying component [ofi]
[ip-0A001009:17524] mca:base:select:(  mtl) Query of component [ofi] set priority to 25
[ip-0A001009:17524] mca:base:select:(  mtl) Selected component [ofi]
[ip-0A001009:17524] select: initializing mtl component ofi
[ip-0A001009:17523] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[ip-0A001009:17523] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[ip-0A001009:17523] mtl_ofi_component.c:347: mtl:ofi:prov: verbs;ofi_rxm
[ip-0A001009:17524] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[ip-0A001009:17524] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[ip-0A001009:17524] mtl_ofi_component.c:347: mtl:ofi:prov: verbs;ofi_rxm
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: ip-0A001009
  Location: mtl_ofi_component.c:610
  Error: No data available (61)
--------------------------------------------------------------------------
[ip-0A001009:17523] select: init returned failure for component ofi
[ip-0A001009:17523] select: no component selected
[ip-0A001009:17524] select: init returned failure for component ofi
[ip-0A001009:17524] select: no component selected
[ip-0A001009:17523] mca: base: close: component ofi closed
[ip-0A001009:17523] mca: base: close: unloading component ofi
[ip-0A001009:17524] mca: base: close: component ofi closed
[ip-0A001009:17524] mca: base: close: unloading component ofi
# OSU MPI Bandwidth Test v5.6.3
# Size      Bandwidth (MB/s)
1                      11.04
2                      22.56
4                      44.36
8                      91.57
16                    179.17
32                    361.96
64                    750.71
128                   808.43
256                  1530.46
512                  2444.58
1024                 3866.79
2048                 6090.55
4096                 7274.32
8192                 9583.67
16384                8143.13
32768               11699.30
65536               14597.15
131072              16974.20
262144              18190.47
524288              17385.82
1048576             13818.62
2097152             10492.79
4194304             10047.21
[ip-0A001009:17519] 1 more process has sent help message help-mtl-ofi.txt / OFI call fail
[ip-0A001009:17519] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Thanks for helping here!!

jsquyres commented 3 years ago

Let me inquire with others in the Open MPI community and get back to you.

Can you try upgrading to Open MPI v4.1.1? I have a very dim (and possibly incorrect) recollection that we fixed an issue that sounds like this since v4.0.x.

hmeiland commented 3 years ago

Let me inquire with others in the Open MPI community and get back to you.

Can you try upgrading to Open MPI v4.1.1? I have a very dim (and possibly incorrect) recollection that we fixed an issue that sounds like this since v4.0.x.

I'll try the upgrade!

jsquyres commented 3 years ago

I'll try the upgrade!

Thanks. If the upgrade doesn't fix it, please run with export FI_LOG_LEVEL=info (and possibly mpirun -x FI_LOG_LEVEL ...), which will tell libfabric.so to emit lots of juicy debug info, which might give us a little more insight.
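Spelled out against the earlier invocation, that would be something like:

$ export FI_LOG_LEVEL=info
$ mpirun -x FI_LOG_LEVEL -n 2 osu_bw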

boegel commented 3 years ago

Let me inquire with others in the Open MPI community and get back to you. Can you try upgrading to Open MPI v4.1.1? I have a very dim (and possibly incorrect) recollection that we fixed an issue that sounds like this since v4.0.x.

I'll try the upgrade!

@hmeiland Let me know if you need help with that; it's pretty easy for us to ingest a couple of additional software installations in the 2021.06 version of the EESSI pilot to allow testing of this (especially if we just do it for a single specific CPU target).

hmeiland commented 3 years ago

Looks like it is solved in OpenMPI 4.1.1 with libfabric 1.12.1, which are included in the EasyBuild gompi/2021a toolchain:

$ mpirun --version
mpirun (Open MPI) 4.1.1

Report bugs to http://www.open-mpi.org/community/help/
$ mpirun -n 2 osu_bw
# OSU MPI Bandwidth Test v5.7.1
# Size      Bandwidth (MB/s)
1                      14.53
2                      29.09
4                      59.38
8                     118.41
16                    238.19
32                    466.83
64                    984.38
128                  1029.14
256                  1781.63
512                  2766.15
1024                 4144.41
2048                 6388.11
4096                 8656.42
8192                10748.83
16384                6012.96
32768                9456.37
65536               14436.65
131072              19507.37
262144              24342.00
524288              25778.23
1048576             20787.43
2097152             13602.21
4194304             12378.99

Will this be a suitable candidate toolchain to move EESSI towards?

boegel commented 3 years ago

@hmeiland Yes, definitely. We hope/plan to jump to newer toolchains in the next EESSI pilot version.

Is it only fixed in gompi/2021a, or also in gompi/2020b?

hmeiland commented 3 years ago

Only in gompi/2021a, not in gompi/2020b (which has OpenMPI 4.0.5); tested with both...

boegel commented 3 years ago

@jsquyres So it seems like the problem is indeed gone with Open MPI 4.1...

Do you happen to have any pointers to where this was fixed? Would it be doable to backport that to Open MPI 4.0.x?

jsquyres commented 3 years ago

I don't know offhand where it was fixed, I'm sorry. It should be easy to test if it has been fixed in the 4.0.x series -- you can try the latest 4.0.x nightly snapshot tarball from here: https://www.open-mpi.org/nightly/v4.0.x/

robogast commented 10 months ago

Just a small FYI: I ran into this on Snellius while running https://github.com/NVIDIA/nccl-tests, using EasyBuild's NCCL/2.18.3-GCCcore-12.3.0-CUDA-12.1.1 + OpenMPI/4.1.5-GCC-12.3.0. Setting export OMPI_MCA_pml=ucx fixed the error.

ocaisa commented 10 months ago

@robogast We currently don't ship NCCL with EESSI (though it's not far away), so I'm guessing you are getting this from somewhere else? I think you may be hitting something related to how EasyBuild builds UCX, and in particular the way we implement the CUDA plugin for UCX. This only applies to UCX, not libfabric, so it may be the case that you need to specify UCX explicitly in that scenario.

robogast commented 10 months ago

@ocaisa No, I haven't run into this issue through EESSI; it was on our cluster, which uses EasyBuild. A quick Google search led me to this issue, and I just wanted to log that I am running into the same problem.

Let me know if you need NCCL ReFrame tests, I've just created them for our cluster :)

niktre commented 8 months ago

Exactly like @robogast, we don't run into this issue with EESSI but with EasyBuild on our cluster.

We are seeing the same problem:

Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: gpu0201
  Location: mtl_ofi_component.c:939
  Error: No data available (61)

with mpi/OpenMPI/4.1.4-GCC-12.2.0 and system/CUDA/12.1.0