hmeiland opened 3 years ago
Setting `export OMPI_MCA_pml=ucx` assumes that UCX is the only game in town, but it's not: we also have support in there for `cm`, through which you can use OmniPath and libfabric. The EFA fabric on AWS, for example, requires libfabric.
The (relatively recent) EasyBuild tech talks on OpenMPI do a good job of covering this: https://github.com/easybuilders/easybuild/wiki/EasyBuild-tech-talks-I:-Open-MPI
I'm not sure we can get away from people having to use additional settings to get something like OpenMPI to work as expected for their particular environment.
OK, so it looks like we need an `archspec.interconnect`.
Would an `if test -d /sys/class/infiniband; then export OMPI_MCA_pml=ucx; fi` be enough? In Lua, something like `path.exists("/sys/class/infiniband")`? Looks like this path is not enough; but when checking for `/sys/class/infiniband/mlx5_ib0` it could work...
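As a minimal shell sketch of that idea (assuming that the presence of any entry under `/sys/class/infiniband` is a good-enough signal; device names like `mlx5_ib0` vary per site, so nothing is hard-coded here):

```shell
# Only force the UCX PML when an InfiniBand device is actually visible.
# /sys/class/infiniband can exist while being empty, so also check that it
# contains at least one device entry (e.g. mlx5_ib0).
if [ -d /sys/class/infiniband ] && [ -n "$(ls -A /sys/class/infiniband 2>/dev/null)" ]; then
    export OMPI_MCA_pml=ucx
fi
echo "OMPI_MCA_pml=${OMPI_MCA_pml:-unset}"
```

The Lua equivalent for an Lmod context would stat the same sysfs path instead of relying on the environment.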
A similar one: to use `srun` with EESSI, you need to make sure that `SLURM_MPI_TYPE` is set to the correct value (normally `pmix`). Would a mechanism to inject these variables into EESSI be useful?
We discussed this recently. One way is to rebuild the module tree using your preferred MNS and have a hook that injects the envvars needed to use MPI effectively/easily.
Another was to have a conditionally loaded module called something like `mpi-settings`, which could be symlinked into EESSI somewhere under the existing `host_injections`. That could house these kinds of envvars.
I'm a big fan of the second option. It gives a lot of freedom to tweak per site.
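A rough sketch of what that second option could look like on the site side (all paths and the module version below are illustrative placeholders, not the real EESSI `host_injections` layout):

```shell
# Site-owned module tree (placeholder locations for illustration only):
SITE_MODULES="${TMPDIR:-/tmp}/site-modules"
HOST_INJECTIONS="${TMPDIR:-/tmp}/host_injections"

# A trivial Lmod modulefile that houses the site's MPI-related envvars:
mkdir -p "$SITE_MODULES/mpi-settings"
cat > "$SITE_MODULES/mpi-settings/1.0.lua" <<'EOF'
setenv("OMPI_MCA_pml", "ucx")
setenv("SLURM_MPI_TYPE", "pmix")
EOF

# Symlink the site tree into the injected location, so the EESSI init
# scripts could pick it up with: module use "$HOST_INJECTIONS/modules"
mkdir -p "$HOST_INJECTIONS"
ln -sfn "$SITE_MODULES" "$HOST_INJECTIONS/modules"
```

The appeal is that everything under `SITE_MODULES` stays fully in the site's hands; EESSI only needs to look for the symlink.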
Just had a quick call with @hmeiland on this, he's working on interconnect detection support in `archspec`, which could help with avoiding the originally reported problem: if `archspec` reports that the system is using IB, then we could automatically set `$OMPI_MCA_pml` to `ucx`, and avoid the need for site-specific tweaking.
Maybe something similar can be done for Slurm too. In both cases, I think it makes sense to have a way of opting out too (just in case our detection does something silly/wrong).
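The opt-out could be as simple as an environment-variable guard; `EESSI_DISABLE_PML_AUTODETECT` below is an invented name, purely for illustration:

```shell
# Auto-set the PML only when the user hasn't opted out and hasn't already
# chosen a PML themselves; otherwise leave their choice untouched.
if [ -z "${EESSI_DISABLE_PML_AUTODETECT:-}" ] && [ -z "${OMPI_MCA_pml:-}" ]; then
    if [ -d /sys/class/infiniband ]; then
        export OMPI_MCA_pml=ucx
    fi
fi
```

Checking that `OMPI_MCA_pml` is unset before touching it doubles as an implicit opt-out: anyone who exports it manually wins.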
One thing that puzzles me is why the libfabric issue happens at all: is that because Open MPI sees that libfabric is available, tries to use or query it (first, before UCX), fails utterly, and then "falls back" to UCX? I thought/expected that Open MPI was more intelligent than that, like check first which interconnect is available, and then select either libfabric or UCX? I wonder if @jsquyres can provide some more insight into that? (happy to clarify the context here if needed)
Providing an easy way to let sites do site-specific tweaks to the EESSI environment through a conditionally loaded module makes a lot of sense too, of course, since that indeed gives a lot of freedom. @ocaisa Can you open an issue on that if we don't have one yet? That seems like something we can easily support in the next pilot version (or even in the existing `2021.06`, since it's just a small tweak to the `init` scripts).
Do you know which interface the OFI component was complaining about?
I'm expecting ib0 to be used:

```
[EESSI pilot 2021.06] $ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 60:45:bd:87:94:e7 brd ff:ff:ff:ff:ff:ff
    inet 10.0.16.9/20 brd 10.0.31.255 scope global noprefixroute eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::6245:bdff:fe87:94e7/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:15:5d:33:ff:1a brd ff:ff:ff:ff:ff:ff
4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband 00:00:09:28:fe:80:00:00:00:00:00:00:00:15:5d:ff:fd:33:ff:1a brd 00:ff:ff:ff:ff:12:40:1b:80:14:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.16.1.17/16 brd 172.16.255.255 scope global ib0
       valid_lft forever preferred_lft forever
    inet6 fe80::215:5dff:fd33:ff1a/64 scope link
       valid_lft forever preferred_lft forever
```
```
31d8:00:02.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
    Subsystem: Mellanox Technologies Device 0003
    Physical Slot: 2
    Flags: bus master, fast devsel, latency 0, NUMA node 0
    Memory at fe0000000 (64-bit, prefetchable) [size=32M]
    Capabilities:
```
`ib0` is an IPoIB interface; it is not suitable for native IB traffic. `ib0` is really only intended for applications that do not use Verbs / UCX (or OFI) directly; it's intended for apps that only know how to speak TCP/UDP via the POSIX API calls. Specifically: `ib0` is an emulation layer on top of native IB, and therefore adds latency and loses bandwidth compared to the native IB stack. You'll need to talk to NVIDIA for details.
I can't tell what version of Open MPI you're using from this GitHub issue. Is it Open MPI v4.1.x?

What I was trying to ask in my prior comment was: the OFI component is complaining about a specific interface (I would doubt that it is `ib0` -- it is likely some other native IB interface). Do you know which one it is complaining about? Try running with `mpirun --mca mtl_base_verbose 100 ...` and see if there's any useful output in there.
I agree that disabling the CM PML (and therefore the OFI MTL) will avoid your problem. But I also agree with @boegel that Open MPI should behave better than that, such that you wouldn't need to manually disable CM/OFI.
That being said, what version of libfabric do you have? Perhaps it's an old libfabric that is incorrectly querying IB interfaces...?
```
[EESSI pilot 2021.06] $ mpirun --version
mpirun (Open MPI) 4.0.3
```

with `libfabric/1.11.0`.

The Mellanox ConnectX-5 is the physical device, and is available through the kernel as `/sys/class/infiniband/mlx5_ib0`.
I'm not seeing any specific interface being referenced below...
```
[EESSI pilot 2021.06] $ mpirun --mca mtl_base_verbose 100 -n 2 osu_bw
[ip-0A001009:17523] mca: base: components_register: registering framework mtl components
[ip-0A001009:17523] mca: base: components_register: found loaded component ofi
[ip-0A001009:17524] mca: base: components_register: registering framework mtl components
[ip-0A001009:17524] mca: base: components_register: found loaded component ofi
[ip-0A001009:17524] mca: base: components_register: component ofi register function successful
[ip-0A001009:17524] mca: base: components_register: found loaded component psm2
[ip-0A001009:17523] mca: base: components_register: component ofi register function successful
[ip-0A001009:17523] mca: base: components_register: found loaded component psm2
[ip-0A001009:17524] mca: base: components_register: component psm2 register function successful
[ip-0A001009:17523] mca: base: components_register: component psm2 register function successful
[ip-0A001009:17524] mca: base: components_open: opening mtl components
[ip-0A001009:17524] mca: base: components_open: found loaded component ofi
[ip-0A001009:17523] mca: base: components_open: opening mtl components
[ip-0A001009:17523] mca: base: components_open: found loaded component ofi
[ip-0A001009:17523] mca: base: components_open: component ofi open function successful
[ip-0A001009:17524] mca: base: components_open: component ofi open function successful
[ip-0A001009:17524] mca: base: components_open: found loaded component psm2
[ip-0A001009:17523] mca: base: components_open: found loaded component psm2
[ip-0A001009:17523] mca: base: close: component psm2 closed
[ip-0A001009:17523] mca: base: close: unloading component psm2
[ip-0A001009:17524] mca: base: close: component psm2 closed
[ip-0A001009:17524] mca: base: close: unloading component psm2
[ip-0A001009:17523] mca:base:select: Auto-selecting mtl components
[ip-0A001009:17523] mca:base:select:( mtl) Querying component [ofi]
[ip-0A001009:17523] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[ip-0A001009:17523] mca:base:select:( mtl) Selected component [ofi]
[ip-0A001009:17523] select: initializing mtl component ofi
[ip-0A001009:17524] mca:base:select: Auto-selecting mtl components
[ip-0A001009:17524] mca:base:select:( mtl) Querying component [ofi]
[ip-0A001009:17524] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[ip-0A001009:17524] mca:base:select:( mtl) Selected component [ofi]
[ip-0A001009:17524] select: initializing mtl component ofi
[ip-0A001009:17523] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[ip-0A001009:17523] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[ip-0A001009:17523] mtl_ofi_component.c:347: mtl:ofi:prov: verbs;ofi_rxm
[ip-0A001009:17524] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[ip-0A001009:17524] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[ip-0A001009:17524] mtl_ofi_component.c:347: mtl:ofi:prov: verbs;ofi_rxm
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: ip-0A001009
  Location: mtl_ofi_component.c:610
  Error: No data available (61)
--------------------------------------------------------------------------
[ip-0A001009:17523] select: init returned failure for component ofi
[ip-0A001009:17523] select: no component selected
[ip-0A001009:17524] select: init returned failure for component ofi
[ip-0A001009:17524] select: no component selected
[ip-0A001009:17523] mca: base: close: component ofi closed
[ip-0A001009:17523] mca: base: close: unloading component ofi
[ip-0A001009:17524] mca: base: close: component ofi closed
[ip-0A001009:17524] mca: base: close: unloading component ofi
# OSU MPI Bandwidth Test v5.6.3
# Size      Bandwidth (MB/s)
1                      11.04
2                      22.56
4                      44.36
8                      91.57
16                    179.17
32                    361.96
64                    750.71
128                   808.43
256                  1530.46
512                  2444.58
1024                 3866.79
2048                 6090.55
4096                 7274.32
8192                 9583.67
16384                8143.13
32768               11699.30
65536               14597.15
131072              16974.20
262144              18190.47
524288              17385.82
1048576             13818.62
2097152             10492.79
4194304             10047.21
[ip-0A001009:17519] 1 more process has sent help message help-mtl-ofi.txt / OFI call fail
[ip-0A001009:17519] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
```
Thanks for helping here!!
Let me inquire with others in the Open MPI community and get back to you.
Can you try upgrading to Open MPI v4.1.1? I have a very dim (and possibly incorrect) recollection that we fixed an issue that sounds like this since v4.0.x.
I'll try the upgrade!
Thanks. If the upgrade doesn't fix it, please run with `export FI_LOG_LEVEL=info` (and possibly `mpirun -x FI_LOG_LEVEL ...`), which will tell `libfabric.so` to emit lots of juicy debug info, which might give us a little more insight.
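Put together, the debug run could look like the fragment below (the log filename is arbitrary; the `mpirun` line is left commented out since it needs the MPI environment from this thread):

```shell
# FI_LOG_LEVEL=info asks libfabric to log provider/interface selection
# details to stderr.
export FI_LOG_LEVEL=info
echo "FI_LOG_LEVEL=$FI_LOG_LEVEL"
# -x forwards the variable to all ranks; capture libfabric's chatter:
# mpirun -x FI_LOG_LEVEL -n 2 osu_bw 2> fi_debug.log
# grep -i 'fi_domain' fi_debug.log
```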
@hmeiland Let me know if you need help with that, it's pretty easy for us to ingest a couple of additional software installations in the 2021.06 version of the EESSI pilot to allow testing of this (especially if we just do it for a single specific CPU target)
Looks like it is solved in Open MPI 4.1.1 with libfabric 1.12.1, which are included in EasyBuild's `gompi/2021a`:
```
$ mpirun --version
mpirun (Open MPI) 4.1.1

Report bugs to http://www.open-mpi.org/community/help/

$ mpirun -n 2 osu_bw
# OSU MPI Bandwidth Test v5.7.1
# Size      Bandwidth (MB/s)
1                      14.53
2                      29.09
4                      59.38
8                     118.41
16                    238.19
32                    466.83
64                    984.38
128                  1029.14
256                  1781.63
512                  2766.15
1024                 4144.41
2048                 6388.11
4096                 8656.42
8192                10748.83
16384                6012.96
32768                9456.37
65536               14436.65
131072              19507.37
262144              24342.00
524288              25778.23
1048576             20787.43
2097152             13602.21
4194304             12378.99
```
Will this be a suitable candidate toolchain to move EESSI towards?
@hmeiland Yes, definitely. We hope/plan to jump to newer toolchains in the next EESSI pilot version.
Is it only fixed in `gompi/2021a`, or also in `gompi/2020b`?

Only `gompi/2021a`; not in `gompi/2020b` (which is OpenMPI 4.0.5), tested with both...
@jsquyres So it seems like the problem is indeed gone with Open MPI 4.1...

Do you happen to have any pointers to where this was fixed? Would it be doable to backport that to Open MPI 4.0.x?
I don't know offhand where it was fixed, I'm sorry. It should be easy to test if it has been fixed in the 4.0.x series -- you can try the latest 4.0.x nightly snapshot tarball from here: https://www.open-mpi.org/nightly/v4.0.x/
Just a small FYI: ran into this on Snellius while running https://github.com/NVIDIA/nccl-tests, using EasyBuild's `NCCL/2.18.3-GCCcore-12.3.0-CUDA-12.1.1` + `OpenMPI/4.1.5-GCC-12.3.0`. The `export OMPI_MCA_pml=ucx` fixed the error.
@robogast We currently don't ship NCCL with EESSI (though it's not far away), so I'm guessing you are getting this from somewhere else? I think you may be hitting something related to how EasyBuild builds UCX, and in particular the way we implement the CUDA plugin for UCX. This is only for UCX, not libfabric, so it may be the case that you need to set it explicitly in that scenario.
@ocaisa No, I haven't run into this issue through EESSI; it was on our cluster, which uses EasyBuild. A quick Google search led me to this issue, and I just wanted to log that I'm running into the same one.
Let me know if you need NCCL ReFrame tests, I've just created them for our cluster :)
Exactly like @robogast, we don't run into this issue with EESSI, but use EasyBuild on our cluster.
We are seeing the same problem:

```
Open MPI failed an OFI Libfabric library call (fi_domain). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: gpu0201
  Location: mtl_ofi_component.c:939
  Error: No data available (61)
```

with `mpi/OpenMPI/4.1.4-GCC-12.2.0` and `system/CUDA/12.1.0`.
While running the OSU benchmarks on a single system (CentOS Linux release 7.9.2009 (Core)), Open MPI gives the `fi_domain` errors shown earlier in this thread. This can be prevented by pointing Open MPI to use UCX (which is loaded): adding `export OMPI_MCA_pml=ucx` prevents the errors. Can this variable be set in the OpenMPI module?