easybuilders / easybuild-easyconfigs

IMPI v2019.6: MLX provider in libfabric not working #10213

Open lexming opened 4 years ago

lexming commented 4 years ago

Intel has introduced a new MLX provider for libfabric in IMPI v2019.6, which is the version used in the intel/2020a toolchain. More info: https://software.intel.com/en-us/articles/improve-performance-and-stability-with-intel-mpi-library-on-infiniband

Issue: currently, all executables fail to initialize MPI with intel/2020a on our nodes with Mellanox cards.

Steps to reproduce:

  1. Use a system with a Mellanox card. Check that version 1.4 or higher of UCX is installed

    $  ucx_info -v
    # UCT version=1.5.1 revision 0000000
    # configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check
  2. Install and load intel-2020a.eb (I'll use the full toolchain for simplicity)

  3. Check if the provider of libfabric is listed as mlx. This can be done with the fi_info tool from IMPI v2019.6 in intel/2020a.

    $ fi_info
    provider: mlx
        fabric: mlx
        domain: mlx
        version: 1.5
        type: FI_EP_UNSPEC
        protocol: FI_PROTO_MLX
    provider: mlx;ofi_rxm
        fabric: mlx
        domain: mlx
        version: 1.0
        type: FI_EP_RDM
        protocol: FI_PROTO_RXM
  4. Compile and execute the minimal test program shipped with IMPI v2019.6

    $ mpicc $EBROOTIMPI/test/test.c -o test
    $ FI_LOG_LEVEL=debug ./test

Result: the output on our systems with Mellanox cards can be found at https://gist.github.com/lexming/fa6cd07bdb8e4d35be873b501935bb61

Solution/workaround: I have not found a solution to the failing MLX provider. Moreover, the official libfabric project has removed the mlx provider altogether as of version 1.9 due to lack of maintenance (https://github.com/ofiwg/libfabric/pull/5281/commits/d8c8a2bc3f1c6de7d1507f3f0293c94c77a431ba). IMPI v2019.6 uses its own fork labelled 1.9.0a1-impi.

The workaround is to switch to a different provider by setting the FI_PROVIDER environment variable. On a system with a Mellanox card this can be set to tcp or verbs. Even though this works, the performance impact of this change is unclear, and it defeats the purpose of having a framework that can automatically detect the best transport layer.
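
For illustration, a minimal sketch of the workaround on a node with a Mellanox card (which provider to pick depends on the system; verbs and tcp are the ones mentioned above):

    $ # force the verbs provider instead of the broken mlx one
    $ export FI_PROVIDER=verbs
    $ ./test
    $ # or fall back to plain TCP
    $ FI_PROVIDER=tcp ./test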

lexming commented 4 years ago

One additional note: it seems that downgrading OFED to v4.5 would fix the broken MLX provider, based on the report in https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/842957

However, this is hardly a solution as that version is quite old. For instance, CentOS 7 is already on OFED v4.7.

boegel commented 4 years ago

I'm basically seeing the same issue during the h5py sanity check (for #10160):

== 2020-03-22 20:10:38,425 build_log.py:169 ERROR EasyBuild crashed with an error (at easybuild/base/exceptions.py:124 in __init__): Sanity check failed: command "/software/Python/3.8.2-GCCcore-9.3.0/bin/python -c "import h5py"" failed; output:
Abort(2140047) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1334):
MPIDU_bc_table_create(444)...: (at easybuild/framework/easyblock.py:2634 in _sanity_check_step)

More info:

$ ucx_info -v
# UCT version=1.5.1 revision 0000000
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check
$ fi_info
provider: mlx
    fabric: mlx
    domain: mlx
    version: 1.5
    type: FI_EP_UNSPEC
    protocol: FI_PROTO_MLX
provider: mlx;ofi_rxm
    fabric: mlx
    domain: mlx
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXM
lexming commented 4 years ago

General-purpose workaround: FI_PROVIDER=verbs,tcp. This setting for the libfabric provider will use IB if it is available or fall back to TCP. It should be possible to combine any number of providers, as described in https://software.intel.com/en-us/mpi-developer-guide-linux-ofi-providers-support
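
As a quick check (a sketch, not verbatim output from any of our systems), fi_info should no longer offer the mlx provider once FI_PROVIDER is restricted:

    $ export FI_PROVIDER=verbs,tcp
    $ fi_info                  # expected to list only verbs/tcp entries now
    $ mpirun -n 2 ./test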

boegel commented 4 years ago

@lexming But that's only advisable for systems with Infiniband though?

lexming commented 4 years ago

@boegel with the setting FI_PROVIDER=verbs,tcp, systems without IB will just fall back to TCP seamlessly. No error. The setting that only works on systems with IB is FI_PROVIDER=verbs.

lexming commented 4 years ago

It is also possible to use IMPI with an external libfabric by setting I_MPI_OFI_LIBRARY_INTERNAL=0.

In conclusion, we have two workaround solutions at our disposal:

  1. Set FI_PROVIDER=verbs,tcp to bypass the broken mlx provider.
  2. Use an external libfabric (e.g. libfabric 1.9.1) by setting I_MPI_OFI_LIBRARY_INTERNAL=0 and unsetting FI_PROVIDER_PATH.

boegel commented 4 years ago

@lexming Thanks a lot for digging into this!

My preference goes to using the external libfabric 1.9.1, since that avoids "hardcoding" stuff to Infiniband via $FI_PROVIDER.

Can you clarify why setting $FI_PROVIDER_PATH is needed?

Also, should we reach out to Intel support on this, and try to get some feedback on the best way forward (and maybe also ask how they managed to overlook this issue)?

lexming commented 4 years ago

Regarding $FI_PROVIDER_PATH, the impi easyblock sets that path to the bundled providers shipped with IMPI in $EBROOTIMPI/intel64/libfabric/lib/prov (as it should). In this case, $FI_PROVIDER_PATH has to be unset to use the providers of the external libfabric-1.9.1. It is not necessary to set any other path, because the external libfabric builds its providers into the libfabric library itself.
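
Putting this together, a rough sketch of how the external libfabric could be used by hand (the libfabric module name here is an assumption):

    $ module load libfabric/1.9.1-GCCcore-9.3.0   # assumed module name
    $ export I_MPI_OFI_LIBRARY_INTERNAL=0         # do not use the libfabric bundled with IMPI
    $ unset FI_PROVIDER_PATH                      # the external libfabric has its providers built in
    $ mpirun -n 2 ./test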

On our side, we will contact Intel about this issue. The real solution requires fixing IMPI v2019.6 as far as I can tell.

bartoldeman commented 4 years ago

I tested on 2 clusters:

  1. Béluga with UCX 1.7.0, ConnectX-5, CentOS-7.7, no MOFED.
    provider: mlx
        fabric: mlx
        domain: mlx
        version: 1.5
        type: FI_EP_UNSPEC
        protocol: FI_PROTO_MLX

The test works OK with srun -n 2 ./test and with mpirun -n 2 ./test (if I set UCX_IB_MLX5_DEVX=no, but I need that for Open MPI as well).

  2. Graham with UCX 1.7.0, ConnectX-4, CentOS-7.5, no MOFED
    provider: mlx
        fabric: mlx
        domain: mlx
        version: 1.5
        type: FI_EP_UNSPEC
        protocol: FI_PROTO_MLX
    provider: mlx;ofi_rxm
        fabric: mlx
        domain: mlx
        version: 1.0
        type: FI_EP_RDM
        protocol: FI_PROTO_RXM

    works ok too! Note that the fi_info list for the second case is longer.

Also note that the easyblock for Intel MPI has a parameter ofi_internal, which you can set to False to disable the internal libfabric without needing to play with modextravars.

bartoldeman commented 4 years ago

Hmm, I get errors if I run ./test without srun/mpirun, but if I use one of those it's ok.

lexming commented 4 years ago

@bartoldeman thank you for the feedback. This is very interesting: on my system, executing the test with mpirun ./test does indeed work. This is good news, as it means that mlx does work for inter-node jobs.

The reason I never tried mpirun is that this issue originated from failed sanity checks of Python modules linking with MPI (e.g. h5py and TensorFlow). In those cases, importing the respective module in Python initializes MPI before running any distributed execution (so no mpirun), and those imports fail with the aforementioned errors.

If mlx is working as intended, this is a change of behaviour compared to other providers such as tcp, which can be used without mpirun and then behave as the equivalent of mpirun -n 1.
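
To make the difference concrete, a sketch based on the behaviour reported above:

    $ ./test                    # mlx provider: singleton init fails (MPIR_Init_thread error)
    $ mpirun -n 1 ./test        # mlx provider: works
    $ FI_PROVIDER=tcp ./test    # tcp provider: singleton init works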

lexming commented 4 years ago

@boegel the sanity check command of h5py does work if called with mpirun. Hence, the best solution seems to be to change the sanity check command of any Python module using MPI to

mpirun -n 1 python -c "import module_name"
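
For example (a sketch; the exact module name is an assumption), the h5py check would become:

    $ module load h5py/2.10.0-intel-2020a-Python-3.8.2   # assumed module name
    $ mpirun -n 1 python -c "import h5py"
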
boegel commented 4 years ago

@lexming So, there's actually no real problem, as long as mpirun is used (which should always be used anyway, I think), Intel MPI 2019 update 6 works just fine?

boegel commented 4 years ago

@lexming h5py import check fixed in #10246

lexming commented 4 years ago

@boegel yeah, with mpirun it seems to work just fine, but I have not done any extensive testing yet. On the EasyBuild side, all that needs to be done is to make sure that sanity checks of packages with MPI are run with mpirun.

Keep in mind that using those packages (such as h5py in intel/2020a) will now require using mpirun at all times, which might break some users' workflows. But that is not an issue for EasyBuild in my opinion.

bartoldeman commented 4 years ago

There is some guidance about singleton MPI in the MPI standard (https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node254.htm#Node254), but it has been problematic in my experience: sometimes it works, sometimes it doesn't. E.g. Open MPI on QLogic/Intel PSM InfiniPath needed an environment variable.

boegel commented 4 years ago

I'm hitting a serious issue with mpiexec using Intel MPI 2019 update 6 when running in Slurm jobs:

$ mpiexec -np 1 /bin/ls
Segmentation fault

See details in https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/807359#comment-1955057 .

bartoldeman commented 4 years ago

@akesandgren has something in his hooks for this. I've borrowed this from his hooks in our local easyconfigs:

postinstallcmds = [
    # Fix mpirun from IntelMPI to explicitly unset I_MPI_PMI_LIBRARY
    # it can only be used with srun.
    "sed -i 's@\\(#!/bin/sh.*\\)$@\\1\\nunset I_MPI_PMI_LIBRARY@' %(installdir)s/intel64/bin/mpirun",
]
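
If I understand the fix correctly (a sketch, assuming I_MPI_PMI_LIBRARY is set to Slurm's PMI library in the impi module), both launchers should then work inside a Slurm job:

    $ srun -n 2 ./test      # srun still uses the Slurm PMI library via I_MPI_PMI_LIBRARY
    $ mpirun -n 2 ./test    # the patched mpirun unsets it and uses Hydra's own PMI, avoiding the crash
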
lexming commented 4 years ago

I got feedback from Intel on this specific issue (specifically, the crashes of executables linked to Intel MPI 2019.6 when executed without mpiexec/mpirun with the mlx provider). The Intel support team has been able to reproduce the issue and they acknowledge it. They will escalate it, and it should be fixed at some point.

lexming commented 4 years ago

We got a new reply from Intel regarding this issue:

Our engineering team is planning to have this resolved in 2019 Update 8.

boegel commented 4 years ago

@lexming Can you ask them when they expect update 8 to be released?

Feel free to tell them that this is holding us back from going forward with intel/2020a in EasyBuild, I'm sure that'll convince them to get their act together... ;)

lexming commented 4 years ago

@boegel done, I'll update this issue as soon as I get any reply

maxim-masterov commented 4 years ago

FYI, IMPI v2019.8.254 has been released. I've tested a simple MPI code and it seems that the new release resolves the issue.

lexming commented 4 years ago

I confirm that the new update release IMPI v2019.8.254 fixes this issue. This can be tested with the easyconfig in https://github.com/easybuilders/easybuild-easyconfigs/pull/11337 .