lexming opened 4 years ago
One additional note: it seems that downgrading OFED to v4.5 would fix the broken MLX provider, based on the report in https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/842957
However, this is hardly a solution as that version is quite old. For instance, CentOS 7 is on OFED v4.7 already.
I'm basically seeing the same issue during the h5py sanity check (for #10160):
== 2020-03-22 20:10:38,425 build_log.py:169 ERROR EasyBuild crashed with an error (at easybuild/base/exceptions.py:124 in __init__): Sanity check failed: command "/software/Python/3.8.2-GCCcore-9.3.0/bin/python -c "import h5py"" failed; output:
Abort(2140047) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1334):
MPIDU_bc_table_create(444)...: (at easybuild/framework/easyblock.py:2634 in _sanity_check_step)
More info:
$ ucx_info -v
# UCT version=1.5.1 revision 0000000
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check
$ fi_info
provider: mlx
fabric: mlx
domain: mlx
version: 1.5
type: FI_EP_UNSPEC
protocol: FI_PROTO_MLX
provider: mlx;ofi_rxm
fabric: mlx
domain: mlx
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
General purpose workaround: FI_PROVIDER=verbs,tcp. This setting for the libfabric provider will use IB if it's available or fall back to TCP. It should be possible to combine any number of providers, as described in https://software.intel.com/en-us/mpi-developer-guide-linux-ofi-providers-support
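In EasyBuild terms, this workaround can be applied via modextravars in the impi easyconfig (as also noted later in this thread). A minimal sketch, assuming no other modextravars entries are already defined:
# sketch: extra lines for impi-2019.6.166-iccifort-2020.0.166.eb to force the verbs/tcp providers
modextravars = {
    'FI_PROVIDER': 'verbs,tcp',
}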
@lexming But that's only advisable for systems with Infiniband though?
@boegel with the setting FI_PROVIDER=verbs,tcp, systems without IB will just fall back to TCP seamlessly. No error. The setting that only works on systems with IB is FI_PROVIDER=verbs.
It is also possible to use IMPI with an external libfabric by setting I_MPI_OFI_LIBRARY_INTERNAL=0.
Test with IMPI 2019.6 and libfabric-1.8.1, which is the last upstream release that still has the mlx provider, does not work. The error is different but the mlx provider still fails. It must be noted that the code for mlx in libfabric-1.8.1 was released two years ago for UCX v1.3, and we have UCX v1.5 on our systems, with several changes and deprecated functions. So, in this case mlx probably fails for different reasons than in the standard IMPI v2019.6. The point is that it fails and hence is not a usable alternative.
$ module load intel/2020a
$ module load libfabric/1.8.1-GCCcore-9.3.0
$ I_MPI_OFI_LIBRARY_INTERNAL=0 FI_PROVIDER=mlx ./test
Abort(1091471) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1060): OFI fi_open domain failed (ofi_init.c:1060:MPIDI_OFI_mpi_init_hook:No data available)
Test with IMPI 2019.6 and libfabric-1.9.1, which is the latest upstream release, does work. However, the mlx provider is no longer available; this works because all providers compatible with the host hardware are automatically enabled (e.g. verbs, tcp, etc.).
$ module load intel/2020a
$ module load libfabric/1.9.1-GCCcore-9.3.0
$ FI_PROVIDER_PATH='' I_MPI_OFI_LIBRARY_INTERNAL=0 ./test
Hello world: rank 0 of 1 running on login2.cerberus.os
Therefore, this second test does not use mlx and is similar to forcing IMPI 2019.6 with the bundled libfabric-1.9.0a1-impi to use verbs or tcp by setting FI_PROVIDER=verbs,tcp.
In conclusion, we have two workaround solutions at our disposal:
1. Force IMPI v2019.6 with the bundled libfabric-1.9.0a1-impi to use other providers, such as verbs or tcp. Requirements:
   - modextravars with FI_PROVIDER=verbs,tcp added to impi-2019.6.166-iccifort-2020.0.166.eb
2. Use IMPI v2019.6 with an external libfabric-1.9.1, which by default enables all compatible providers. Requirements in impi-2019.6.166-iccifort-2020.0.166.eb (see the sketch after this list):
   - modextravars to disable FI_PROVIDER_PATH
   - modextravars with I_MPI_OFI_LIBRARY_INTERNAL=0
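As a rough illustration of the second option, the extra lines in impi-2019.6.166-iccifort-2020.0.166.eb might look like the sketch below. This assumes that an empty FI_PROVIDER_PATH is enough to disable the bundled providers (as in the manual test above) and that the external libfabric is made available as a dependency; the dependency line is hypothetical and could equally be left to users via module load.
# sketch: point IMPI at an external libfabric instead of the bundled one
dependencies = [('libfabric', '1.9.1')]  # hypothetical: external libfabric module
modextravars = {
    'I_MPI_OFI_LIBRARY_INTERNAL': '0',  # do not use the libfabric bundled with IMPI
    'FI_PROVIDER_PATH': '',             # do not pick up the bundled providers
}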
@lexming Thanks a lot for digging into this!
My preference goes to using the external libfabric 1.9.1, since that avoids "hardcoding" stuff to Infiniband via $FI_PROVIDER.
Can you clarify why setting $FI_PROVIDER_PATH is needed?
Also, should we reach out to Intel support on this, and try to get some feedback on the best way forward (and maybe also ask how they managed to overlook this issue)?
Regarding $FI_PROVIDER_PATH, the easyblock impi sets that path to the bundled providers shipped with IMPI in $EBROOTIMPI/intel64/libfabric/lib/prov (as it should be). In this case, $FI_PROVIDER_PATH has to be unset to use the providers in the external libfabric-1.9.1. And it is not necessary to set any other path, because the external libfabric bundles the providers in the libfabric library.
On our side, we will contact Intel about this issue. The real solution requires fixing IMPI v2019.6 as far as I can tell.
I tested on 2 clusters:
provider: mlx
fabric: mlx
domain: mlx
version: 1.5
type: FI_EP_UNSPEC
protocol: FI_PROTO_MLX
test works ok with srun -n 2 ./test and with mpirun -n 2 ./test (if I set UCX_IB_MLX5_DEVX=no, but I need that for Open MPI as well).
On the second cluster, fi_info reports:
fabric: mlx
domain: mlx
version: 1.5
type: FI_EP_UNSPEC
protocol: FI_PROTO_MLX
provider: mlx;ofi_rxm
fabric: mlx
domain: mlx
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXM
works ok too! Note that the fi_info list for the second case is longer.
Also note that the easyblock for Intel MPI has a parameter ofi_internal = False which you can use to disable that without needing to play with modextravars.
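For reference, with that easyblock parameter the easyconfig change could look roughly like the sketch below (whether FI_PROVIDER_PATH still needs clearing on top of this may depend on the easyblock version; the libfabric dependency line is again hypothetical):
# sketch: let the impi easyblock disable the bundled libfabric/OFI
ofi_internal = False

dependencies = [('libfabric', '1.9.1')]  # hypothetical: external libfabric module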
Hmm, I get errors if I run ./test without srun/mpirun, but if I use one of those it's ok.
@bartoldeman thank you for the feedback. This is very interesting: on my system executing the test with mpirun ./test does indeed work. This is good news, as it means that mlx does work for inter-node jobs.
The reason I never tried mpirun is that the origin of this issue is failed sanity checks of Python modules linking with MPI (e.g. h5py and TensorFlow). In those cases importing the respective module in Python initializes MPI before running any distributed execution (so no mpirun), and those fail with the aforementioned errors.
If mlx is working as intended, this is a change of behaviour compared to other providers such as tcp, which can be used without mpirun and is equivalent to mpirun -n 1.
@boegel the sanity check command of h5py does work if called with mpirun. Hence, the best solution seems to be to change the sanity check command of any Python module using MPI to
mpirun -n 1 python -c "import module_name"
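In an easyconfig that could translate into something like the following sketch (illustrative only; the actual h5py change was made in #10246 and its exact contents may differ):
# sketch: run the import check through mpirun so MPI initialization goes through the launcher
sanity_check_commands = ["mpirun -n 1 python -c 'import h5py'"]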
@lexming So there's actually no real problem: as long as mpirun is used (which should always be used anyway, I think), Intel MPI 2019 update 6 works just fine?
@lexming h5py import check fixed in #10246
@boegel yeah, with mpirun it seems to work just fine, but I have not done any extensive testing yet. On the EasyBuild side, all that needs to be done is make sure that sanity checks of packages with MPI are done with mpirun.
Keep in mind that using those packages (such as h5py in intel/2020a) will now require using mpirun at all times, which might break some users' workflows. But that is not an issue for EasyBuild in my opinion.
There is some guidance about singleton MPI in the MPI standard (https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node254.htm#Node254), but it has been problematic in my experience: sometimes it works, sometimes not. E.g. Open MPI on QLogic/Intel PSM InfiniPath needed an environment variable.
I'm hitting a serious issue with mpiexec using Intel MPI 2019 update 6 when running in Slurm jobs:
$ mpiexec -np 1 /bin/ls
Segmentation fault
See details in https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/807359#comment-1955057 .
@akesandgren has something in his hooks for this. I've borrowed this from his hooks in our local easyconfigs:
postinstallcmds = [
    # Fix mpirun from IntelMPI to explicitly unset I_MPI_PMI_LIBRARY,
    # it can only be used with srun.
    "sed -i 's@\\(#!/bin/sh.*\\)$@\\1\\nunset I_MPI_PMI_LIBRARY@' %(installdir)s/intel64/bin/mpirun",
]
I got feedback from Intel on this specific issue (precisely, the crashes of executables linking to Intel MPI 2019.6 if executed without mpiexec/mpirun with the mlx provider).
The Intel support team has been able to reproduce the issue and acknowledges it. They will escalate the issue and it should be fixed at some point.
We got a new reply from Intel regarding this issue:
Our engineering team is planning to have this resolved in 2019 Update 8.
@lexming Can you ask them when they expect update 8 to be released?
Feel free to tell them that this is holding us back from going forward with intel/2020a in EasyBuild, I'm sure that'll convince them to get their act together... ;)
@boegel done, I'll update this issue as soon as I get any reply
FYI, IMPI v2019.8.254 is released. I've tested a simple MPI code and it seems that the new release resolves the issue.
I confirm that the new update release IMPI v2019.8.254 fixes this issue. This can be tested with the easyconfig in https://github.com/easybuilders/easybuild-easyconfigs/pull/11337 .
Intel has introduced a new MLX provider for libfabric in IMPI v2019.6, the one used in the intel/2020a toolchain. More info: https://software.intel.com/en-us/articles/improve-performance-and-stability-with-intel-mpi-library-on-infiniband

Issue: currently, all executables fail to initialize MPI with intel/2020a on our nodes with Mellanox cards.

Steps to reproduce:
1. Use a system with a Mellanox card. Check that version 1.4 or higher of UCX is installed.
2. Install and load intel-2020a.eb (I'll use the full toolchain for simplicity).
3. Check if the provider of libfabric is listed as mlx. This can be done with the fi_info tool from IMPI v2019.6 in intel/2020a.
4. Compile and execute the minimal test program from IMPI v2019.6.

Result: output on our systems with Mellanox can be found at https://gist.github.com/lexming/fa6cd07bdb8e4d35be873b501935bb61

Solution/workaround: I have not found a solution to the failing MLX provider. Moreover, the official libfabric project has removed the mlx provider altogether since version 1.9 due to lack of maintenance (https://github.com/ofiwg/libfabric/pull/5281/commits/d8c8a2bc3f1c6de7d1507f3f0293c94c77a431ba). IMPI v2019.6 uses its own fork labelled 1.9.0a1-impi.

The workaround is to switch to a different provider by setting the FI_PROVIDER environment variable. On a system with a Mellanox card this can be set to tcp or verbs. Even though this works, the performance impact of this change is unclear, and it defeats the purpose of having a framework that can automatically detect the best transport layer.