eth-cscs / sarus

OCI-compatible engine to deploy Linux containers on HPC environments.
https://sarus.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

RDMA failed to open device #19

Open NicholasRasi opened 3 years ago

NicholasRasi commented 3 years ago

Hello, I am trying to run some MPI benchmarks with Sarus containers, using OpenMPI 4. The nodes are RDMA-capable and have InfiniBand. Everything works fine without the container, and if I run ibv_devinfo on the host I get:

hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         16.26.0206
        node_guid:                      0015:5dff:fe33:ff0d
        sys_image_guid:                 506b:4b03:00fb:f03a
        vendor_id:                      0x02c9
        vendor_part_id:                 4120
        hw_ver:                         0x0
        board_id:                       MT_0000000010
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               700
                        port_lmc:               0x00
                        link_layer:             InfiniBand

But if I run it inside a container I get "Failed to open device". So I tried to bind-mount the device into the container, but that does not work without sudo:

[user@controller1 ~]$ sarus run --mount=src=/dev/infiniband/uverbs0,dst=/dev/infiniband/uverbs0,type=bind nichr/hpc-bench:v2 bash
[895.208658764] [controller1-5327] [main] [ERROR] Error trace (most nested error last):
#0   createFoldersIfNecessary at "Utility.cpp":437 Failed to create directory "/opt/sarus/1.3.0-Release/var/OCIBundleDir/rootfs/dev/infiniband"
#1   "unknown function" at "unknown file":-1 boost::filesystem::create_directory: Permission denied: "/opt/sarus/1.3.0-Release/var/OCIBundleDir/rootfs/dev/infiniband"

On the other hand, it works with sudo and the device is recognized inside the container.

1. Is there any other way to mount the device without sudo?

The guide says that I need to use the SSH hook in order to run OpenMPI. But if I launch Sarus with sudo, the bind mount, and srun:

[user@controller1 sarus]$ srun sudo /opt/sarus/1.3.0-Release/bin/sarus run --ssh --mount=src=/dev/infiniband/uverbs0,dst=/dev/infiniband/uverbs0,type=bind nichr/hpc-bench:v2 bash -c 'if [ $SLURM_PROCID -eq 0 ]; then mpirun -npernode 1 --allow-run-as-root --map-by node -mca pml ucx --mca btl ^vader,tcp,openib -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_IB_PKEY=$UCX_IB_PKEY /opt/benchmarks/mpiBench/mpiBench -e 1K; else sleep infinity; fi'

I got:

bash: line 0: [: -eq: unary operator expected
bash: line 0: [: -eq: unary operator expected

2. If I use OpenMPI I need the SSH hook, am I right?


I have created the container with the following Dockerfile:

FROM centos:7.6.1810

# set up base
RUN yum install -y epel-release \
    && yum groupinstall -y "Development tools" \
    && yum install -y \
        libusbx pciutils-libs pciutils lsof ethtool fuse-libs \
        ca-certificates wget openssh-server openssh-clients net-tools \
        numactl-devel gtk2 atk cairo tcsh libnl3 tcl libmnl tk

# set up workdir
ENV INSTALL_PREFIX=/opt
WORKDIR /tmp/mpi

# download and install mlnx
RUN wget -q -O - http://content.mellanox.com/ofed/MLNX_OFED-5.1-0.6.6.0/MLNX_OFED_LINUX-5.1-0.6.6.0-rhel7.6-x86_64.tgz | tar -xzf - \
    && ./MLNX_OFED_LINUX-5.1-0.6.6.0-rhel7.6-x86_64/mlnxofedinstall --user-space-only --without-fw-update --all --force \
    && rm -rf MLNX_OFED_LINUX-5.1-0.6.6.0-rhel7.6-x86_64

# download and install HPC-X
ENV HPCX_VERSION="v2.7.0"
RUN cd ${INSTALL_PREFIX} && \
    wget -q -O - https://azhpcstor.blob.core.windows.net/azhpc-images-store/hpcx-v2.7.0-gcc9.2.0-MLNX_OFED_LINUX-5.1-0.6.6.0-redhat7.6-x86_64.tbz | tar -xjf - \
    && HPCX_PATH=${INSTALL_PREFIX}/hpcx-${HPCX_VERSION}-gcc-MLNX_OFED_LINUX-5.1-0.6.6.0-redhat7.6-x86_64 \
    && HCOLL_PATH=${HPCX_PATH}/hcoll \
    && UCX_PATH=${HPCX_PATH}/ucx

# download and install OpenMPI
ENV OMPI_VERSION="4.0.4"
RUN wget -q -O - https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-${OMPI_VERSION}.tar.gz | tar -xzf - \
    && cd openmpi-${OMPI_VERSION} \
    && ./configure --with-ucx=${UCX_PATH} --with-hcoll=${HCOLL_PATH} --enable-mpirun-prefix-by-default \
    && make -j 8 && make install \
    && cd .. \
    && rm -rf openmpi-${OMPI_VERSION} 

# install and setup benchmarks
WORKDIR /opt/benchmarks

# download and install mpiBench
RUN wget -q -O - https://codeload.github.com/LLNL/mpiBench/tar.gz/master | tar -xzf - \
    && mv ./mpiBench-master ./mpiBench \
    && cd mpiBench/ \
    && make

# download and install osu micro benchmarks
RUN wget -q -O - http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.6.3.tar.gz | tar -xzf - \
    && mv ./osu-micro-benchmarks-5.6.3 ./osu-micro-benchmarks \
    && cd osu-micro-benchmarks/ \
    && ./configure CC=mpicc CXX=mpicxx \
    && make \
    && make install

I am new to Sarus and the HPC world, thank you for your support!

Madeeks commented 3 years ago

Hello @NicholasRasi, thank you for opening this issue.

  1. "Is there any other way to mount the device without sudo?"
     We are looking into this behavior and will let you know more as soon as possible.

  2. "If I use OpenMPI I need the SSH hook, am I right?"
     The error you are getting is related to the Bash syntax of your command. If I'm understanding things correctly, the $SLURM_PROCID variable is not defined, and the -eq operator returns an error because it expects two operands. This happens because of sudo, which by default does not preserve environment variables; to preserve them you should use the -E option (see the sudo manpage for reference). I also believe that you are missing the -hostfile option to mpirun within the container, which informs the launcher of the available hosts.

     More generally, it is not necessary to use the SSH hook in conjunction with OpenMPI. The cookbook page you are referring to shows how the SSH hook could be used to enable OpenMPI communication, but there are other possibilities. As an example, if you want to run with the MPI stack from the container image, you could leverage the PMI2 process management interface, which Sarus is able to propagate into containers. You may find more information about this approach here.
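
To illustrate the point about the missing operand: when SLURM_PROCID is unset, the unquoted test [ $SLURM_PROCID -eq 0 ] collapses to [ -eq 0 ], which is what produces "unary operator expected". Below is a minimal sketch of a more defensive check. The function name is_rank_zero is hypothetical, and the -1 fallback is only an assumption chosen so that a missing variable is never mistaken for rank 0.

```shell
#!/bin/bash
# Sketch: a rank check that still behaves when SLURM_PROCID is missing,
# for example after sudo has reset the environment.

# is_rank_zero is a hypothetical helper name, not part of Sarus or Slurm.
is_rank_zero() {
    # Quote the expansion and fall back to -1 when the variable is unset or
    # empty, so that -eq always receives two integer operands.
    [ "${SLURM_PROCID:--1}" -eq 0 ]
}

unset SLURM_PROCID
if is_rank_zero; then
    echo "rank 0"
else
    echo "not rank 0 (or SLURM_PROCID is unset)"
fi
```

This only makes the test robust; the variable still has to reach the container (for instance via sudo -E) for rank 0 to be detected at all.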

NicholasRasi commented 3 years ago

Hello @Madeeks, thank you for your reply.

  1. Ok, thank you, I look forward to hearing from you soon.
  2. Yes, you are right, I was missing the -E option and the host file. By the way, if I launch:
    salloc -N 2 --cpus-per-task 60
    srun sudo -E /opt/sarus/1.3.0-Release/bin/sarus run --ssh \
        --mount=src=/home/user,dst=/home/user,type=bind \
        --mount=src=/dev/infiniband/uverbs0,dst=/dev/infiniband/uverbs0,type=bind \
        nichr/hpc-bench:v2 echo $SLURM_PROCID

    the execution hangs (while it does not without -E).
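
A side note on the command above, offered as something to check rather than as the cause of the hang: an unquoted $SLURM_PROCID on the srun command line is expanded by the submitting shell before srun starts, so each task would echo the submitting shell's value (usually empty) instead of its own. The sketch below reproduces the effect without Slurm; the per-command assignment SLURM_PROCID=3 is a stand-in, assumed here to play the role of the environment that srun gives each task.

```shell
#!/bin/bash
# Sketch: who expands $SLURM_PROCID, the calling shell or the child?
# "SLURM_PROCID=3 bash -c ..." stands in for srun setting the variable
# in the environment of a task (an assumption made for this demo).

unset SLURM_PROCID

# Double quotes: the calling shell expands the variable first. It is unset
# here, so the child simply runs "echo " and prints an empty line.
early=$(SLURM_PROCID=3 bash -c "echo $SLURM_PROCID")

# Single quotes: the text reaches the child unexpanded, and the child
# expands it using its own environment.
late=$(SLURM_PROCID=3 bash -c 'echo $SLURM_PROCID')

echo "expanded by caller: '${early}'"
echo "expanded by child:  '${late}'"
```

In the Sarus case, the equivalent would be to wrap the container command as bash -c 'echo $SLURM_PROCID', as was done in the first post.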

On the other hand, I tried to run the following bash script:

#!/bin/bash
#SBATCH --job-name=osu_sarus
#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --time=00:10:00
#SBATCH --output=res_mpi.txt
#SBATCH --err=err_mpi.txt
#SBATCH --partition=hpc

module purge
module load mpi/openmpi

mpirun --map-by node -mca pml ucx --mca btl ^vader,tcp,openib -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_IB_PKEY=$UCX_IB_PKEY \
        sudo /opt/sarus/1.3.0-Release/bin/sarus run \
        --mount=src=/dev/infiniband/uverbs0,dst=/dev/infiniband/uverbs0,type=bind \
        nichr/hpc-bench:v2 \
        /opt/benchmarks/mpiBench/mpiBench -e 1K

The execution completed giving the following result:

$ cat res_mpi.txt
START mpiBench v1.5
0 : worker1
Barrier                 Bytes:         0        Iters:     1000 Avg:      0.0061        Min:      0.0061        Max:      0.0061        Comm: MPI_COMM_WORLD    Ranks: 1
Bcast                   Bytes:         0        Iters:     1000 Avg:      0.0138        Min:      0.0138        Max:      0.0138        Comm: MPI_COMM_WORLD    Ranks: 1
...
Allgatherv              Bytes:      1024        Iters:     1000 Avg:      0.0156        Min:      0.0156        Max:      0.0156        Comm: MPI_COMM_WORLD    Ranks: 1
START mpiBench v1.5
0 : worker2
Barrier                 Bytes:         0        Iters:     1000 Avg:      0.0062        Min:      0.0062        Max:      0.0062        Comm: MPI_COMM_WORLD    Ranks: 1
Bcast                   Bytes:         0        Iters:     1000 Avg:      0.0338        Min:      0.0338        Max:      0.0338        Comm: MPI_COMM_WORLD    Ranks: 1
Bcast                   Bytes:         1        Iters:     1000 Avg:      0.0339        Min:      0.0339        Max:      0.0339        Comm: MPI_COMM_WORLD    Ranks: 1
...
Reduce                  Bytes:      1024        Iters:     1000 Avg:      0.0586        Min:      0.0586        Max:      0.0586        Comm: MPI_COMM_WORLD    Ranks: 1
Message buffers (KB):   2
END mpiBench
Message buffers (KB):   2
END mpiBench
$ cat err_mpi.txt
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            worker1
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4120

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            worker2
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4120

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   worker1
  Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   worker2
  Local device: mlx5_0
--------------------------------------------------------------------------

As far as I understand, the workers do not communicate: each instance reports Ranks: 1 instead of 2.
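
One way to confirm this from the output, sketched under the assumption that every result line carries a "Ranks: N" pair as in the listing above: collect the distinct values of N. The helper name distinct_rank_counts is hypothetical. A two-node job whose processes joined a single MPI_COMM_WORLD would be expected to print 2, whereas separate single-rank instances print 1.

```shell
#!/bin/bash
# Sketch: list the distinct "Ranks:" values reported in mpiBench output.

# distinct_rank_counts is a hypothetical helper, not part of mpiBench.
distinct_rank_counts() {
    # Scan each line for the field "Ranks:" and print the field after it,
    # then keep each numeric value once.
    awk '{ for (i = 1; i < NF; i++) if ($i == "Ranks:") print $(i + 1) }' "$1" \
        | sort -un
}

# Usage on the job output file from the batch script, if it exists.
if [ -f res_mpi.txt ]; then
    distinct_rank_counts res_mpi.txt
fi
```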

If I launch the application with srun and --mpi=pmi2:

salloc -N 2 --cpus-per-task 60
srun -N2 --mpi=pmi2 sudo /opt/sarus/1.3.0-Release/bin/sarus run \
    --mount=src=/dev/infiniband/uverbs0,dst=/dev/infiniband/uverbs0,type=bind \
    nichr/hpc-bench:v2 \
    /opt/benchmarks/mpiBench/mpiBench -e 1K

I get a similar result.

I also ran a batch script with MVAPICH2 and the Sarus MPI hook:

#!/bin/bash
#SBATCH --job-name=osu_sarus
#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
#SBATCH --time=00:10:00
#SBATCH --output=res_mpi.txt
#SBATCH --err=err_mpi.txt
#SBATCH --partition=hpc

module purge
module load mpi/mvapich2

srun sarus run --mpi \
        nichr/hpc-bench:v4 \
        /opt/benchmarks/mpiBench/mpiBench -e 1K

I did not get any error, but the workers are still separated, as in the previous result.

My cluster has MVAPICH2 2.3.4, while the version recommended in the guide is MVAPICH2 2.2. Do you think this could be a problem? Are the workers separated because Sarus is launched with sudo?

Thank you