Mellanox / nv_peer_memory


nv_peer_mem NCCL2 nccl-tests fails with: Out of bounds values : 24 FAILED #38

Closed shijieheping closed 6 years ago

shijieheping commented 6 years ago

On two different GPU clusters, NCCL2 with nv_peer_mem fails the nccl-tests sanity checks. MVAPICH2-GDR + gdrcopy passes the tests on the same HW/SW.

This is related to an issue filed under nccl-tests: https://github.com/NVIDIA/nccl-tests/issues/7. Can anyone help? Thanks.

Configurations:

MPI: OpenMPI 1.8.8/2.1.3/3.0.1
CUDA lib: CUDA 8.0/9.0/9.1
NCCL lib: NCCL 2.0.5/2.1.15
GDR lib: nv_peer_memory master
OFED: MLNX_OFED_LINUX-4.2-1
OS: Ubuntu 16.04 / CentOS 7.4
GPU: Kepler K80 / Pascal P100
Server: Supermicro 4028-TR/4028-TR2
Topo interconnect: PIX
Driver Version: 390.30

To Reproduce

nccl-tests fail with GDR enabled:

-x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 -mca btl_openib_want_cuda_gdr 1

[15:37:58](root):~ # /root/mpi/cuda-9.0/ompi3-cuda/bin/mpirun -v --allow-run-as-root \
-x NCCL_SOCKET_IFNAME=ib0 -x NCCL_DEBUG=1 \
-x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 -mca btl_openib_want_cuda_gdr 1 \
-x LD_LIBRARY_PATH=/root/mpi/cuda-9.0/nccl_2.1.15-1+cuda9.0_x86_64/lib:/usr/local/cuda-9.0/lib64 \
-mca btl_openib_if_include mlx5_3:1 \
-np 2 -host clx-mld-45,clx-mld-46 -pernode --oversubscribe \
/root/mpi/cuda-9.0/nccl_2.1.15-1+cuda9.0_x86_64/ompi3tests/all_reduce_perf -b 9 -e 4M -g 1 -c 1 -z 0

nThread 1 nGpus 1 minBytes 9 maxBytes 4194304 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1 
# NCCL Tests compiled with NCCL 2.1
# Using devices
#   Rank  0 on clx-mld-45 device  0 [0x04] Tesla P100-PCIE-16GB

#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
#   Rank  1 on clx-mld-46 device  0 [0x04] Tesla P100-PCIE-16GB
           8             2   float     sum    0.144   0.00   0.00    0e+00    0.015   0.00   0.00    0e+00
     1048584        262146   float     sum    0.212   4.95   4.95    2e+00    0.209   5.02   5.02    2e+00
     2097160        524290   float     sum    0.379   5.53   5.53    2e+00    0.379   5.53   5.53    2e+00
     3145736        786434   float     sum    0.549   5.73   5.73    2e+00    0.548   5.74   5.74    2e+00
 Out of bounds values : 24 FAILED
 Avg bus bandwidth    : 4.06216 

-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[2940,1],0]
  Exit code:    1
--------------------------------------------------------------------------

nccl-tests OK, with GDR disabled:

-x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=0 -mca btl_openib_want_cuda_gdr 1

[15:50:24](root):~/mpi # /root/mpi/cuda-9.0/ompi3-cuda/bin/mpirun -v --allow-run-as-root \
-x NCCL_SOCKET_IFNAME=ib0 -x NCCL_DEBUG=1 \
-x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=0 -mca btl_openib_want_cuda_gdr 1 \
-x LD_LIBRARY_PATH=/root/mpi/cuda-9.0/nccl_2.1.15-1+cuda9.0_x86_64/lib:/usr/local/cuda-9.0/lib64 \
-mca btl_openib_if_include mlx5_3:1 \
-np 2 -host clx-mld-45,clx-mld-46 -pernode --oversubscribe \
/root/mpi/cuda-9.0/nccl_2.1.15-1+cuda9.0_x86_64/ompi3tests/all_reduce_perf -b 9 -e 4M -g 1 -c 1 -z 0

nThread 1 nGpus 1 minBytes 9 maxBytes 4194304 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1 
# NCCL Tests compiled with NCCL 2.1
# Using devices
#   Rank  0 on clx-mld-45 device  0 [0x04] Tesla P100-PCIE-16GB

#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
#   Rank  1 on clx-mld-46 device  0 [0x04] Tesla P100-PCIE-16GB
           8             2   float     sum    0.087   0.00   0.00    0e+00    0.018   0.00   0.00    0e+00
     1048584        262146   float     sum    0.396   2.65   2.65    0e+00    0.394   2.66   2.66    0e+00
     2097160        524290   float     sum    0.772   2.72   2.72    0e+00   25.292   0.08   0.08    0e+00
     3145736        786434   float     sum   27.539   0.11   0.11    0e+00   69.042   0.05   0.05    0e+00
 Out of bounds values : 0 OK
 Avg bus bandwidth    : 1.03398 

To build the faulty OpenMPI environment:

OpenMPI

cd /root/mpi/cuda-8.0/ompi3.0.1 && \
    rm -fr /root/mpi/cuda-8.0/ompi3.0.1/* && git checkout v3.0.1 && git reset --hard && \
    ./autogen.pl && \
    CC=/usr/bin/gcc CXX=/usr/bin/g++ FC=/usr/bin/gfortran ./configure --with-verbs --with-cuda=/usr/local/cuda-8.0 --prefix=/root/mpi/cuda-8.0/ompi3-cuda && \
    time make -j $(nproc) install

nccl-tests

cd /root/mpi/cuda-9.1/git/nccl-tests && \
    make MPI=1 NCCL_HOME=/root/mpi/cuda-9.1/nccl_2.1.15-1+cuda9.1_x86_64 CUDA_HOME=/usr/local/cuda-9.1 MPI_HOME=/root/mpi/cuda-9.1/ompi1-cuda DST_DIR=/root/mpi/cuda-9.1/nccl_2.1.15-1+cuda9.1_x86_64/ompi1tests -j $(nproc) && \
    make MPI=1 NCCL_HOME=/root/mpi/cuda-9.1/nccl_2.1.15-1+cuda9.1_x86_64 CUDA_HOME=/usr/local/cuda-9.1 MPI_HOME=/root/mpi/cuda-9.1/ompi2-cuda DST_DIR=/root/mpi/cuda-9.1/nccl_2.1.15-1+cuda9.1_x86_64/ompi2tests -j $(nproc) && \
    make MPI=1 NCCL_HOME=/root/mpi/cuda-9.1/nccl_2.1.15-1+cuda9.1_x86_64 CUDA_HOME=/usr/local/cuda-9.1 MPI_HOME=/root/mpi/cuda-9.1/ompi3-cuda DST_DIR=/root/mpi/cuda-9.1/nccl_2.1.15-1+cuda9.1_x86_64/ompi3tests -j $(nproc)
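Before running the GDR-enabled reproduction, it is worth confirming the nv_peer_mem kernel module is actually loaded (on a live system: `lsmod | grep nv_peer_mem`). A minimal sketch of that check, fed canned lsmod-style output so it is self-contained; the module names, sizes, and use counts in the sample listing are illustrative only:

```shell
#!/bin/sh
# Hedged sketch: look for a module name in lsmod-style output.
# On a live system you would pipe real output: lsmod | module_loaded nv_peer_mem
module_loaded() {
    # stdin: lsmod-style lines; $1: module name (first column)
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Illustrative listing; values are made up.
sample='nv_peer_mem 16384 0
ib_core 212992 8'

printf '%s\n' "$sample" | module_loaded nv_peer_mem && echo "nv_peer_mem loaded"
printf '%s\n' "$sample" | module_loaded gdrdrv || echo "gdrdrv not loaded"
```

If the module is missing, GPUDirect RDMA is unavailable regardless of the NCCL_IB_CUDA_SUPPORT setting.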

The same HW/SW and tests work properly with MVAPICH2-GDR + gdrcopy.

nccl-tests OK, with GDR enabled:

-genv NCCL_IB_DISABLE=0 -genv NCCL_IB_CUDA_SUPPORT=1

[16:06:47](root):~/mpi # /opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.0/mofed4.2/mpirun/gnu4.8.5/bin/mpirun \
-genv LD_LIBRARY_PATH=/root/mpi/cuda-9.0/nccl_2.0.5-3+cuda9.0_amd64/lib:/usr/local/cuda-9.0/lib64:/opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.0/mofed4.2/mpirun/gnu4.8.5/lib64 \
-genv MV2_GPUDIRECT_GDRCOPY_LIB=/root/mpi/cuda-9.0/gdr/lib64/libgdrapi.so \
-genv GDRCOPY_ENABLE_LOGGING=1 -genv GDRCOPY_LOG_LEVEL=5 -genv MV2_USE_GPUDIRECT=1 \
-genv NCCL_IB_DISABLE=0 -genv NCCL_IB_CUDA_SUPPORT=1 -genv NCCL_DEBUG=0 -genv NCCL_SOCKET_IFNAME=enp5s0f0 \
-np 2 -host clx-mld-45,clx-mld-46  /root/mpi/cuda-9.0/nccl_2.0.5-3+cuda9.0_amd64/mvapich2tests/all_reduce_perf -b 9 -e 4M -g 4 -c 1 -z 0

nThread 1 nGpus 4 minBytes 9 maxBytes 4194304 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1 
# NCCL Tests compiled with NCCL 2.0
# Using devices
#   Rank  0 on clx-mld-45 device  0 [0x04] Tesla P100-PCIE-16GB
#   Rank  1 on clx-mld-45 device  1 [0x06] Tesla P100-PCIE-16GB
#   Rank  2 on clx-mld-45 device  2 [0x07] Tesla P100-PCIE-16GB
#   Rank  3 on clx-mld-45 device  3 [0x08] Tesla P100-PCIE-16GB
#   Rank  4 on clx-mld-46 device  0 [0x04] Tesla P100-PCIE-16GB
#   Rank  5 on clx-mld-46 device  1 [0x06] Tesla P100-PCIE-16GB
#   Rank  6 on clx-mld-46 device  2 [0x07] Tesla P100-PCIE-16GB

#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
#   Rank  7 on clx-mld-46 device  3 [0x08] Tesla P100-PCIE-16GB
           8             2   float     sum    0.149   0.00   0.00    0e+00    0.151   0.00   0.00    0e+00
     1048584        262146   float     sum    0.308   3.41   5.96    1e-06    0.304   3.45   6.04    1e-06
     2097160        524290   float     sum    0.491   4.27   7.48    1e-06    0.486   4.32   7.56    1e-06
     3145736        786434   float     sum    0.678   4.64   8.12    1e-06    0.678   4.64   8.12    1e-06
 Out of bounds values : 0 OK
 Avg bus bandwidth    : 5.40981 

To build the working MVAPICH2 environment:

gdrcopy

cd /root/mpi/cuda-8.0/git/gdrcopy && \
    make PREFIX=/root/mpi/cuda-8.0/gdr CUDA=/usr/local/cuda-8.0 -j $(nproc) all install
cd /root/mpi/cuda-9.0/git/gdrcopy && \
    make PREFIX=/root/mpi/cuda-9.0/gdr CUDA=/usr/local/cuda-9.0 -j $(nproc) all install

nccl-tests

cd /root/mpi/cuda-9.0/git/nccl-tests && \
    make MPI=1 NCCL_HOME=/root/mpi/cuda-9.0/nccl_2.1.15-1+cuda9.0_x86_64 CUDA_HOME=/usr/local/cuda-9.0 MPI_HOME=/opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.0/mofed4.2/mpirun/gnu4.8.5 LIBRARY_PATH=/opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.0/mofed4.2/mpirun/gnu4.8.5/lib64 DST_DIR=/root/mpi/cuda-9.0/nccl_2.1.15-1+cuda9.0_x86_64/mvapich2tests -j $(nproc)
drossetti commented 6 years ago

This has been recently identified as a bug in the 390 GPU driver series. Please try downgrading to a GPU driver older than 390.19. We are working on a driver fix as well as a work-around in a future NCCL release.
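The affected range can be checked mechanically. A minimal sketch, assuming the bounds stated in this thread (390.19 affected, 390.56 fixed) and GNU `sort -V` for version ordering; on a live system the version string would come from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`:

```shell
#!/bin/sh
# Hedged sketch: report whether a given NVIDIA driver version falls in the
# range this thread identifies as broken (>= 390.19 and < 390.56).
affected() {
    v="$1"
    # sort -V gives natural version ordering; head -n1 picks the smaller.
    lo=$(printf '%s\n%s\n' 390.19 "$v" | sort -V | head -n1)
    hi=$(printf '%s\n%s\n' "$v" 390.56 | sort -V | head -n1)
    [ "$lo" = "390.19" ] && [ "$hi" = "$v" ] && [ "$v" != "390.56" ]
}

affected 390.30 && echo "390.30: affected, downgrade below 390.19 or upgrade to 390.56+"
affected 390.56 || echo "390.56: has the fix"
```

The reporter's Driver Version 390.30 lands inside that range, which matches the failure seen above.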

drossetti commented 6 years ago

GPU driver 390.56 should have the required fix.

drossetti commented 6 years ago

@ferasd I guess you can close this one

ferasd commented 6 years ago

Thanks.