shijieheping closed this issue 6 years ago
This has been recently identified as a bug in the 390 GPU driver series. Please try downgrading to a GPU driver older than 390.19. We are working on a driver fix as well as a work-around in a future NCCL release.
GPU driver 390.56 should have the required fix.
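A quick way to tell whether a node falls in the problematic range is to compare its driver version against the two numbers mentioned in this thread. The sketch below is only an illustration: the bounds (affected from 390.19 up to, but not including, 390.56) are an assumption read from the comments above, not an official compatibility list, and the helper names are hypothetical.

```python
# Assumed bounds, taken from the comments in this thread (not an official list):
# drivers older than 390.19 are suggested as a workaround, 390.56 should carry the fix.
FIRST_AFFECTED = (390, 19)
FIRST_FIXED = (390, 56)


def parse_driver_version(text):
    """Turn a version string such as '390.30' into a tuple of ints, e.g. (390, 30)."""
    return tuple(int(part) for part in text.strip().split("."))


def is_affected(version_text):
    """Return True if the version lies in the assumed affected range."""
    version = parse_driver_version(version_text)
    return FIRST_AFFECTED <= version < FIRST_FIXED


if __name__ == "__main__":
    for candidate in ("384.111", "390.30", "390.56"):
        status = "possibly affected" if is_affected(candidate) else "outside the assumed range"
        print(candidate, status)
```

The version string itself can be read on each node with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed to `is_affected`.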
@ferasd I guess you can close this one
Thanks.
On two different GPU clusters, NCCL2 with nv_peer_mem fails the nccl-tests sanity tests. MVAPICH2-GDR + gdrcopy passes the tests on the same HW/SW.
This is related to the nccl-tests issue https://github.com/NVIDIA/nccl-tests/issues/7. Can anyone help? Thanks.
Configurations:
MPI: OpenMPI 1.8.8/2.1.3/3.0.1
CUDA lib: CUDA 8.0/9.0/9.1
NCCL lib: NCCL 2.0.5/2.1.15
GDR lib: nv_peer_memory master
OFED: MLNX_OFED_LINUX-4.2-1
OS: Ubuntu1604/CentOS7.4
GPU: Kepler K80/Pascal P100
Server: Supermicro 4028-TR/4028-TR2
Topo interconnect: PIX
Driver Version: 390.30
To Reproduce
nccl-tests fail with GDR enabled:
-x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 -mca btl_openib_want_cuda_gdr 1
nccl-tests OK, with GDR disabled:
-x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=0 -mca btl_openib_want_cuda_gdr 1
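To compare the two cases side by side, the flag sets above can be wrapped into complete `mpirun` command lines. The following is a rough sketch under stated assumptions: the test binary (`./build/all_reduce_perf`), the process count, and the hostfile name are placeholders that do not come from the report, and `build_command` is a hypothetical helper. Only the `-x` and `-mca` flags are taken from the lines above.

```python
import shlex

# Flag sets copied from the report above.
GDR_ENABLED = ["-x", "NCCL_IB_DISABLE=0", "-x", "NCCL_IB_CUDA_SUPPORT=1",
               "-mca", "btl_openib_want_cuda_gdr", "1"]
GDR_DISABLED = ["-x", "NCCL_IB_DISABLE=0", "-x", "NCCL_IB_CUDA_SUPPORT=0",
                "-mca", "btl_openib_want_cuda_gdr", "1"]


def build_command(gdr_flags, test_binary="./build/all_reduce_perf",
                  num_procs=2, hostfile="hosts.txt"):
    """Assemble an mpirun argument list.

    test_binary, num_procs and hostfile are assumptions for illustration;
    the report does not say which values were used.
    """
    return ["mpirun", "-np", str(num_procs), "--hostfile", hostfile,
            *gdr_flags, test_binary]


def to_shell(args):
    """Render an argument list as a single shell-quoted line."""
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    print("GDR enabled: ", to_shell(build_command(GDR_ENABLED)))
    print("GDR disabled:", to_shell(build_command(GDR_DISABLED)))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; the resulting lines can then be pasted into a shell on the cluster and adjusted to the local setup.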
To build the failing OpenMPI environment:
OpenMPI
nccl-tests
The same HW/SW and tests work properly with MVAPICH2-GDR + gdrcopy.
nccl-tests OK, with GDR enabled:
-genv NCCL_IB_DISABLE=0 -genv NCCL_IB_CUDA_SUPPORT=1
To build the working MVAPICH2 environment:
gdrcopy
nccl-tests