baidu-research / DeepBench

Benchmarking Deep Learning operations on different hardware

Error with nccl_mpi_all_reduce on multinode system #97

Open · vilmara opened this issue 6 years ago

vilmara commented 6 years ago

Hi all,

What is the command line to run nccl_mpi_all_reduce on a multi-node system (2 nodes with 4 GPUs each)? I am getting the error below when running this command:

DeepBench/code$ mpirun -np 8 bin/nccl_mpi_all_reduce

WARNING: There is at least non-excluded one OpenFabrics device found, but there are no active ports detected (or Open MPI was unable to use them). This is most certainly not what you wanted. Check your cables, subnet manager configuration, etc. The openib BTL will be ignored for this job.

Local host: C4-1

terminate called after throwing an instance of 'std::runtime_error' what(): Failed to set cuda device

When running only with 4 ranks, I get this output:

DeepBench/code$ mpirun -np 4 bin/nccl_mpi_all_reduce

WARNING: There is at least non-excluded one OpenFabrics device found, but there are no active ports detected (or Open MPI was unable to use them). This is most certainly not what you wanted. Check your cables, subnet manager configuration, etc. The openib BTL will be ignored for this job.

Local host: C4-1

NCCL MPI AllReduce Num Ranks: 4

# of floats    bytes transferred    Avg Time (msec)    Max Time (msec)

[C4130-1:04094] 3 more processes have sent help message help-mpi-btl-openib.txt / no active ports found
[C4130-1:04094] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

100000       400000        0.148489    0.148565
3097600      12390400      2.63694     2.63695
4194304      16777216      3.57147     3.57148
6553600      26214400      5.59742     5.59744
16777217     67108868      81.9391     81.9396
38360000     153440000     32.6457     32.6462

Thanks

sharannarang commented 6 years ago

@mpatwary , can you help with this?

mpatwary commented 6 years ago

It looks like you are using the right command, and I think the problem is unrelated to nccl_mpi_all_reduce. Do the other MPI-based benchmarks, like ring_all_reduce and osu_allreduce, run well? I suspect the problem is the setup. Does any other MPI code run well on your system?
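A quick way to check the MPI setup itself is to launch a trivial command across both nodes with the same mpirun (a sketch; node1 and node2 are placeholder hostnames for the two servers and should be replaced with the actual hosts or IB addresses):

# Should print each node's hostname once if MPI can launch on both
mpirun -np 2 -H node1,node2 hostname

If this already fails, the problem is in the MPI/ssh setup rather than in DeepBench.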

vilmara commented 6 years ago

Hi @mpatwary, my system has 2 nodes, each with 4 P100 GPUs (8 GPUs total), connected using InfiniBand. I was wondering how mpirun communicates between the nodes to implement the distributed benchmark?

ring_all_reduce and osu_allreduce are throwing errors when I compile the DeepBench benchmarks:

Compilation:

make CUDA_PATH=/usr/local/cuda-9.1 CUDNN_PATH=/usr/local/cuda/include/ MPI_PATH=/home/dell/.openmpi/ NCCL_PATH=/home/$USER/.openmpi/ ARCH=sm_60

Normal outputs and errors:

mkdir -p bin
make -C nvidia
make[1]: Entering directory '/home/dell/DeepBench/code/nvidia'
mkdir -p bin
/usr/local/cuda-9.1/bin/nvcc gemm_bench.cu -DUSE_TENSOR_CORES=0 -DPAD_KERNELS=1 -o bin/gemm_bench -I ../kernels/ -I /usr/local/cuda-9.1/include -L /usr/local/cuda-9.1/lib64 -lcublas -L /usr/local/cuda-9.1/lib64 -lcurand --generate-code arch=compute_60,code=sm_60 -std=c++11
mkdir -p bin
/usr/local/cuda-9.1/bin/nvcc conv_bench.cu -DUSE_TENSOR_CORES=0 -DPAD_KERNELS=1 -o bin/conv_bench -I ../kernels/ -I /usr/local/cuda-9.1/include -I /usr/local/cuda/include//include/ -L /usr/local/cuda/include//lib64/ -L /usr/local/cuda-9.1/lib64 -lcurand -lcudnn --generate-code arch=compute_60,code=sm_60 -std=c++11
mkdir -p bin
/usr/local/cuda-9.1/bin/nvcc rnn_bench.cu -DUSE_TENSOR_CORES=0 -o bin/rnn_bench -I ../kernels/ -I /usr/local/cuda-9.1/include -I /usr/local/cuda/include//include/ -L /usr/local/cuda/include//lib64/ -L /usr/local/cuda-9.1/lib64 -lcurand -lcudnn --generate-code arch=compute_60,code=sm_60 -std=c++11
mkdir -p bin
/usr/local/cuda-9.1/bin/nvcc nccl_single_all_reduce.cu -o bin/nccl_single_all_reduce -I ../kernels/ -I /home/root/.openmpi//include/ -I /usr/local/cuda/include//include/ -L /home/root/.openmpi//lib/ -L /usr/local/cuda/include//lib64 -lnccl -lcudart -lcurand --generate-code arch=compute_60,code=sm_60 -std=c++11
mkdir -p bin
/usr/local/cuda-9.1/bin/nvcc nccl_mpi_all_reduce.cu -o bin/nccl_mpi_all_reduce -I ../kernels/ -I /home/root/.openmpi//include/ -I /usr/local/cuda/include//include/ -I /home/dell/.openmpi//include -L /home/root/.openmpi//lib/ -L /usr/local/cuda/include//lib64 -L /home/dell/.openmpi//lib -lnccl -lcurand -lcudart -lmpi --generate-code arch=compute_60,code=sm_60 -std=c++11
make[1]: Leaving directory '/home/dell/DeepBench/code/nvidia'
cp nvidia/bin/* bin
rm -rf nvidia/bin
mkdir -p bin
make -C osu_allreduce
make[1]: Entering directory '/home/dell/DeepBench/code/osu_allreduce'
mkdir -p bin
gcc -o bin/osu_coll.o -c -O2 -pthread -Wall -march=native -I/usr/local/cuda-9.1/include -I/home/dell/.openmpi//include osu_coll.c
gcc -o bin/osu_allreduce.o -c -O2 -pthread -Wall -march=native -I ../kernels/ -I/usr/local/cuda-9.1/include -I/home/dell/.openmpi//include osu_allreduce.c
gcc -o bin/osu_allreduce -pthread -Wl,--enable-new-dtags -Wl,-rpath=/usr/local/cuda-9.1/lib64 -Wl,-rpath=/home/dell/.openmpi//lib bin/osu_allreduce.o bin/osu_coll.o -L/usr/local/cuda-9.1/lib64 -L/home/dell/.openmpi//lib -lstdc++ -lmpi_cxx -lmpi -lcuda
/usr/bin/ld: cannot find -lmpi_cxx
collect2: error: ld returned 1 exit status
Makefile:17: recipe for target 'build' failed
make[1]: [build] Error 1
make[1]: Leaving directory '/home/dell/DeepBench/code/osu_allreduce'
Makefile:6: recipe for target 'osu_allreduce' failed
make: [osu_allreduce] Error 2
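The actual failure here is the linker error /usr/bin/ld: cannot find -lmpi_cxx. libmpi_cxx comes from the MPI C++ bindings, which recent Open MPI releases no longer build by default, so a possible workaround (a sketch, not an official fix) is either of the following:

# Option 1: rebuild Open MPI from source with the C++ bindings enabled
#           (assumes an Open MPI source tree and the install prefix used above)
./configure --prefix=/home/dell/.openmpi --enable-mpi-cxx
make -j && make install

# Option 2: remove -lmpi_cxx from code/osu_allreduce/Makefile if it is listed
#           there; the osu_allreduce sources are plain C and link against -lmpi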

I have recompiled and run it again with 4 and 8 GPUs, but now I get another error, shown below:

mpirun --allow-run-as-root -np 8 bin/nccl_mpi_all_reduce

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

bin/nccl_mpi_all_reduce: error while loading shared libraries: libmpi.so.40: cannot open shared object file: No such file or directory
(the same message is printed once per rank, 8 times in total)

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[41026,1],0] Exit code: 127

mpatwary commented 6 years ago

It looks like the binary is not finding the MPI library directory at run time. You can try exporting that path.
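For example (a sketch, assuming Open MPI is installed under /home/dell/.openmpi as in the compile command above; adjust the path to your installation):

# Make libmpi.so.40 visible to the launched ranks
export LD_LIBRARY_PATH=/home/dell/.openmpi/lib:$LD_LIBRARY_PATH

# -x forwards the variable to ranks started on remote nodes as well
mpirun --allow-run-as-root -np 8 -x LD_LIBRARY_PATH bin/nccl_mpi_all_reduce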

vilmara commented 6 years ago

@mpatwary, thanks for your prompt reply. I exported that path and got other errors. My system has 2 nodes, each with 4 P100 GPUs (8 GPUs total), connected using InfiniBand, and I was wondering how mpirun communicates between the nodes to implement the distributed benchmark. It looks like the command mpirun --allow-run-as-root -np 8 bin/nccl_mpi_all_reduce only considers the host node; my understanding is that mpirun should receive the -H flag with the IB addresses of both servers (I tried this option but got errors too). Can you share the command line you have used to run DeepBench nccl_mpi_all_reduce on multi-node, multi-GPU systems?
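For reference, a typical multi-node launch for a setup like this (a sketch, not a command confirmed by the DeepBench maintainers; node1/node2 are placeholder hostnames or IB addresses, and it assumes password-less SSH and identical DeepBench, CUDA and MPI paths on both nodes) would be:

# 4 slots per node so that the 8 ranks are split evenly, one per GPU
echo "node1 slots=4" >  hosts.txt
echo "node2 slots=4" >> hosts.txt

mpirun --allow-run-as-root -np 8 --hostfile hosts.txt -x LD_LIBRARY_PATH bin/nccl_mpi_all_reduce

# equivalent, without a hostfile:
# mpirun --allow-run-as-root -np 8 -H node1:4,node2:4 -x LD_LIBRARY_PATH bin/nccl_mpi_all_reduce

mpirun starts the remote ranks over SSH (or a resource manager), while the benchmark traffic itself goes through MPI/NCCL and can use the InfiniBand fabric.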

Here is the error I am getting using just the 4 GPUs of the host server:

mpirun --allow-run-as-root -np 4 bin/nccl_mpi_all_reduce

An error occurred in MPI_Init on a NULL communicator
MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, and potentially your MPI job)
[host-P100-2:10830] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[36721,1],0] Exit code: 1

laserrapt0r commented 3 years ago

I have a problem here as well. The normal single version works fine, and all other MPI applications are working. But I get this error:

NCCL MPI AllReduce Num Ranks: 2

# of floats    bytes transferred    Avg Time (msec)    Max Time (msec)

terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL failure: unhandled cuda error in nccl_mpi_all_reduce.cu at line: 86 rank: 0

[jetson-3:29969] Process received signal
[jetson-3:29969] Signal: Aborted (6)
[jetson-3:29969] Signal code: (-6)
what(): NCCL failure: unhandled cuda error in nccl_mpi_all_reduce.cu at line: 86 rank: 1

[jetson-2:08669] Process received signal
[jetson-2:08669] Signal: Aborted (6)
[jetson-2:08669] Signal code: (-6)
[jetson-2:08669] End of error message
[jetson-3:29969] End of error message

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpiexec noticed that process rank 0 with PID 0 on node jetson-3 exited on signal 6 (Aborted).
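A common first step for an "unhandled cuda error" reported by NCCL is to rerun with NCCL's debug logging enabled and look at the first error it prints (a sketch; NCCL_DEBUG=INFO is a standard NCCL environment variable, and -x forwards it to the remote rank):

mpirun -np 2 -H jetson-2,jetson-3 -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO bin/nccl_mpi_all_reduce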