vilmara opened this issue 6 years ago
@mpatwary, can you help with this?
It looks like you are using the right command, and I think the problem is unrelated to nccl_mpi_all_reduce. Do the other MPI implementations, like ring_all_reduce and osu_allreduce, run well? I suspect the problem is in the setup. Does any other MPI code run well on your system?
Hi @mpatwary, my system has 2 nodes, each with 4 P100 GPUs (8 GPUs total), connected using InfiniBand. I was wondering how mpirun communicates between the nodes to run the distributed benchmark?
ring_all_reduce and osu_allreduce are throwing errors when I compile the DeepBench benchmarks:
Compilation:
make CUDA_PATH=/usr/local/cuda-9.1 CUDNN_PATH=/usr/local/cuda/include/ MPI_PATH=/home/dell/.openmpi/ NCCL_PATH=/home/$USER/.openmpi/ ARCH=sm_60
Normal outputs and errors:

mkdir -p bin
make -C nvidia
make[1]: Entering directory '/home/dell/DeepBench/code/nvidia'
mkdir -p bin
/usr/local/cuda-9.1/bin/nvcc gemm_bench.cu -DUSE_TENSOR_CORES=0 -DPAD_KERNELS=1 -o bin/gemm_bench -I ../kernels/ -I /usr/local/cuda-9.1/include -L /usr/local/cuda-9.1/lib64 -lcublas -L /usr/local/cuda-9.1/lib64 -lcurand --generate-code arch=compute_60,code=sm_60 -std=c++11
mkdir -p bin
/usr/local/cuda-9.1/bin/nvcc conv_bench.cu -DUSE_TENSOR_CORES=0 -DPAD_KERNELS=1 -o bin/conv_bench -I ../kernels/ -I /usr/local/cuda-9.1/include -I /usr/local/cuda/include//include/ -L /usr/local/cuda/include//lib64/ -L /usr/local/cuda-9.1/lib64 -lcurand -lcudnn --generate-code arch=compute_60,code=sm_60 -std=c++11
mkdir -p bin
/usr/local/cuda-9.1/bin/nvcc rnn_bench.cu -DUSE_TENSOR_CORES=0 -o bin/rnn_bench -I ../kernels/ -I /usr/local/cuda-9.1/include -I /usr/local/cuda/include//include/ -L /usr/local/cuda/include//lib64/ -L /usr/local/cuda-9.1/lib64 -lcurand -lcudnn --generate-code arch=compute_60,code=sm_60 -std=c++11
mkdir -p bin
/usr/local/cuda-9.1/bin/nvcc nccl_single_all_reduce.cu -o bin/nccl_single_all_reduce -I ../kernels/ -I /home/root/.openmpi//include/ -I /usr/local/cuda/include//include/ -L /home/root/.openmpi//lib/ -L /usr/local/cuda/include//lib64 -lnccl -lcudart -lcurand --generate-code arch=compute_60,code=sm_60 -std=c++11
mkdir -p bin
/usr/local/cuda-9.1/bin/nvcc nccl_mpi_all_reduce.cu -o bin/nccl_mpi_all_reduce -I ../kernels/ -I /home/root/.openmpi//include/ -I /usr/local/cuda/include//include/ -I /home/dell/.openmpi//include -L /home/root/.openmpi//lib/ -L /usr/local/cuda/include//lib64 -L /home/dell/.openmpi//lib -lnccl -lcurand -lcudart -lmpi --generate-code arch=compute_60,code=sm_60 -std=c++11
make[1]: Leaving directory '/home/dell/DeepBench/code/nvidia'
cp nvidia/bin/* bin
rm -rf nvidia/bin
mkdir -p bin
make -C osu_allreduce
make[1]: Entering directory '/home/dell/DeepBench/code/osu_allreduce'
mkdir -p bin
gcc -o bin/osu_coll.o -c -O2 -pthread -Wall -march=native -I/usr/local/cuda-9.1/include -I/home/dell/.openmpi//include osu_coll.c
gcc -o bin/osu_allreduce.o -c -O2 -pthread -Wall -march=native -I ../kernels/ -I/usr/local/cuda-9.1/include -I/home/dell/.openmpi//include osu_allreduce.c
gcc -o bin/osu_allreduce -pthread -Wl,--enable-new-dtags -Wl,-rpath=/usr/local/cuda-9.1/lib64 -Wl,-rpath=/home/dell/.openmpi//lib bin/osu_allreduce.o bin/osu_coll.o -L/usr/local/cuda-9.1/lib64 -L/home/dell/.openmpi//lib -lstdc++ -lmpi_cxx -lmpi -lcuda
/usr/bin/ld: cannot find -lmpi_cxx
collect2: error: ld returned 1 exit status
Makefile:17: recipe for target 'build' failed
make[1]: *** [build] Error 1
make[1]: Leaving directory '/home/dell/DeepBench/code/osu_allreduce'
Makefile:6: recipe for target 'osu_allreduce' failed
make: *** [osu_allreduce] Error 2
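For what it is worth, the libmpi.so.40 mentioned in the later error belongs to Open MPI 4.x, and recent Open MPI releases no longer build the C++ bindings (libmpi_cxx) by default, which would explain the "cannot find -lmpi_cxx" link failure here. A possible workaround, sketched under the assumption that this Open MPI was built from source with the prefix used in the make command above, is to reconfigure it with the legacy C++ bindings enabled (alternatively, dropping -lmpi_cxx from the osu_allreduce Makefile may be enough if nothing there actually uses the C++ bindings):

# Sketch only: run inside the Open MPI source tree; the prefix follows the MPI_PATH used above
./configure --prefix=/home/dell/.openmpi --enable-mpi-cxx --with-cuda=/usr/local/cuda-9.1
make -j all && make install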
I have recompiled and run it again with 4 and 8 GPUs, but now I get the error below:
mpirun --allow-run-as-root -np 8 bin/nccl_mpi_all_reduce
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
bin/nccl_mpi_all_reduce: error while loading shared libraries: libmpi.so.40: cannot open shared object file: No such file or directory
(the same message is printed once per rank, 8 times in total)
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[41026,1],0]
Exit code: 127
Looks like the code is not getting the path to the mpi lib directory. You can try exporting that.
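For example, something along these lines, assuming the .openmpi prefix from the make command above (only a sketch; adjust the path to your actual Open MPI install):

# Make the Open MPI runtime libraries (libmpi.so.40) visible to the dynamic linker
export LD_LIBRARY_PATH=/home/dell/.openmpi/lib:$LD_LIBRARY_PATH
# mpirun does not forward the environment by default; -x exports the variable to all ranks
mpirun --allow-run-as-root -np 8 -x LD_LIBRARY_PATH bin/nccl_mpi_all_reduce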
@mpatwary, thanks for your prompt reply. I exported that and got other errors. My system has 2 nodes, each with 4 P100 GPUs (8 GPUs total), connected using InfiniBand; I was wondering how mpirun communicates between the nodes to run the distributed benchmark. It looks like the command mpirun --allow-run-as-root -np 8 bin/nccl_mpi_all_reduce only considers the host node; my understanding is that mpirun should receive the -H flag with the IB addresses of both servers (I tried this option but got errors too). Can you share the command line you have used to run DeepBench nccl_mpi_all_reduce on a multi-node, multi-GPU system?
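For reference, a typical Open MPI multi-node launch of this benchmark would look roughly like the sketch below; node1 and node2 are placeholders for the actual IB hostnames or addresses, so this is not a verified DeepBench command:

# 8 ranks in total, 4 slots per node, library path forwarded to the remote node
mpirun --allow-run-as-root -np 8 \
    -H node1:4,node2:4 \
    -x LD_LIBRARY_PATH \
    bin/nccl_mpi_all_reduce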
Here is the error I get when using just the 4 GPUs of the host server:
mpirun --allow-run-as-root -np 4 bin/nccl_mpi_all_reduce
An error occurred in MPI_Init
on a NULL communicator
MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
and potentially your MPI job)
[host-P100-2:10830] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[36721,1],0]
Exit code: 1
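One thing that may be worth checking here (a guess, not a confirmed diagnosis): an MPI_Init failure like this often shows up when the mpirun found on the PATH belongs to a different MPI installation than the libmpi the binary was linked against. A quick way to compare the two:

# Which launcher is actually being picked up?
which mpirun
mpirun --version
# Which libmpi does the benchmark binary resolve at run time?
ldd bin/nccl_mpi_all_reduce | grep -i libmpi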
I have a problem here as well. The normal single version works fine, and all other MPI applications are working, but I get this:
# of floats bytes transferred Avg Time (msec) Max Time (msec)
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL failure: unhandled cuda error in nccl_mpi_all_reduce.cu at line: 86 rank: 0
[jetson-3:29969] *** Process received signal ***
[jetson-3:29969] Signal: Aborted (6)
[jetson-3:29969] Signal code: (-6)
what(): NCCL failure: unhandled cuda error in nccl_mpi_all_reduce.cu at line: 86 rank: 1
mpiexec noticed that process rank 0 with PID 0 on node jetson-3 exited on signal 6 (Aborted).
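If it helps to narrow this down, NCCL can print more detail about the underlying CUDA error before the abort; a minimal sketch (the rank count simply mirrors the two ranks shown in the error above):

# Ask NCCL to log what happens before the failure at nccl_mpi_all_reduce.cu line 86
export NCCL_DEBUG=INFO
mpiexec -np 2 -x NCCL_DEBUG bin/nccl_mpi_all_reduce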
Hi all,
What is the command line to run nccl_mpi_all_reduce on a multi-node system (2 nodes with 4 GPUs each)? I am getting the error below when I run it:
Local host: C4-1
terminate called after throwing an instance of 'std::runtime_error'
what(): Failed to set cuda device
When running only with 4 ranks, I get this output:
Local host: C4-1
NCCL MPI AllReduce Num Ranks: 4
[C4130-1:04094] 3 more processes have sent help message help-mpi-btl-openib.txt / no active ports found
[C4130-1:04094] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
100000      400000        0.148489   0.148565
3097600     12390400      2.63694    2.63695
4194304     16777216      3.57147    3.57148
6553600     26214400      5.59742    5.59744
16777217    67108868      81.9391    81.9396
38360000    153440000     32.6457    32.6462
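Since 4 ranks on a single node run cleanly, one (unverified) guess is that with 8 ranks all processes land on the first node, which has only 4 GPUs, so the extra ranks fail to set a CUDA device. A hostfile that caps each node at 4 slots might be worth trying; a sketch with placeholder hostnames:

# hostfile listing both nodes with 4 MPI slots (one per GPU) each
cat > hostfile <<'EOF'
node1 slots=4
node2 slots=4
EOF
mpirun --allow-run-as-root -np 8 --hostfile hostfile -x LD_LIBRARY_PATH bin/nccl_mpi_all_reduce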
Thanks