NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Rank Assignment Issue under four containers on two different servers. #218

Closed thsmfe001 closed 2 months ago

thsmfe001 commented 2 months ago

I ran into a rank assignment issue while running the NCCL tests across four containers. My test environment is two servers with two GPUs per server. When I ran the command below, the rank assignment was not what I intended: I expected ranks 0 to 7 to be assigned in a distributed manner across the containers, but I got the output below. Could you suggest a way to solve this problem?

root@c5e62fb2396d:/workspace# cat rankfile
rank 0=10.10.10.2 slot=0
rank 1=10.10.11.2 slot=0
rank 2=10.10.20.2 slot=0
rank 3=10.10.21.2 slot=0
rank 4=10.10.10.2 slot=1
rank 5=10.10.11.2 slot=1
rank 6=10.10.20.2 slot=1
rank 7=10.10.21.2 slot=1
root@c5e62fb2396d:/workspace# mpirun -np 4 -allow-run-as-root -host 10.10.10.2,10.10.11.2,10.10.20.2,10.10.21.2 -rf rankfile /workspace/software/nccl-tests-master/build/all_reduce_perf -b 8 -e 128M -f 2 -g 2

WARNING: Open MPI tried to bind a process but failed. This is a warning only; your job will continue, though performance may be degraded.

Local host: c5e62fb2396d
Application name: /workspace/software/nccl-tests-master/build/all_reduce_perf
Error message: failed to bind memory
Location: rtc_hwloc.c:447


# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 3077 on a1eb0914a5e2 device 0 [0x61] NVIDIA L40
# Rank 1 Group 0 Pid 3077 on a1eb0914a5e2 device 1 [0xe1] NVIDIA L40
# Rank 0 Group 0 Pid 4482 on cb0142391811 device 0 [0x61] NVIDIA L40
# Rank 1 Group 0 Pid 4482 on cb0142391811 device 1 [0xe1] NVIDIA L40
# Rank 0 Group 0 Pid 1176 on c5e62fb2396d device 0 [0xca] NVIDIA L40
# Rank 1 Group 0 Pid 1176 on c5e62fb2396d device 1 [0xe1] NVIDIA L40
# Rank 0 Group 0 Pid 3860 on 877a4a03d442 device 0 [0xca] NVIDIA L40
# Rank 1 Group 0 Pid 3860 on 877a4a03d442 device 1 [0xe1] NVIDIA L40

AddyLaddy commented 2 months ago

It doesn't look like the nccl-tests were compiled with MPI=1
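One quick way to check this is to look at the shared libraries the test binary depends on. The sketch below is only a heuristic: it assumes the MPI library is linked dynamically and shows up under a name containing `libmpi` (typical for Open MPI). The helper name `is_mpi_linked` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given binary is dynamically linked
# against a library whose name contains "libmpi".
is_mpi_linked() {
    # ldd lists shared-library dependencies; errors (missing file,
    # static binary) are silenced and simply count as "not linked".
    ldd "$1" 2>/dev/null | grep -qi 'libmpi'
}

# Path taken from the command in the original report; override with BIN=...
BIN=${BIN:-/workspace/software/nccl-tests-master/build/all_reduce_perf}

if is_mpi_linked "$BIN"; then
    echo "MPI-enabled: $BIN"
else
    echo "no libmpi dependency found (or binary missing): $BIN"
fi
```

A binary built with a plain `make` would be expected to fall into the second branch, while one built with `make MPI=1 ...` should list an MPI library.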

thsmfe001 commented 2 months ago

I just followed the instructions on the README page: I downloaded the source and ran the make command. Do you mean I need to recompile with the command below?

make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl

AddyLaddy commented 2 months ago

Yes, it looks like you're trying to run binaries that are not MPI enabled so you just end up with 4 processes each with 2 GPUs.
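To illustrate the difference, here is a small sketch of the rank numbering. It assumes one thread per process (`nThread 1`, as in the log above) and that ranks are numbered process by process, i.e. rank = process index × GPUs per process + device index. That formula is an assumption about nccl-tests made for illustration, and `print_ranks` is a made-up helper.

```shell
#!/bin/sh
# Hypothetical helper: print the rank each (process, device) pair would get,
# assuming process-major numbering with one thread per process.
print_ranks() {
    nprocs=$1   # number of processes that know about each other
    ngpus=$2    # value passed to -g
    proc=0
    while [ "$proc" -lt "$nprocs" ]; do
        gpu=0
        while [ "$gpu" -lt "$ngpus" ]; do
            echo "process $proc device $gpu -> rank $((proc * ngpus + gpu))"
            gpu=$((gpu + 1))
        done
        proc=$((proc + 1))
    done
}

echo "Without MPI support (each process is a job of its own):"
print_ranks 1 2

echo "With MPI support, -np 4 -g 2 (one job spanning four processes):"
print_ranks 4 2
```

The first call reproduces the "Rank 0 / Rank 1" pair that every container printed in the report. The second shows the eight ranks a single MPI-aware job would cover under the stated assumption.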

thsmfe001 commented 2 months ago

Thank you so much. I will try it and post the result.

thsmfe001 commented 2 months ago

I succeeded in recompiling with the MPI options, but then I got the error messages below. From my testing of the recompiled build, -np 1 works properly on any host, but two or more processes (-np 2 and up) lead to an error. I think it is caused by MPI communication. Could you check the error log below?

root@c5e62fb2396d:/workspace# mpirun -np 4 -allow-run-as-root -host 10.10.10.2,10.10.11.2,10.10.20.2,10.10.21.2 /workspace/software/nccl-tests-master/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
[1716868653.124073] [c5e62fb2396d:1618 :0] tl_ucp_ep.c:43 TL_UCP ERROR ucp returned connect error: Endpoint timeout
[1716868653.124093] [c5e62fb2396d:1618 :0] tl_ucp_ep.h:79 TL_UCP ERROR failed to connect team ep
[1716868653.124098] [c5e62fb2396d:1618 :0] tl_ucp_sendrecv.h:108 TL_UCP ERROR tag 32760; dest 1; team_id 0; errmsg No pending message
[c5e62fb2396d:1618 :0:1618] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffffffffefb)
[1716868653.124031] [877a4a03d442:1142 :0] tl_ucp_ep.c:43 TL_UCP ERROR ucp returned connect error: Endpoint timeout
[1716868653.124050] [877a4a03d442:1142 :0] tl_ucp_ep.h:79 TL_UCP ERROR failed to connect team ep
[1716868653.124056] [877a4a03d442:1142 :0] tl_ucp_sendrecv.h:108 TL_UCP ERROR tag 32760; dest 0; team_id 0; errmsg No pending message
[1716868653.119230] [cb0142391811:1139 :0] tl_ucp_ep.c:43 TL_UCP ERROR ucp returned connect error: Endpoint timeout
[1716868653.119252] [cb0142391811:1139 :0] tl_ucp_ep.h:79 TL_UCP ERROR failed to connect team ep
[1716868653.119258] [cb0142391811:1139 :0] tl_ucp_sendrecv.h:108 TL_UCP ERROR tag 32760; dest 3; team_id 0; errmsg No pending message
[877a4a03d442:1142 :0:1142] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffffffffefb)
[cb0142391811:1139 :0:1139] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffffffffefb)
[1716868653.119305] [a1eb0914a5e2:1167 :0] tl_ucp_ep.c:43 TL_UCP ERROR ucp returned connect error: Endpoint timeout
[1716868653.119324] [a1eb0914a5e2:1167 :0] tl_ucp_ep.h:79 TL_UCP ERROR failed to connect team ep
[1716868653.119329] [a1eb0914a5e2:1167 :0] tl_ucp_sendrecv.h:108 TL_UCP ERROR tag 32760; dest 2; team_id 0; errmsg No pending message
[a1eb0914a5e2:1167 :0:1167] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffffffffefb)

AddyLaddy commented 2 months ago

It looks like you are having issues running MPI jobs. Perhaps get a simple "hello world" MPI program working first before attempting to run the NCCL tests. With a UCX-based MPI, I often find that export UCX_TLS=tcp resolves most issues. You may also need to select the correct UCX device with UCX_NET_DEVICES.
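A minimal sketch of that sanity check follows, assuming Open MPI's `mpicc` and `mpirun` are installed. The file name `mpi_hello.c` and the `RUN_MPI` switch are invented for this example. `UCX_NET_DEVICES` is forwarded only if it is already set in the environment, since the right device name depends on the machine. The script prints the launch command by default and only compiles and runs when `RUN_MPI=1`.

```shell
#!/bin/sh
# Write a minimal MPI "hello world" in C. The barrier forces the ranks to
# actually communicate, which is the step that failed in the log above.
cat > mpi_hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Barrier(MPI_COMM_WORLD);
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

# Restrict UCX to the TCP transport, as suggested above.
export UCX_TLS=tcp

# Forward UCX_NET_DEVICES only when the caller has set it (no default is
# assumed here because interface names differ between machines).
EXTRA=""
if [ -n "${UCX_NET_DEVICES:-}" ]; then
    EXTRA="-x UCX_NET_DEVICES"
fi

# Host list copied from the original report. The compiled binary must be
# present at the same path in every container.
HOSTS=10.10.10.2,10.10.11.2,10.10.20.2,10.10.21.2
RUN_CMD="mpirun -np 4 -allow-run-as-root -host $HOSTS -x UCX_TLS $EXTRA ./mpi_hello"

if [ "${RUN_MPI:-0}" = 1 ] && command -v mpicc >/dev/null 2>&1; then
    mpicc -o mpi_hello mpi_hello.c && $RUN_CMD
else
    echo "dry run: $RUN_CMD"
fi
```

If every rank prints its greeting, basic MPI communication works and the same `-x UCX_TLS` option can be added to the `all_reduce_perf` command line.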

thsmfe001 commented 2 months ago

Thank you for your quick feedback. I have just recompiled with all the options for the make command from the README page:

make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/lib/x86_64-linux-gnu

I will test again with the new build, and if I face the same issue I will apply your recommendation. I'll update you with the result. Thank you.

thsmfe001 commented 2 months ago

Thank you for your support. After recompiling and applying UCX_TLS=tcp, the test completed successfully. I really appreciate your quick support.