NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

NCCL Test hang when the number of nodes goes beyond 18, and CPU usage is very high #193

Open chgdragon2023 opened 6 months ago

chgdragon2023 commented 6 months ago

I am able to run the NCCL tests on up to 17 nodes. However, the test always hangs when I go to 18 nodes. I found that the CPU usage on the master server is very high while it hangs, as shown below (all_reduce_perf reaches 804.3% CPU):

      Tasks: 2430 total,   9 running, 2421 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  3.6 us,  3.6 sy,  0.0 ni, 92.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
      MiB Mem : 2063843.+total, 1980002.+free,  46554.4 used,  37286.6 buff/cache
      MiB Swap:      0.0 total,      0.0 free,      0.0 used. 2004956.+avail Mem 

          PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
       732061 ocp_user  20   0  186.5g   2.6g 813668 S 804.3   0.1  16:30.73 all_reduce_perf 

With NCCL_DEBUG=INFO, the log shows that it hangs at this point:

  deep-leech:689168:689222 [6] NCCL INFO ncclPollProxyResponse Received new opId=0x7fa7a8f7a6d0
  deep-leech:689168:689222 [6] NCCL INFO resp.opId=0x7fa7a8f7a6d0 matches expected opId=0x7fa7a8f7a6d0
  deep-leech:689168:689222 [6] NCCL INFO sendConnect ncclPollProxyResponse opId=0x7fa7a8f7a6d0
  deep-leech:689168:689254 [4] NCCL INFO proxyProgressAsync opId=0x7fa7b0e21c90 op.type=4 op.reqBuff=0x7fa795f28e10 op.respSize=21040 done
  deep-leech:689168:689220 [4] NCCL INFO ncclPollProxyResponse Received new opId=0x7fa7b0e21c90
  deep-leech:689168:689220 [4] NCCL INFO resp.opId=0x7fa7b0e21c90 matches expected opId=0x7fa7b0e21c90
  deep-leech:689168:689220 [4] NCCL INFO sendConnect ncclPollProxyResponse opId=0x7fa7b0e21c90
  deep-leech:689168:689252 [3] NCCL INFO Allocated shareable buffer 0xc4c000000 size 10485760 ipcDesc 0x7fa79d13d440
  deep-leech:689168:689252 [3] NCCL INFO transport/net.cc:735 Cuda Host Alloc Size 532480 pointer 0xc41704000
  deep-leech:689168:689252 [3] NCCL INFO proxyProgressAsync opId=0x7fa7bcf12e98 op.type=4 op.reqBuff=0x7fa79e990ec0 op.respSize=21040 done
  deep-leech:689168:689219 [3] NCCL INFO ncclPollProxyResponse Received new opId=0x7fa7bcf12e98
  deep-leech:689168:689219 [3] NCCL INFO Queuing opId=0x7fa7bcf12e98 respBuff=0x7fa7bd14a790 respSize=21040
  deep-leech:689168:689219 [3] NCCL INFO ncclPollProxyResponse Dequeued cached opId=0x7fa7bcf12e98
  deep-leech:689168:689219 [3] NCCL INFO sendConnect ncclPollProxyResponse opId=0x7fa7bcf12e98
  deep-leech:689168:689252 [3] NCCL INFO Allocated shareable buffer 0xc4ca00000 size 9633792 ipcDesc 0x7fa79d1287e0
  deep-leech:689168:689252 [3] NCCL INFO transport/net.cc:883 Cuda Host Alloc Size 8192 pointer 0x7fb036d9c200
  deep-leech:689168:689252 [3] NCCL INFO proxyProgressAsync opId=0x7fa7bcf164b8 op.type=4 op.reqBuff=0x7fa79e773fe0 op.respSize=21040 done
  deep-leech:689168:689219 [3] NCCL INFO ncclPollProxyResponse Received new opId=0x7fa7bcf164b8
  deep-leech:689168:689219 [3] NCCL INFO resp.opId=0x7fa7bcf164b8 matches expected opId=0x7fa7bcf164b8
  deep-leech:689168:689219 [3] NCCL INFO recvConnect ncclPollProxyResponse opId=0x7fa7bcf164b8
  deep-leech:689168:689254 [4] NCCL INFO Allocated shareable buffer 0xc4d400000 size 10485760 ipcDesc 0x7fa79513e160
  deep-leech:689168:689254 [4] NCCL INFO transport/net.cc:735 Cuda Host Alloc Size 532480 pointer 0xc45504000
  deep-leech:689168:689255 [5] NCCL INFO Allocated shareable buffer 0xc4e000000 size 10485760 ipcDesc 0x7fa78913d3d0
  deep-leech:689168:689255 [5] NCCL INFO transport/net.cc:735 Cuda Host Alloc Size 532480 pointer 0xc48b04000
  deep-leech:689168:689254 [4] NCCL INFO proxyProgressAsync opId=0x7fa7b0f348c0 op.type=4 op.reqBuff=0x7fa796b0bdb0 op.respSize=21040 done
  deep-leech:689168:689220 [4] NCCL INFO ncclPollProxyResponse Received new opId=0x7fa7b0f348c0
  deep-leech:689168:689220 [4] NCCL INFO resp.opId=0x7fa7b0f348c0 matches expected opId=0x7fa7b0f348c0
  deep-leech:689168:689220 [4] NCCL INFO sendConnect ncclPollProxyResponse opId=0x7fa7b0f348c0
  deep-leech:689168:689255 [5] NCCL INFO proxyProgressAsync opId=0x7fa7cd2e8fa8 op.type=4 op.reqBuff=0x7fa78a865c70 op.respSize=21040 done
  deep-leech:689168:689221 [5] NCCL INFO ncclPollProxyResponse Received new opId=0x7fa7cd2e8fa8
  deep-leech:689168:689221 [5] NCCL INFO resp.opId=0x7fa7cd2e8fa8 matches expected opId=0x7fa7cd2e8fa8
  deep-leech:689168:689221 [5] NCCL INFO sendConnect ncclPollProxyResponse opId=0x7fa7cd2e8fa8

AddyLaddy commented 6 months ago

It's hard to see what could be wrong without seeing the whole log file. But can you check that any NCCL environment variables you are setting are being passed to all ranks, not just rank/node 0? With mpirun, that means passing them on the command line, for example -x NCCL_DEBUG=INFO.
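
A minimal sketch of that check, assuming Open MPI's mpirun (the -x flag is the one mentioned above). The helper name `mpirun_with_nccl_env`, the hostfile name `hosts.txt`, and the test arguments are placeholders for illustration, not taken from this thread. It forwards every NCCL_* variable found in the environment so that none is silently left on rank 0 only:

```python
import os
import shlex


def mpirun_with_nccl_env(base_cmd, env=None):
    """Return an mpirun argv that forwards every NCCL_* variable with -x."""
    env = os.environ if env is None else env
    argv = ["mpirun"]
    for name in sorted(env):
        if name.startswith("NCCL_"):
            # -x NAME=value exports the variable to all launched ranks.
            argv += ["-x", f"{name}={env[name]}"]
    return argv + list(base_cmd)


if __name__ == "__main__":
    # Hypothetical launch line; adjust hostfile and test arguments to your setup.
    cmd = mpirun_with_nccl_env(
        ["--hostfile", "hosts.txt", "./build/all_reduce_perf",
         "-b", "8", "-e", "128M", "-f", "2", "-g", "1"],
        env={"NCCL_DEBUG": "INFO", "HOME": "/root"},
    )
    print(shlex.join(cmd))
```

Printing the command before launching makes it easy to confirm that each variable you rely on appears after a -x.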

Another technique is to generate a log file per rank with -x NCCL_DEBUG_FILE=nccl_log.%h.%p and then look for one that looks different from, or shorter than, the rest.
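
One way to do that comparison, as a rough sketch: count the lines in each per-rank log and flag files that are much shorter than the median. The function name `find_outlier_logs` and the 0.5 threshold are arbitrary assumptions. The glob pattern assumes the nccl_log.%h.%p naming above, where NCCL expands %h to the hostname and %p to the process ID.

```python
import glob
import statistics


def find_outlier_logs(pattern="nccl_log.*", ratio=0.5):
    """Return sorted (path, line_count) pairs for logs far shorter than the median."""
    counts = {}
    for path in glob.glob(pattern):
        with open(path, errors="replace") as fh:
            counts[path] = sum(1 for _ in fh)
    if not counts:
        return []
    median = statistics.median(counts.values())
    # A rank that stalled early usually stops logging sooner than its peers.
    return sorted((p, n) for p, n in counts.items() if n < ratio * median)


if __name__ == "__main__":
    for path, lines in find_outlier_logs():
        print(f"{path}: {lines} lines")
```

A short log is only a hint. The flagged file still needs to be read to see where that rank stopped.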

chgdragon2023 commented 4 months ago

It turns out there was one "bad" server. With that server in the host list, nccl-tests can only be run with fewer than 18 nodes. After excluding it, nccl-tests can be run beyond 18 nodes.
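
The thread does not say how the bad server was identified. One possible approach, sketched here under assumptions, is a leave-one-out pass: start from a host pool that fails, rerun with each host excluded in turn, and note which exclusion makes the run pass. `find_bad_hosts` and `run_ok` are hypothetical names. `run_ok` stands for whatever launches the test on a host list with a timeout and reports success. Since the hang here only appeared at 18 nodes, the pool would need at least 19 hosts so that each reduced run still uses 18.

```python
def find_bad_hosts(hosts, run_ok):
    """Return the hosts whose exclusion turns a failing run into a passing one.

    run_ok(host_list) is a caller-supplied callable (hypothetical) that runs
    the test on host_list and returns True if it completes without hanging.
    """
    if run_ok(hosts):
        return []  # nothing to isolate: the full pool already passes
    suspects = []
    for host in hosts:
        remaining = [h for h in hosts if h != host]
        if run_ok(remaining):
            suspects.append(host)
    return suspects


if __name__ == "__main__":
    # Simulated cluster: the run hangs only when node07 is part of an 18+ node job.
    pool = [f"node{i:02d}" for i in range(19)]

    def simulated(hs):
        return not ("node07" in hs and len(hs) >= 18)

    print(find_bad_hosts(pool, simulated))
```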