NVIDIA / nccl-tests

NCCL Tests

Running in kubernetes pods Error #248

Closed drikster80 closed 1 month ago

drikster80 commented 2 months ago

PROBLEM: I'm attempting to run nccl-tests from within Kubernetes pods in order to test RDMA/GPUDirect performance across Kubernetes.

ENVIRONMENT:
- Physical nodes: GH200 (Grace Hopper / arm64)
- NICs: BlueField-3
- Vanilla Kubernetes 1.31
- GPU Operator & NVIDIA Network Operator installed and set up

I'm using the NVIDIA PyTorch container, which has NCCL and MPI pre-installed: nvcr.io/nvidia/pytorch:24.08-py3

Compiled with:

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/local/mpi/ CUDA_HOME=/usr/local/cuda/
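(Before building, a quick sanity check that MPI_HOME and CUDA_HOME point at real installs inside this image can save a confusing build failure; this is just a sketch assuming the default layout of the 24.08-py3 container:)

ls /usr/local/mpi/bin /usr/local/cuda/bin
mpirun --version
nvcc --version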

After setting up SSH between the pods and verifying connectivity, launching with:

mpirun -np 2 -host localhost,192.168.232.219 --allow-run-as-root $PWD/build/sendrecv_perf -b 8 -e 1024M -f 2 -g 1
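(If more verbose NCCL output is needed, the same launch can forward debug variables to both ranks; -x is Open MPI's flag for exporting environment variables, and the NCCL_IB_HCA filter shown here is only an example, not part of my actual run:)

mpirun -np 2 -host localhost,192.168.232.219 --allow-run-as-root \
    -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,NET \
    -x NCCL_IB_HCA=mlx5_0,mlx5_1 \
    $PWD/build/sendrecv_perf -b 8 -e 1024M -f 2 -g 1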

ERROR:

# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   9688 on demo-pod-1 device  0 [0x01] NVIDIA GH200 480GB
#  Rank  1 Group  0 Pid   9991 on demo-pod-2 device  0 [0x01] NVIDIA GH200 480GB
demo-pod-1:9688:9688 [0] NCCL INFO Bootstrap : Using eth0:192.168.107.33<0>
demo-pod-1:9688:9688 [0] NCCL INFO cudaDriverVersion 12060
demo-pod-1:9688:9688 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
demo-pod-1:9688:9696 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
demo-pod-1:9688:9696 [0] NCCL INFO P2P plugin IBext_v8
demo-pod-1:9688:9696 [0] NCCL INFO NET/IB : Using [0]={[0] mlx5_0:1/RoCE, [1] mlx5_1:1/RoCE} [RO]; OOB eth0:192.168.107.33<0>
demo-pod-1:9688:9696 [0] NCCL INFO Using network IBext_v8
demo-pod-1:9688:9696 [0] NCCL INFO DMA-BUF is available on GPU device 0
demo-pod-1:9688:9696 [0] NCCL INFO ncclCommInitRank comm 0xb67e78a6cc40 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x94704d8cb2f7fab5 - Init START
demo-pod-1:9688:9696 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff
demo-pod-1:9688:9696 [0] NCCL INFO comm 0xb67e78a6cc40 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
demo-pod-1:9688:9696 [0] NCCL INFO Channel 00/04 :    0   1
demo-pod-1:9688:9696 [0] NCCL INFO Channel 01/04 :    0   1
demo-pod-1:9688:9696 [0] NCCL INFO Channel 02/04 :    0   1
demo-pod-1:9688:9696 [0] NCCL INFO Channel 03/04 :    0   1
demo-pod-1:9688:9696 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1
demo-pod-1:9688:9696 [0] NCCL INFO P2P Chunksize set to 131072
demo-pod-1:9688:9696 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
demo-pod-1:9688:9696 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
demo-pod-1:9688:9696 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
demo-pod-1:9688:9696 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
demo-pod-1:9688:9696 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
demo-pod-1:9688:9696 [0] NCCL INFO ncclCommInitRank comm 0xb67e78a6cc40 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 901000 commId 0x94704d8cb2f7fab5 - Init COMPLETE
demo-pod-1:9688:9696 [0] NCCL INFO Init timings: rank 0 nranks 2 total 1.18 (kernels 0.10, bootstrap 1.03, allgathers 0.01, topo 0.02, graphs 0.00, connections 0.01, rest 0.00)
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
demo-pod-1:9688:9701 [0] NCCL INFO Channel 01/1 : 1[0] -> 0[0] [receive] via NET/IBext_v8/0/Shared
demo-pod-1:9688:9701 [0] NCCL INFO Channel 03/1 : 1[0] -> 0[0] [receive] via NET/IBext_v8/0/Shared
demo-pod-1:9688:9701 [0] NCCL INFO Channel 01/1 : 0[0] -> 1[0] [send] via NET/IBext_v8/0/Shared
demo-pod-1:9688:9701 [0] NCCL INFO Channel 03/1 : 0[0] -> 1[0] [send] via NET/IBext_v8/0/Shared
demo-pod-1:9688:9699 [0] NCCL INFO transport/net.cc:700 -> 2
demo-pod-1:9688:9701 [0] NCCL INFO transport.cc:166 -> 2
demo-pod-1:9688:9701 [0] NCCL INFO group.cc:128 -> 2
demo-pod-1:9688:9701 [0] NCCL INFO group.cc:70 -> 2 [Async thread]
demo-pod-1:9688:9688 [0] NCCL INFO group.cc:420 -> 2
demo-pod-1:9688:9688 [0] NCCL INFO group.cc:546 -> 2
demo-pod-1:9688:9699 [0] NCCL INFO proxy.cc:1377 -> 3

demo-pod-1:9688:9699 [0] proxy.cc:1521 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 3
demo-pod-1:9688:9688 [0] NCCL INFO group.cc:101 -> 2
demo-pod-1: Test NCCL failure sendrecv.cu:57 'unhandled system error (run with NCCL_DEBUG=INFO for details) / '
 .. demo-pod-1 pid 9688: Test failure common.cu:381
 .. demo-pod-1 pid 9688: Test failure common.cu:590
 .. demo-pod-1 pid 9688: Test failure sendrecv.cu:103
 .. demo-pod-1 pid 9688: Test failure common.cu:623
 .. demo-pod-1 pid 9688: Test failure common.cu:1078
 .. demo-pod-1 pid 9688: Test failure common.cu:891

demo-pod-1:9688:9699 [60540] include/alloc.h:261 NCCL WARN Cuda failure 'driver shutting down'
demo-pod-1:9688:9699 [909261416] NCCL INFO transport/net.cc:973 -> 1
demo-pod-1:9688:9699 [-128] NCCL INFO proxy.cc:925 -> 1
demo-pod-1:9688:9699 [-128] NCCL INFO proxy.cc:941 -> 1
demo-pod-2: Test NCCL failure sendrecv.cu:57 'unhandled system error (run with NCCL_DEBUG=INFO for details) / '
 .. demo-pod-2 pid 9991: Test failure common.cu:381
 .. demo-pod-2 pid 9991: Test failure common.cu:590
 .. demo-pod-2 pid 9991: Test failure sendrecv.cu:103
 .. demo-pod-2 pid 9991: Test failure common.cu:623
 .. demo-pod-2 pid 9991: Test failure common.cu:1078
 .. demo-pod-2 pid 9991: Test failure common.cu:891
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[44167,1],1]
  Exit code:    3
--------------------------------------------------------------------------

Containers were created with:

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod-1
  annotations:
    k8s.v1.cni.cncf.io/networks: demo-macvlannetwork,demo-macvlannetwork-2
spec:
  nodeSelector:
    # Note: Replace hostname or remove selector altogether
    kubernetes.io/hostname: gh200-2
  restartPolicy: OnFailure
  containers:
  - image: nvcr.io/nvidia/pytorch:24.08-py3
    name: pytorch-2408
    command: ["/bin/sh", "-c"]
    args: ["sleep 5000"]
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_a: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_a: 1
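To confirm that the shared RDMA device actually lands inside the pod, something like the following should list the verbs devices from within it (a sketch, assuming the rdma-core userspace tools are present in the image):

kubectl exec -it demo-pod-1 -- ibv_devinfo -l
kubectl exec -it demo-pod-1 -- ls /dev/infiniband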
kiskra-nvidia commented 2 months ago

Have you verified the connectivity using the RoCE network, or only TCP/IP? Please consult the Troubleshooting section of the NCCL docs, especially https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#rdma-over-converged-ethernet-roce.
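For example, a raw RDMA bandwidth test between the two pods, independent of NCCL, could look roughly like the following, assuming the perftest tools are available in the container, mlx5_0 is the RoCE device to exercise, and GID index 3 corresponds to RoCE v2 on your setup:

# on demo-pod-2 (server side)
ib_write_bw -d mlx5_0 -x 3 --report_gbits
# on demo-pod-1 (client side), pointing at the other pod's RoCE IP
ib_write_bw -d mlx5_0 -x 3 --report_gbits 192.168.232.219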

drikster80 commented 1 month ago

@kiskra-nvidia, thanks for the response. I was able to get it working. I was not setting up RDMA and GPUDirect correctly per NVIDIA's instructions. I also had Istio injection turned on in the namespace, which was attempting to put a proxy between the nodes.
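For anyone hitting the same thing, sidecar injection can be turned off at the namespace level or per pod; roughly like this (the namespace name is just a placeholder):

# disable automatic sidecar injection for the whole namespace
kubectl label namespace my-namespace istio-injection=disabled --overwrite
# or opt a single pod out via its metadata.annotations:
#   sidecar.istio.io/inject: "false"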