drikster80 closed this issue 1 month ago
Have you verified the connectivity using the RoCE network, or only TCP/IP? Please consult the Troubleshooting section of the NCCL docs, especially https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#rdma-over-converged-ethernet-roce.
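One rough way to answer the RoCE-versus-TCP/IP question is to rerun the test with `NCCL_DEBUG=INFO` and check which network NCCL reports. The sketch below assumes the log contains a `Using network <name>` line and that GPUDirect RDMA paths are tagged with `GDRDMA`; the exact wording can differ between NCCL versions, and the sample lines in the comments are illustrative, not taken from this issue. `summarize_nccl_log` is a hypothetical helper name.

```python
import re


def summarize_nccl_log(log_text):
    """Return (network, uses_gdrdma) parsed from NCCL_DEBUG=INFO output.

    network is the token after "Using network" (for example "IB" or
    "Socket"), or None if no such line is found. uses_gdrdma is True when
    any line mentions GDRDMA. Both patterns are assumptions about the log
    format and may need adjusting for a given NCCL version.
    """
    match = re.search(r"Using network (\S+)", log_text)
    network = match.group(1) if match else None
    uses_gdrdma = "GDRDMA" in log_text
    return network, uses_gdrdma


# Illustrative input only (hypothetical host names and IDs):
#   "pod-a:1:1 [0] NCCL INFO NET/IB : No device found."
#   "pod-a:1:1 [0] NCCL INFO Using network Socket"
# For such a log the helper reports ("Socket", False), which would suggest
# NCCL fell back to TCP/IP instead of using the RoCE interface.
```

If the reported network is `Socket`, the pods can reach each other over TCP/IP but NCCL did not pick up an RDMA device, which is the case the troubleshooting page linked above addresses.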
@kiskra-nvidia, thanks for the response. I was able to get it working. I had not set up RDMA and GPUDirect correctly per Nvidia's instructions. I also had Istio injection turned on in the namespace, which was attempting to put a proxy between the nodes.
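For anyone hitting the same Istio problem, here is a minimal sketch of a test pod that opts out of sidecar injection. It is not the manifest used in this issue. The assumptions: the pod label `sidecar.istio.io/inject: "false"` is honored by the installed Istio version (older releases used an annotation with the same key), the RDMA resource name `rdma/rdma_shared_device_a` is a placeholder that must match whatever the Network Operator device plugin actually exposes, and `build_test_pod` is a hypothetical helper. Secondary network attachments are left out.

```python
import json


def build_test_pod(name, image, rdma_resource="rdma/rdma_shared_device_a"):
    """Build a pod manifest (as a dict) with Istio injection disabled.

    rdma_resource is a placeholder; replace it with the resource name
    advertised on your nodes.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # Ask Istio not to inject a sidecar proxy into this pod.
            "labels": {"sidecar.istio.io/inject": "false"},
        },
        "spec": {
            "containers": [
                {
                    "name": "nccl-test",
                    "image": image,
                    # Keep the container alive so tests can be launched later.
                    "command": ["sleep", "infinity"],
                    # RDMA workloads commonly need to pin memory.
                    "securityContext": {"capabilities": {"add": ["IPC_LOCK"]}},
                    "resources": {
                        "limits": {"nvidia.com/gpu": 1, rdma_resource: 1}
                    },
                }
            ]
        },
    }


if __name__ == "__main__":
    pod = build_test_pod("nccl-test-0", "nvcr.io/nvidia/pytorch:24.08-py3")
    print(json.dumps(pod, indent=2))
```

Since `kubectl` accepts JSON manifests, the printed output can be piped into `kubectl apply -f -`. Disabling injection for the whole namespace is the alternative, if none of its pods need the mesh.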
PROBLEM: I'm attempting to run nccl-tests from within Kubernetes pods in order to compare the performance of different RDMA/GPUDirect configurations across Kubernetes.
ENVIRONMENT:
- Physical nodes: GH200 (Grace Hopper/arm64)
- NICs: BlueField-3
- Vanilla Kubernetes: 1.31
- GPU Operator and Nvidia Network Operator installed and set up
I'm using the Nvidia PyTorch container, which has NCCL and MPI pre-installed:
nvcr.io/nvidia/pytorch:24.08-py3
Compiled with:
After setting up SSH between the pods and verifying connectivity, launching with:
ERROR:
Containers were created with: