NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Issues with Limited HCA Utilization and RDMA in Multi-node Training #1392

Closed: asdfry closed this issue 2 months ago

asdfry commented 3 months ago

Hello,

I am testing multi-node training using Accelerate and DeepSpeed on a Kubernetes cluster. The nodes in my test setup are pnode4 (8x A100) and pnode14 (8x A100). While the training process runs smoothly, I have two concerns. I have attached the NCCL debug logs and the topology for reference.

Each node has 8 HCAs connected via InfiniBand. However, monitoring with Grafana shows that only mlx5_0, mlx5_2, mlx5_6, and mlx5_8 are being used. Why are only 4 HCAs used, and how can I utilize all 8?

The NCCL debug logs indicate "GPU Direct RDMA Disabled" for all HCAs. What additional settings are required to enable RDMA?

Thank you for your assistance.

nccl-debug.log nccl-topo.log

kiskra-nvidia commented 3 months ago

FYI, your debug log (which we greatly appreciate, BTW!) is incomplete -- I'm guessing there were multiple NCCL processes writing to it simultaneously, overwriting each other. If you want to dump directly to a file, the files need to be unique for each process. Or simply use the default writing to stdout and redirect it.

What I find weird in your debug output is that even though the IB SHARP plugin is being loaded, it says "Using network IB" (rather than the expected "IBext_v8"), as if the plugin declines to work on your system for some reason. Perhaps it is because of the issue with GPU Direct RDMA? Is the nvidia-peermem kernel module loaded? NCCL looks for /sys/kernel/mm/memory_peers/nv_mem/version or /sys/kernel/mm/memory_peers/nv_mem_nc/version to determine that -- make sure one of them is accessible, including via Kubernetes. With GPU Direct RDMA enabled, all HCAs should be used...
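
For anyone double-checking the same thing, a minimal sketch of how one might verify this on each node (the module and sysfs paths are the ones mentioned above; the modprobe step assumes the installed NVIDIA driver ships nvidia-peermem):

$ lsmod | grep -i peermem            # is the nvidia_peermem module loaded?
$ sudo modprobe nvidia-peermem       # load it if not
$ cat /sys/kernel/mm/memory_peers/nv_mem/version 2>/dev/null || cat /sys/kernel/mm/memory_peers/nv_mem_nc/version    # the paths NCCL probes

In a Kubernetes setup the module has to be loaded on the host, and the /sys path must also be visible from inside the training pods.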

asdfry commented 3 months ago

As you mentioned, after enabling nvidia-peermem, RDMA seems to be enabled, but another error has occurred. Since the NCCL debug logs overlap across multiple processes, I set the environment variable like this and ran it: NCCL_DEBUG_FILE=/root/mnt/output/nccl-debug-%h-%p.log
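
For completeness, the relevant environment looked roughly like this (a sketch; %h expands to the hostname and %p to the process ID, and NCCL_DEBUG=INFO is what makes NCCL write the log in the first place):

$ export NCCL_DEBUG=INFO
$ export NCCL_DEBUG_FILE=/root/mnt/output/nccl-debug-%h-%p.log    # one file per host/PID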

When checking the logs on pnode14, I found a "local catastrophic error". From what I could find on Google, this error is said to be caused by insufficient power. Is this correct?

node4-node14.zip

asdfry commented 3 months ago

To reduce power draw, I limited the GPU power to 150 W on both nodes using nvidia-smi and ran multi-node training with only 2 GPUs and 2 HCAs per node, but the result was the same. Strangely, after this error occurs, the affected GPUs are no longer recognized by the host, and the node must be rebooted.
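
For reference, the power cap was applied with something like this (a sketch; the GPU indices and the exact limit depend on the node):

$ sudo nvidia-smi -pl 150            # cap all GPUs on the node at 150 W
$ sudo nvidia-smi -i 0,1 -pl 150     # or cap only the GPUs used for the test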

node4-node14-2.zip

kiskra-nvidia commented 3 months ago

Sounds like it's possibly some hardware issue? Sorry, I don't know how we could help...

asdfry commented 3 months ago

I have one more question. I set the NCCL_IB_HCA environment variable to use only mlx5_0, mlx5_1, mlx5_2, and mlx5_3, so why is NCCL also trying to use mlx5_10 and mlx5_11? Unlike the other HCAs, mlx5_10 is connected to a different switch over a 100G cable, and mlx5_11 is used for internal Ethernet communication. I would therefore like to avoid using these two HCAs. Is there any way to do that?

$ head -n 14 nccl-debug-pnode4-955.log
pnode4:955:955 [0] NCCL INFO NCCL_SOCKET_IFNAME set to bond0
pnode4:955:955 [0] NCCL INFO NCCL version 2.20.5+cuda12.4
pnode4:955:1065 [0] NCCL INFO Plugin Path : /root/hpcx-v2.19-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/nccl_rdma_sharp_plugin/lib/libnccl-net.so
pnode4:955:1065 [0] NCCL INFO P2P plugin IBext_v8
pnode4:955:1065 [0] NCCL INFO NCCL_SOCKET_IFNAME set to bond0
pnode4:955:1065 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_2:1/IB/SHARP [3]mlx5_3:1/IB/SHARP [4]mlx5_10:1/IB/SHARP [5]mlx5_11:1/RoCE [RO]; OOB bond0:192.168.1.4<0>
pnode4:955:1065 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_1,mlx5_2,mlx5_3
pnode4:955:1065 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_10:1/IB [5]mlx5_11:1/RoCE [RO]; OOB bond0:192.168.1.4<0>
pnode4:955:1065 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 0 'mlx5_0'
pnode4:955:1065 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 1 'mlx5_1'
pnode4:955:1065 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 2 'mlx5_2'
pnode4:955:1065 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 3 'mlx5_3'
pnode4:955:1065 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 4 'mlx5_10'
pnode4:955:1065 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 5 'mlx5_11'

kiskra-nvidia commented 3 months ago

Please consult https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-ib-hca, especially the note in the last paragraph.
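
In short: per that note, the entries in NCCL_IB_HCA are matched as prefixes by default, so mlx5_1 also matches mlx5_10 and mlx5_11. A sketch of the two documented ways to restrict the selection (device names taken from your log):

$ export NCCL_IB_HCA="=mlx5_0,mlx5_1,mlx5_2,mlx5_3"    # '=' forces exact-name matching
$ export NCCL_IB_HCA="^mlx5_10,mlx5_11"                # or '^' to exclude the unwanted HCAs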

asdfry commented 2 months ago

Both servers used for training are now working after I set them up following the documentation! Thank you for all your help so far.