Closed asdfry closed 2 months ago
FYI, your debug log (which we greatly appreciate, BTW!) is incomplete -- I'm guessing there were multiple NCCL processes writing to it simultaneously, overwriting each other. If you want to dump directly to a file, the files need to be unique for each process. Or simply use the default writing to stdout and redirect it.
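If I remember correctly, NCCL_DEBUG_FILE supports %h (hostname) and %p (PID) substitutions, so setting it to something like the following (the path is just an example) gives every process its own file:
NCCL_DEBUG=INFO NCCL_DEBUG_FILE=/tmp/nccl-debug-%h-%p.log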
What I find weird in your debug output is that even though the IB SHARP plugin is being loaded, it says "Using network IB" (rather than the expected "IBext_v8"), as if the plugin declines to work on your system for some reason. Perhaps it is because of the issue with GPU Direct RDMA? Is the nvidia-peermem kernel module loaded? NCCL looks for /sys/kernel/mm/memory_peers/nv_mem/version or /sys/kernel/mm/memory_peers/nv_mem_nc/version to determine that -- make sure one of them is accessible, including via Kubernetes. With GPU Direct RDMA enabled, all HCAs should be used...
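A quick way to check on the host (just a sketch; the module shows up as nvidia_peermem in lsmod):
$ lsmod | grep nvidia_peermem
$ ls /sys/kernel/mm/memory_peers/
# if nothing is there, try loading the module:
$ sudo modprobe nvidia-peermem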
As you mentioned, after enabling nvidia-peermem, RDMA appears to be enabled, but another error has occurred. Since the NCCL debug logs from multiple processes were overwriting each other, I set the environment variable as follows and reran: NCCL_DEBUG_FILE=/root/mnt/output/nccl-debug-%h-%p.log
When checking the logs on pnode14, I found a "local catastrophic error". From what I could find online, this error is said to be caused by a power shortage. Is this correct?
To reduce power draw, I limited the GPU power to 150 W on both nodes using nvidia-smi and ran multi-node training with only 2 GPUs and 2 HCAs per node, but the result was the same. The strange thing is that after this error occurs, the GPUs that were in use are no longer recognized by the host, and a reboot is required.
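For reference, the power cap was applied with roughly this command on each node (flags from memory; -pl sets the per-GPU power limit in watts):
$ sudo nvidia-smi -pl 150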
Sounds like it's possibly some hardware issue? Sorry, I don't know how we could help...
I have one more question. I set the NCCL_IB_HCA environment variable to use only mlx5_0, mlx5_1, mlx5_2, and mlx5_3, so why is NCCL also trying to use mlx5_10 and mlx5_11? Unlike the other HCAs, mlx5_10 has a 100G cable connected and is connected to a different switch, while mlx5_11 is used for internal Ethernet communication. I would therefore like to avoid using these two HCAs. Is there any way to do that?
$ head -n 14 nccl-debug-pnode4-955.log
pnode4:955:955 [0] NCCL INFO NCCL_SOCKET_IFNAME set to bond0
pnode4:955:955 [0] NCCL INFO NCCL version 2.20.5+cuda12.4
pnode4:955:1065 [0] NCCL INFO Plugin Path : /root/hpcx-v2.19-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/nccl_rdma_sharp_plugin/lib/libnccl-net.so
pnode4:955:1065 [0] NCCL INFO P2P plugin IBext_v8
pnode4:955:1065 [0] NCCL INFO NCCL_SOCKET_IFNAME set to bond0
pnode4:955:1065 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_2:1/IB/SHARP [3]mlx5_3:1/IB/SHARP [4]mlx5_10:1/IB/SHARP [5]mlx5_11:1/RoCE [RO]; OOB bond0:192.168.1.4<0>
pnode4:955:1065 [0] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_1,mlx5_2,mlx5_3
pnode4:955:1065 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_10:1/IB [5]mlx5_11:1/RoCE [RO]; OOB bond0:192.168.1.4<0>
pnode4:955:1065 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 0 'mlx5_0'
pnode4:955:1065 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 1 'mlx5_1'
pnode4:955:1065 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 2 'mlx5_2'
pnode4:955:1065 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 3 'mlx5_3'
pnode4:955:1065 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 4 'mlx5_10'
pnode4:955:1065 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for HCA 5 'mlx5_11'
Please consult https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-ib-hca, especially the note in the last paragraph.
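In short, the names are prefix-matched, so mlx5_1 also matches mlx5_10 and mlx5_11. If I read the note right, prefixing the whole list with = forces exact matching, so something like this should restrict NCCL to the four HCAs you want:
NCCL_IB_HCA="=mlx5_0,mlx5_1,mlx5_2,mlx5_3"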
Both servers used for training were fixed by setting them up according to the documentation! Thank you for all your help so far.
Hello,
I am testing multi-node training using Accelerate and DeepSpeed on a Kubernetes cluster. The nodes in my test setup are pnode4 (8x A100) and pnode14 (8x A100). While the training process runs smoothly, I have two concerns. I have attached the NCCL debug logs and the topology for reference.
Each node has 8 HCAs connected via InfiniBand. However, monitoring with Grafana shows that only mlx5_0, mlx5_2, mlx5_6, and mlx5_8 are being used. Why are only 4 HCAs used, and how can I utilize all 8?
The NCCL debug logs indicate "GPU Direct RDMA Disabled" for all HCAs. What additional settings are required to enable RDMA?
Thank you for your assistance.
nccl-debug.log nccl-topo.log