GPU Direct RDMA Disabled for HCA

hiennguyennq commented 2 weeks ago

I train distributed on 2 nodes 8 H100 use repo llama-factory

scripts: Node 1: export NCCL_IB_GID_INDEX=4 export NCCL_IB_HCA=mlx5 export NCCL_IB_DISABLE=0 export NCCL_DEBUG=INFO export NCCL_DEBUG_SUBSYS=ALL export NCCL_SOCKET_IFNAME=net1 FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=0 MASTER_ADDR=10.12.0.21 MASTER_PORT=29550 llamafactory-cli train examples/train_full/llama3_full_sft_ds3.yaml > training_log.txt 2>&1

Node 2: export NCCL_IB_GID_INDEX=4 export NCCL_IB_HCA=mlx5 export NCCL_IB_DISABLE=0 export NCCL_DEBUG=INFO export NCCL_DEBUG_SUBSYS=ALL export NCCL_SOCKET_IFNAME=net1 FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=1 MASTER_ADDR=10.12.0.21 MASTER_PORT=29550 llamafactory-cli train examples/t rain_full/llama3_full_sft_ds3.yaml > training_log.txt 2>&1

show_gids DEV PORT INDEX GID IPv4 VER DEV

mlx5_0 1 4 0000:0000:0000:0000:0000:ffff:0a0c:0015 10.12.0.21 v1 net1 mlx5_0 1 5 0000:0000:0000:0000:0000:ffff:0a0c:0015 10.12.0.21 v2 net1 mlx5_0 1 6 fe80:0000:0000:0000:9cbb:b7ff:feb3:937a v1 net1 mlx5_0 1 7 fe80:0000:0000:0000:9cbb:b7ff:feb3:937a v2 net1 mlx5_1 1 4 0000:0000:0000:0000:0000:ffff:0a0e:0013 10.14.0.19 v1 net2 mlx5_1 1 5 0000:0000:0000:0000:0000:ffff:0a0e:0013 10.14.0.19 v2 net2 mlx5_1 1 6 fe80:0000:0000:0000:187d:40ff:fea6:b7d0 v1 net2 mlx5_1 1 7 fe80:0000:0000:0000:187d:40ff:fea6:b7d0 v2 net2 mlx5_2 1 4 0000:0000:0000:0000:0000:ffff:0a10:0013 10.16.0.19 v1 net3 mlx5_2 1 5 0000:0000:0000:0000:0000:ffff:0a10:0013 10.16.0.19 v2 net3 mlx5_2 1 6 fe80:0000:0000:0000:c8a8:63ff:fe1f:6594 v1 net3 mlx5_2 1 7 fe80:0000:0000:0000:c8a8:63ff:fe1f:6594 v2 net3 mlx5_3 1 4 0000:0000:0000:0000:0000:ffff:0a12:0013 10.18.0.19 v1 net4 mlx5_3 1 5 0000:0000:0000:0000:0000:ffff:0a12:0013 10.18.0.19 v2 net4 mlx5_3 1 6 fe80:0000:0000:0000:0051:fbff:fe24:939e v1 net4 mlx5_3 1 7 fe80:0000:0000:0000:0051:fbff:fe24:939e v2 net4 n_gids_found=16

file log training_log.txt training_log.txt

I check log and see: alt-duc-20241028-02-d624b548-db95-4d73-b0d6-7550b86abdf1-dr6d7v:1494756:1495110 [3] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 0 'mlx5_0' alt-duc-20241028-02-d624b548-db95-4d73-b0d6-7550b86abdf1-dr6d7v:1494756:1495110 [3] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 1 'mlx5_1' alt-duc-20241028-02-d624b548-db95-4d73-b0d6-7550b86abdf1-dr6d7v:1494756:1495110 [3] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 2 'mlx5_2' alt-duc-20241028-02-d624b548-db95-4d73-b0d6-7550b86abdf1-dr6d7v:1494756:1495110 [3] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 3 'mlx5_3' alt-duc-20241028-02-d624b548-db95-4d73-b0d6-7550b86abdf1-dr6d7v:1494756:1495110 [3] NCCL INFO transport/p2p.cc:169 Cuda Alloc Size 2097152 pointer 0x7f2669a00000 alt-duc-20241028-02-d624b548-db95-4d73-b0d6-7550b86abdf1-dr6d7v:1494755:1495218 [2] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 0 'mlx5_0' alt-duc-20241028-02-d624b548-db95-4d73-b0d6-7550b86abdf1-dr6d7v:1494755:1495218 [2] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 1 'mlx5_1' alt-duc-20241028-02-d624b548-db95-4d73-b0d6-7550b86abdf1-dr6d7v:1494755:1495218 [2] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 2 'mlx5_2' alt-duc-20241028-02-d624b548-db95-4d73-b0d6-7550b86abdf1-dr6d7v:1494755:1495218 [2] NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 3 'mlx5_3'

How can I fix the bugs?

AddyLaddy commented 2 weeks ago

Firstly, I'd start with something much simpler like the nccl-tests before throwing llama at a new machine. But a quick scan of those log files has this message:

misc/cudawrap.cc:188 NCCL WARN Failed to find CUDA library libcuda.so (NCCL_CUDA_PATH='') :

So maybe your CUDA installation is incorrect?

I also see messages such as:

bootstrap.cc:77 NCCL WARN Message truncated : received 1024 bytes instead of 256

Which is often caused by different versions of the NCCL library being run on each node.

hiennguyennq commented 2 weeks ago

@AddyLaddy thanks for your help. i was add the scripts: export PATH=/usr/local/cuda-12.4/bin${PATH:+:${PATH}} export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

But then in the log has warning: NCCL WARN Cuda failure 3 'initialization error'

node 2: training_log.txt

hiennguyennq commented 2 weeks ago

@AddyLaddy and i also check nccl version. I don't know where is wrong.

NVIDIA / nccl

GPU Direct RDMA Disabled for HCA #1510