PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Two-node distributed training with RDMA enabled on H800 #65419

Open TimeYWL opened 3 months ago

TimeYWL commented 3 months ago

Describe the Bug

Problem

I am using a simple demo with a few convolution layers and trying to set up two-node, single-GPU-per-node multi-machine training, using IB for inter-node communication, but the run gets stuck at one step and cannot proceed.
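
The actual train.py is not attached; for context, a minimal sketch of what such a demo might look like, assuming the dygraph fleet collective API in Paddle 2.6 (the model, shapes, and data below are illustrative, not the reporter's code):

import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.distributed import fleet

# Initialize collective training; rank/endpoints come from the launcher's environment.
fleet.init(is_collective=True)

class ConvNet(nn.Layer):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2D(3, 16, 3, padding=1)
        self.conv2 = nn.Conv2D(16, 32, 3, padding=1)
        self.fc = nn.Linear(32 * 32 * 32, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return self.fc(paddle.flatten(x, start_axis=1))

model = ConvNet()
opt = paddle.optimizer.Adam(parameters=model.parameters())

# Wrap model and optimizer for collective (multi-node) training.
model = fleet.distributed_model(model)
opt = fleet.distributed_optimizer(opt)

for step in range(10):
    x = paddle.randn([8, 3, 32, 32])
    y = paddle.randint(0, 10, shape=[8, 1])
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    opt.step()
    opt.clear_grad()
    print("step", step, "loss", float(loss))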

Environment

2 nodes × 8 H800 GPUs, Ubuntu 22.04, Paddle 2.6.1, CUDA 12.1, NCCL 2.17.1

env | grep NCCL:

NCCL_DEBUG_SUBSYS=ALL
NCCL_DEBUG=INFO
NCCL_IB_HCA=mlx5
NCCL_IB_DISABLE=0
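
For reference, this is how these variables might be exported before launching; the last line is an assumption added only to show how the out-of-band interface could be pinned and is not part of the original report:

export NCCL_DEBUG_SUBSYS=ALL
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=ens61np0   # assumption: pin NCCL's bootstrap/OOB interface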

IB NICs on the two nodes:

ens61np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 1.1.1.93  netmask 255.0.0.0  broadcast 1.255.255.255
        inet6 fe80::ba3f:d2ff:fe27:bcc6  prefixlen 64  scopeid 0x20<link>
        ether b8:3f:d2:27:bc:c6  txqueuelen 1000  (Ethernet)
        RX packets 23530  bytes 3351359 (3.3 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 34766  bytes 7169803 (7.1 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens69np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 2.2.2.93  netmask 255.0.0.0  broadcast 2.255.255.255
        inet6 fe80::ac0:ebff:fe50:ff42  prefixlen 64  scopeid 0x20<link>
        ether 08:c0:eb:50:ff:42  txqueuelen 1000  (Ethernet)
        RX packets 899  bytes 159670 (159.6 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 611  bytes 43510 (43.5 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens73np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 3.3.3.93  netmask 255.0.0.0  broadcast 3.255.255.255
        inet6 fe80::eaeb:d3ff:fe06:6d5c  prefixlen 64  scopeid 0x20<link>
        ether e8:eb:d3:06:6d:5c  txqueuelen 1000  (Ethernet)
        RX packets 552  bytes 130140 (130.1 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 317  bytes 21962 (21.9 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens78np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 4.4.4.93  netmask 255.0.0.0  broadcast 4.255.255.255
        inet6 fe80::eaeb:d3ff:fe06:6d58  prefixlen 64  scopeid 0x20<link>
        ether e8:eb:d3:06:6d:58  txqueuelen 1000  (Ethernet)
        RX packets 667  bytes 130768 (130.7 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 473  bytes 32722 (32.7 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Launch command

python -m paddle.distributed.launch --gpus 0 --ips 1.1.1.93,1.1.1.94 train.py

Run output

root@hygon-h800-01:/data1/ywl/multi-cards-test#  python -m paddle.distributed.launch --gpus 0 --ips 1.1.1.93,1.1.1.94 train.py
LAUNCH INFO 2024-06-24 16:27:07,917 -----------  Configuration  ----------------------
LAUNCH INFO 2024-06-24 16:27:07,917 auto_parallel_config: None
LAUNCH INFO 2024-06-24 16:27:07,917 auto_tuner_json: None
LAUNCH INFO 2024-06-24 16:27:07,917 devices: 0
LAUNCH INFO 2024-06-24 16:27:07,917 elastic_level: -1
LAUNCH INFO 2024-06-24 16:27:07,917 elastic_timeout: 30
LAUNCH INFO 2024-06-24 16:27:07,917 enable_gpu_log: True
LAUNCH INFO 2024-06-24 16:27:07,917 gloo_port: 6767
LAUNCH INFO 2024-06-24 16:27:07,917 host: None
LAUNCH INFO 2024-06-24 16:27:07,917 ips: 1.1.1.93,1.1.1.94
LAUNCH INFO 2024-06-24 16:27:07,917 job_id: default
LAUNCH INFO 2024-06-24 16:27:07,917 legacy: False
LAUNCH INFO 2024-06-24 16:27:07,917 log_dir: log
LAUNCH INFO 2024-06-24 16:27:07,917 log_level: INFO
LAUNCH INFO 2024-06-24 16:27:07,917 log_overwrite: False
LAUNCH INFO 2024-06-24 16:27:07,917 master: None
LAUNCH INFO 2024-06-24 16:27:07,917 max_restart: 3
LAUNCH INFO 2024-06-24 16:27:07,917 nnodes: 1
LAUNCH INFO 2024-06-24 16:27:07,917 nproc_per_node: None
LAUNCH INFO 2024-06-24 16:27:07,917 rank: -1
LAUNCH INFO 2024-06-24 16:27:07,917 run_mode: collective
LAUNCH INFO 2024-06-24 16:27:07,917 server_num: None
LAUNCH INFO 2024-06-24 16:27:07,917 servers:
LAUNCH INFO 2024-06-24 16:27:07,917 sort_ip: False
LAUNCH INFO 2024-06-24 16:27:07,917 start_port: 6070
LAUNCH INFO 2024-06-24 16:27:07,917 trainer_num: None
LAUNCH INFO 2024-06-24 16:27:07,917 trainers:
LAUNCH INFO 2024-06-24 16:27:07,917 training_script: train.py
LAUNCH INFO 2024-06-24 16:27:07,917 training_script_args: []
LAUNCH INFO 2024-06-24 16:27:07,917 with_gloo: 1
LAUNCH INFO 2024-06-24 16:27:07,917 --------------------------------------------------
LAUNCH INFO 2024-06-24 16:27:07,918 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2024-06-24 16:27:07,919 Run Pod: ditmyj, replicas 1, status ready
LAUNCH INFO 2024-06-24 16:27:07,928 Watching Pod: ditmyj, replicas 1, status running
[2024-06-24 16:27:09,048] [    INFO] distributed_strategy.py:214 - distributed strategy initialized
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
I0624 16:27:09.048821 703010 tcp_utils.cc:181] The server starts to listen on IP_ANY:6070
I0624 16:27:09.048971 703010 tcp_utils.cc:130] Successfully connected to 1.1.1.93:6070
I0624 16:27:11.633677 703010 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
[2024-06-24 16:27:11,634] [    INFO] topology.py:358 - Total 2 pipe comm group(s) create successfully!
W0624 16:27:11.638200 703010 gpu_resources.cc:106] The GPU compute capability in your current machine is 90, which is not supported by Paddle, it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0624 16:27:11.638235 703010 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 9.0, Driver API Version: 12.2, Runtime API Version: 11.8
W0624 16:27:11.639724 703010 gpu_resources.cc:164] device: 0, cuDNN Version: 9.2.
/usr/local/lib/python3.10/dist-packages/paddle/distributed/communication/group.py:114: UserWarning: Current global rank 0 is not in group _default_pg10
  warnings.warn(
[2024-06-24 16:27:11,875] [    INFO] topology.py:358 - Total 2 data comm group(s) create successfully!
I0624 16:27:11.875936 703010 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
[2024-06-24 16:27:11,875] [    INFO] topology.py:358 - Total 1 model comm group(s) create successfully!
[2024-06-24 16:27:11,876] [    INFO] topology.py:358 - Total 2 sharding comm group(s) create successfully!
I0624 16:27:11.876040 703010 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
I0624 16:27:11.876056 703010 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
[2024-06-24 16:27:11,876] [    INFO] topology.py:288 - HybridParallelInfo: rank_id: 0, mp_degree: 2, sharding_degree: 1, pp_degree: 1, dp_degree: 1, sep_degree: 1, mp_group: [0, 1],  sharding_group: [0], pp_group: [0], dp_group: [0], sep:group: None, check/clip group: [0, 1]
start distribute model
[2024-06-24 16:27:12,033] [    INFO] tensor_parallel.py:33 - start broadcast mp parameters
hygon-h800-01:703010:703010 [0] NCCL INFO Bootstrap : Using ens61np0:1.1.1.93<0>
hygon-h800-01:703010:703010 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
hygon-h800-01:703010:703010 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
hygon-h800-01:703010:703010 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.17.1+cuda12.3
hygon-h800-01:703010:703010 [0] NCCL INFO init.cc:1301 Cuda Host Alloc Size 4 pointer 0x7fe874400000
hygon-h800-01:703010:703158 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
hygon-h800-01:703010:703158 [0] NCCL INFO NCCL_IB_HCA set to mlx5
hygon-h800-01:703010:703158 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB ens61np0:1.1.1.93<0>
hygon-h800-01:703010:703158 [0] NCCL INFO Using network IB

In the end, the run hangs here and makes no further progress.

Additional Supplementary Information

No response

nemonameless commented 3 months ago

Hello, judging from the log, master: None, nnodes: 1, rank: -1 all look problematic; the job appears to have been treated as single-machine training. Please first check how the job is being launched and the launch command.
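
For example, a two-node launch can also be written with the --master/--nnodes form of paddle.distributed.launch; the master address and port below are illustrative, and the same command is run on both nodes:

python -m paddle.distributed.launch --master 1.1.1.93:8090 --nnodes 2 --gpus 0 train.py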