TimeYWL opened this issue 3 months ago
I'm using a simple demo with a few convolution layers and trying to set up two-node, one-GPU-per-node distributed training, using IB for inter-node communication, but the job hangs at a certain step and cannot proceed.
Environment: 2 nodes x 8 H800 GPUs, Ubuntu 22.04, Paddle 2.6.1, CUDA 12.1, NCCL 2.17.1
`env | grep NCCL`:

```
NCCL_DEBUG_SUBSYS=ALL
NCCL_DEBUG=INFO
NCCL_IB_HCA=mlx5
NCCL_IB_DISABLE=0
```
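When NCCL's bootstrap (out-of-band TCP) traffic goes over the wrong interface, distributed init can hang even though the IB HCAs are fine, so explicitly pinning the socket interface is a common mitigation. A minimal sketch; the interface name `ens61np0` is taken from the ifconfig output in this report and is an assumption that must be adjusted per node:

```shell
# Sketch: restrict NCCL's bootstrap/OOB socket to a known-good interface,
# keeping the debug settings already in use. ens61np0 comes from this
# machine's ifconfig output -- adjust it on each node.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5
export NCCL_SOCKET_IFNAME=ens61np0   # bootstrap/OOB interface
```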
IB NICs on the two nodes:
```
ens61np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 1.1.1.93  netmask 255.0.0.0  broadcast 1.255.255.255
        inet6 fe80::ba3f:d2ff:fe27:bcc6  prefixlen 64  scopeid 0x20<link>
        ether b8:3f:d2:27:bc:c6  txqueuelen 1000  (Ethernet)
        RX packets 23530  bytes 3351359 (3.3 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 34766  bytes 7169803 (7.1 MB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens69np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 2.2.2.93  netmask 255.0.0.0  broadcast 2.255.255.255
        inet6 fe80::ac0:ebff:fe50:ff42  prefixlen 64  scopeid 0x20<link>
        ether 08:c0:eb:50:ff:42  txqueuelen 1000  (Ethernet)
        RX packets 899  bytes 159670 (159.6 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 611  bytes 43510 (43.5 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens73np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 3.3.3.93  netmask 255.0.0.0  broadcast 3.255.255.255
        inet6 fe80::eaeb:d3ff:fe06:6d5c  prefixlen 64  scopeid 0x20<link>
        ether e8:eb:d3:06:6d:5c  txqueuelen 1000  (Ethernet)
        RX packets 552  bytes 130140 (130.1 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 317  bytes 21962 (21.9 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens78np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 4.4.4.93  netmask 255.0.0.0  broadcast 4.255.255.255
        inet6 fe80::eaeb:d3ff:fe06:6d58  prefixlen 64  scopeid 0x20<link>
        ether e8:eb:d3:06:6d:58  txqueuelen 1000  (Ethernet)
        RX packets 667  bytes 130768 (130.7 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 473  bytes 32722 (32.7 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
```
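Since the NCCL log reports `mlx5_0`..`mlx5_3` as RoCE ports, it is worth confirming which Ethernet netdev each HCA maps to, that the ports are up, and that the peer node is reachable over the same subnet. A hedged sketch, assuming the Mellanox OFED userspace tools are installed:

```shell
# Sketch (assumes Mellanox OFED tools are installed):
# map each RDMA device to its Ethernet netdev and check link state.
ibdev2netdev
ibstat | grep -E 'CA |State|Rate'

# The peer's address on the matching subnet must be reachable from the
# interface the HCA uses (1.1.1.94 is the second node in this report):
ping -c 3 -I ens61np0 1.1.1.94
```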
```
python -m paddle.distributed.launch --gpus 0 --ips 1.1.1.93,1.1.1.94 train.py
```
```
root@hygon-h800-01:/data1/ywl/multi-cards-test# python -m paddle.distributed.launch --gpus 0 --ips 1.1.1.93,1.1.1.94 train.py
LAUNCH INFO 2024-06-24 16:27:07,917 ----------- Configuration ----------------------
LAUNCH INFO 2024-06-24 16:27:07,917 auto_parallel_config: None
LAUNCH INFO 2024-06-24 16:27:07,917 auto_tuner_json: None
LAUNCH INFO 2024-06-24 16:27:07,917 devices: 0
LAUNCH INFO 2024-06-24 16:27:07,917 elastic_level: -1
LAUNCH INFO 2024-06-24 16:27:07,917 elastic_timeout: 30
LAUNCH INFO 2024-06-24 16:27:07,917 enable_gpu_log: True
LAUNCH INFO 2024-06-24 16:27:07,917 gloo_port: 6767
LAUNCH INFO 2024-06-24 16:27:07,917 host: None
LAUNCH INFO 2024-06-24 16:27:07,917 ips: 1.1.1.93,1.1.1.94
LAUNCH INFO 2024-06-24 16:27:07,917 job_id: default
LAUNCH INFO 2024-06-24 16:27:07,917 legacy: False
LAUNCH INFO 2024-06-24 16:27:07,917 log_dir: log
LAUNCH INFO 2024-06-24 16:27:07,917 log_level: INFO
LAUNCH INFO 2024-06-24 16:27:07,917 log_overwrite: False
LAUNCH INFO 2024-06-24 16:27:07,917 master: None
LAUNCH INFO 2024-06-24 16:27:07,917 max_restart: 3
LAUNCH INFO 2024-06-24 16:27:07,917 nnodes: 1
LAUNCH INFO 2024-06-24 16:27:07,917 nproc_per_node: None
LAUNCH INFO 2024-06-24 16:27:07,917 rank: -1
LAUNCH INFO 2024-06-24 16:27:07,917 run_mode: collective
LAUNCH INFO 2024-06-24 16:27:07,917 server_num: None
LAUNCH INFO 2024-06-24 16:27:07,917 servers:
LAUNCH INFO 2024-06-24 16:27:07,917 sort_ip: False
LAUNCH INFO 2024-06-24 16:27:07,917 start_port: 6070
LAUNCH INFO 2024-06-24 16:27:07,917 trainer_num: None
LAUNCH INFO 2024-06-24 16:27:07,917 trainers:
LAUNCH INFO 2024-06-24 16:27:07,917 training_script: train.py
LAUNCH INFO 2024-06-24 16:27:07,917 training_script_args: []
LAUNCH INFO 2024-06-24 16:27:07,917 with_gloo: 1
LAUNCH INFO 2024-06-24 16:27:07,917 --------------------------------------------------
LAUNCH INFO 2024-06-24 16:27:07,918 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2024-06-24 16:27:07,919 Run Pod: ditmyj, replicas 1, status ready
LAUNCH INFO 2024-06-24 16:27:07,928 Watching Pod: ditmyj, replicas 1, status running
[2024-06-24 16:27:09,048] [    INFO] distributed_strategy.py:214 - distributed strategy initialized
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
I0624 16:27:09.048821 703010 tcp_utils.cc:181] The server starts to listen on IP_ANY:6070
I0624 16:27:09.048971 703010 tcp_utils.cc:130] Successfully connected to 1.1.1.93:6070
I0624 16:27:11.633677 703010 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
[2024-06-24 16:27:11,634] [    INFO] topology.py:358 - Total 2 pipe comm group(s) create successfully!
W0624 16:27:11.638200 703010 gpu_resources.cc:106] The GPU compute capability in your current machine is 90, which is not supported by Paddle, it is recommended to install the corresponding wheel package according to the installation information on the official Paddle website.
W0624 16:27:11.638235 703010 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 9.0, Driver API Version: 12.2, Runtime API Version: 11.8
W0624 16:27:11.639724 703010 gpu_resources.cc:164] device: 0, cuDNN Version: 9.2.
/usr/local/lib/python3.10/dist-packages/paddle/distributed/communication/group.py:114: UserWarning: Current global rank 0 is not in group _default_pg10
  warnings.warn(
[2024-06-24 16:27:11,875] [    INFO] topology.py:358 - Total 2 data comm group(s) create successfully!
I0624 16:27:11.875936 703010 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
[2024-06-24 16:27:11,875] [    INFO] topology.py:358 - Total 1 model comm group(s) create successfully!
[2024-06-24 16:27:11,876] [    INFO] topology.py:358 - Total 2 sharding comm group(s) create successfully!
I0624 16:27:11.876040 703010 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
I0624 16:27:11.876056 703010 process_group_nccl.cc:129] ProcessGroupNCCL pg_timeout_ 1800000
[2024-06-24 16:27:11,876] [    INFO] topology.py:288 - HybridParallelInfo: rank_id: 0, mp_degree: 2, sharding_degree: 1, pp_degree: 1, dp_degree: 1, sep_degree: 1, mp_group: [0, 1], sharding_group: [0], pp_group: [0], dp_group: [0], sep:group: None, check/clip group: [0, 1]
start distribute model
[2024-06-24 16:27:12,033] [    INFO] tensor_parallel.py:33 - start broadcast mp parameters
hygon-h800-01:703010:703010 [0] NCCL INFO Bootstrap : Using ens61np0:1.1.1.93<0>
hygon-h800-01:703010:703010 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
hygon-h800-01:703010:703010 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
hygon-h800-01:703010:703010 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.17.1+cuda12.3
hygon-h800-01:703010:703010 [0] NCCL INFO init.cc:1301 Cuda Host Alloc Size 4 pointer 0x7fe874400000
hygon-h800-01:703010:703158 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
hygon-h800-01:703010:703158 [0] NCCL INFO NCCL_IB_HCA set to mlx5
hygon-h800-01:703010:703158 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB ens61np0:1.1.1.93<0>
hygon-h800-01:703010:703158 [0] NCCL INFO Using network IB
```
In the end, the job hangs here and never makes further progress.
Hi, judging from the log, `master: None`, `nnodes: 1`, and `rank: -1` all look wrong; the job appears to have been launched as single-node training. I'd suggest first reviewing how you are running it and checking the launch command.
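One way to make the two-node intent explicit is the master/nnodes form of `paddle.distributed.launch`. A sketch, assuming Paddle 2.6's collective launch flags; the endpoint `1.1.1.93:6070` is an assumption (any reachable IP:port on the first node works), and the identical command must be run on both nodes:

```shell
# Sketch: explicit two-node, one-GPU-per-node launch.
# Run this SAME command on both 1.1.1.93 and 1.1.1.94.
python -m paddle.distributed.launch \
    --master 1.1.1.93:6070 \
    --nnodes 2 \
    --gpus 0 \
    train.py
```

With the `--ips` form the launcher infers each node's rank from its position in the list, so the command likewise has to be executed on every node listed, not just the first one.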