alibaba / Pai-Megatron-Patch

The official repository of Pai-Megatron-Patch, developed by Alibaba Cloud for large-scale LLM & VLM training.
Apache License 2.0
674 stars · 94 forks

Excuse me, a question about multi-node training #307

Closed CallmeZhangChenchen closed 2 months ago

CallmeZhangChenchen commented 2 months ago

Multi-node training across two machines

10.25.117.11

```
bond0.137: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.25.117.11  netmask 255.255.255.0  broadcast 10.25.117.255
        inet6 fe80::268a:7ff:feb7:4df2  prefixlen 64  scopeid 0x20<link>
        ether 24:8a:07:b7:4d:f2  txqueuelen 1000  (Ethernet)
        RX packets 88954399  bytes 674039906438 (627.7 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 99697407  bytes 823309078853 (766.7 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        inet6 fe80::42:9cff:fe1a:15a3  prefixlen 64  scopeid 0x20<link>
        ether 02:42:9c:1a:15:a3  txqueuelen 0  (Ethernet)
        RX packets 54407914  bytes 40798973786 (37.9 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 22179456  bytes 110587186015 (102.9 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
```


10.25.112.55

```
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
        inet6 fe80::6efe:54ff:fe6b:a44c  prefixlen 64  scopeid 0x20<link>
        ether 6c:fe:54:6b:a4:4c  txqueuelen 1000  (Ethernet)
        RX packets 241030415  bytes 317399195309 (295.6 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 153855256  bytes 148307704917 (138.1 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

bond0.133: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.25.112.55  netmask 255.255.254.0  broadcast 10.25.113.255
        inet6 fe80::6efe:54ff:fe6b:a44c  prefixlen 64  scopeid 0x20<link>
        ether 6c:fe:54:6b:a4:4c  txqueuelen 1000  (Ethernet)
        RX packets 79280870  bytes 307355975852 (286.2 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 83545680  bytes 144470622113 (134.5 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.28.0.1  netmask 255.255.0.0  broadcast 172.28.255.255
        inet6 fe80::42:3bff:febd:e88d  prefixlen 64  scopeid 0x20<link>
        ether 02:42:3b:bd:e8:8d  txqueuelen 0  (Ethernet)
        RX packets 38467968  bytes 33168475337 (30.8 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11578107  bytes 8172382370 (7.6 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
```


 - Container startup

```shell
docker run -it --gpus all --network host -p 55554:55554 **
```
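One note on the command above (standard Docker behavior, independent of this repo): with `--network host` the container shares the host's network namespace, so `-p`/`--publish` mappings are ignored and the `-p 55554:55554` flag has no effect. A minimal sketch, where `<image>` is a placeholder for whatever image tag is actually used:

```shell
# --network host: the container uses the host's interfaces directly,
# so no port publishing is needed (or honored).
docker run -it --gpus all --network host <image>

# After launching training inside the container, confirm on the host
# that the master port is listening:
ss -ltn | grep 55554
```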


 - Script configuration
10.25.117.11

```shell
ENV=$1
if [ $ENV = dsw ]; then
export CUDA_VISIBLE_DEVICES=0,1,2,3
MASTER_ADDR=10.25.117.11
MASTER_PORT=55554
NNODES=2
NODE_RANK=0
GPUS_PER_NODE=4
```

10.25.112.55

```shell
ENV=$1
if [ $ENV = dsw ]; then
export CUDA_VISIBLE_DEVICES=0,1,2,3
MASTER_ADDR=10.25.117.11
MASTER_PORT=55554
NNODES=2
NODE_RANK=1
GPUS_PER_NODE=4
```
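Before launching with this configuration, it can help to confirm plain TCP reachability of `MASTER_ADDR:MASTER_PORT` from the other node, which separates network problems from NCCL problems. A minimal check, assuming `nc` (netcat) is installed in both containers:

```shell
# On the master node (10.25.117.11): listen on the rendezvous port.
nc -l 55554

# On the other node (10.25.112.55): test TCP reachability over IPv4.
# Success here means the port is open; failure means the problem is
# in routing/firewalling, not in the training scripts.
nc -vz 10.25.117.11 55554
```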


 - Training procedure:
```shell
export NCCL_SOCKET_IFNAME=bond0
sh run_finetune_qwen.sh  dsw  0.5B   1    8 1e-5   1e-6   128  128  bf16  2   1  1 sel  true   false false   true   100  ./qwen-datasets/alpaca_zh-qwen-train.json   ./qwen-datasets/alpaca_zh-qwen-valid.json   ./qwen-ckpts/Qwen2-0.5B-to-mcore-tp2-pp1-ep1/   1000   10   ./output_mcore_qwen
```

Error analysis: when 10.25.117.11 connects to 10.25.112.55 it fails with `Connect to fe80::6efe:54ff:fe6b:a44c%usb0<12103>`, yet `ping6 fe80::6efe:54ff:fe6b:a44c%bond0` succeeds. When 10.25.112.55 connects to 10.25.117.11 it fails with `Connect to fe80::268a:7ff:feb7:4df2%bond0.133<19491>`, yet `ping6 fe80::268a:7ff:feb7:4df2%bond0` succeeds.

I don't understand why, when training starts, the interface name after the `%` in the IPv6 link-local address doesn't match. I set the environment variable `export NCCL_SOCKET_IFNAME=bond0`, but it has no effect.
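A possible explanation, based on NCCL's documented environment-variable semantics (not verified on this exact setup): `NCCL_SOCKET_IFNAME` matches interface names by prefix, so the value `bond0` also matches `bond0.133` and `bond0.137`; a leading `=` forces an exact match, and a leading `^` excludes interfaces. Forcing IPv4 also sidesteps the link-local `%scope` problem entirely. A sketch of settings one might try on each node:

```shell
# Pin NCCL to exactly the VLAN sub-interface that carries the IPv4
# address. '=' means exact match; a bare 'bond0' is a prefix match
# that also matches bond0.133 / bond0.137.
export NCCL_SOCKET_IFNAME="=bond0.137"     # on 10.25.117.11
# export NCCL_SOCKET_IFNAME="=bond0.133"   # on 10.25.112.55

# Alternatively, exclude the interfaces NCCL picked by mistake:
# export NCCL_SOCKET_IFNAME="^usb0,docker0"

# Prefer IPv4 so NCCL never uses fe80::...%ifname addresses:
export NCCL_SOCKET_FAMILY=AF_INET

# PyTorch's rendezvous/Gloo layer chooses its own interface, too:
export GLOO_SOCKET_IFNAME=bond0.137
```

These must be exported on every node before the launch script runs, since NCCL reads them at initialization time.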

CallmeZhangChenchen commented 2 months ago

With `export NCCL_DEBUG=INFO` enabled, I see messages similar to the error above:

```
l117-11-p-ga:3330:5736 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::6efe:54ff:fe6b:a44c%usb0<25999> failed : Network is unreachable
l112-55-p-ga:3103:5301 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::268a:7ff:feb7:4df2%bond0.133<49025> failed : Software caused connection abort
```
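To see which interfaces NCCL enumerates and why it chose them, the debug output can be narrowed to the relevant subsystems (both variables are documented NCCL settings; shown here as a sketch):

```shell
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET   # focus on bootstrap and network selection

# Lines such as
#   NCCL INFO NET/Socket : Using [0]bond0.137:10.25.117.11<0>
# then show the interface/address pairs NCCL discovered on each rank,
# which makes a wrong interface choice visible immediately.
```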

So it seems there is still something wrong with my network. I would appreciate any pointers.

The image I used is nvcr.io/nvidia/pytorch:xx.xx-py3, all defaults, nothing changed.

CallmeZhangChenchen commented 2 months ago

It turned out to be a problem with these two machines.

Tested on two machines in the same subnet, and training ran fine.

yuanzhiyong1999 commented 1 week ago

@CallmeZhangChenchen Hello, I'm using the image nvcr.io/nvidia/pytorch:24.07-py3, but model conversion fails with `no module named megatron`. Is this caused by the image? Which version did you use exactly?

yuanzhiyong1999 commented 1 week ago

I did read the docs. The point is that your reply says you used an NVIDIA image, so I asked which version. What's wrong with that? If you don't want to answer, fine, but behaving like this just makes you look like an idiot…

---- Original message ---- | From | @.> | | Date | 2024-09-27 18:37 | | To | alibaba/Pai-Megatron-Patch @.> | | Cc | Zhiyong @.>, Comment @.> | | Subject | Re: [alibaba/Pai-Megatron-Patch] Excuse me, a question about multi-node training (Issue #307) |

How interesting. Do you not read the docs?