alibaba / Pai-Megatron-Patch

The official repository of Pai-Megatron-Patch, developed by Alibaba Cloud for large-scale LLM & VLM training.
Apache License 2.0
674 stars · 94 forks

Excuse me, a question about multi-node training #307

Closed CallmeZhangChenchen closed 2 months ago

CallmeZhangChenchen commented 2 months ago

Multi-node training across two machines

10.25.117.11

```
bond0.137: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.25.117.11  netmask 255.255.255.0  broadcast 10.25.117.255
        inet6 fe80::268a:7ff:feb7:4df2  prefixlen 64  scopeid 0x20<link>
        ether 24:8a:07:b7:4d:f2  txqueuelen 1000  (Ethernet)
        RX packets 88954399  bytes 674039906438 (627.7 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 99697407  bytes 823309078853 (766.7 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        inet6 fe80::42:9cff:fe1a:15a3  prefixlen 64  scopeid 0x20<link>
        ether 02:42:9c:1a:15:a3  txqueuelen 0  (Ethernet)
        RX packets 54407914  bytes 40798973786 (37.9 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 22179456  bytes 110587186015 (102.9 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
```


10.25.112.55

```
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
        inet6 fe80::6efe:54ff:fe6b:a44c  prefixlen 64  scopeid 0x20<link>
        ether 6c:fe:54:6b:a4:4c  txqueuelen 1000  (Ethernet)
        RX packets 241030415  bytes 317399195309 (295.6 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 153855256  bytes 148307704917 (138.1 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

bond0.133: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.25.112.55  netmask 255.255.254.0  broadcast 10.25.113.255
        inet6 fe80::6efe:54ff:fe6b:a44c  prefixlen 64  scopeid 0x20<link>
        ether 6c:fe:54:6b:a4:4c  txqueuelen 1000  (Ethernet)
        RX packets 79280870  bytes 307355975852 (286.2 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 83545680  bytes 144470622113 (134.5 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.28.0.1  netmask 255.255.0.0  broadcast 172.28.255.255
        inet6 fe80::42:3bff:febd:e88d  prefixlen 64  scopeid 0x20<link>
        ether 02:42:3b:bd:e8:8d  txqueuelen 0  (Ethernet)
        RX packets 38467968  bytes 33168475337 (30.8 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11578107  bytes 8172382370 (7.6 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
```


 - Container startup

```shell
docker run -it --gpus all --network host -p 55554:55554 **
```
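One note on the command above (standard Docker behavior, independent of this repo): with `--network host` the container shares the host's network namespace, so `-p`/`--publish` mappings are ignored and the `-p 55554:55554` flag has no effect. A minimal sketch, where `<image>` is a placeholder for whatever image tag is actually used:

```shell
# --network host: the container uses the host's interfaces directly,
# so no port publishing is needed (or honored).
docker run -it --gpus all --network host <image>

# After launching training inside the container, confirm on the host
# that the master port is listening:
ss -ltn | grep 55554
```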


 - Script configuration
10.25.117.11

```shell
ENV=$1
if [ $ENV = dsw ]; then
export CUDA_VISIBLE_DEVICES=0,1,2,3
MASTER_ADDR=10.25.117.11
MASTER_PORT=55554
NNODES=2
NODE_RANK=0
GPUS_PER_NODE=4
```

10.25.112.55

```shell
ENV=$1
if [ $ENV = dsw ]; then
export CUDA_VISIBLE_DEVICES=0,1,2,3
MASTER_ADDR=10.25.117.11
MASTER_PORT=55554
NNODES=2
NODE_RANK=1
GPUS_PER_NODE=4
```
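Before launching with this configuration, it can help to confirm plain TCP reachability of `MASTER_ADDR:MASTER_PORT` from the other node, which separates network problems from NCCL problems. A minimal check, assuming `nc` (netcat) is installed in both containers:

```shell
# On the master node (10.25.117.11): listen on the rendezvous port.
nc -l 55554

# On the other node (10.25.112.55): test TCP reachability over IPv4.
# Success here means the port is open; failure means the problem is
# in routing/firewalling, not in the training scripts.
nc -vz 10.25.117.11 55554
```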


 - Training procedure:
```shell
export NCCL_SOCKET_IFNAME=bond0
sh run_finetune_qwen.sh  dsw  0.5B   1    8 1e-5   1e-6   128  128  bf16  2   1  1 sel  true   false false   true   100  ./qwen-datasets/alpaca_zh-qwen-train.json   ./qwen-datasets/alpaca_zh-qwen-valid.json   ./qwen-ckpts/Qwen2-0.5B-to-mcore-tp2-pp1-ep1/   1000   10   ./output_mcore_qwen
```

Error analysis: when 10.25.117.11 connects to 10.25.112.55 it fails with `Connect to fe80::6efe:54ff:fe6b:a44c%usb0<12103>`, yet `ping6 fe80::6efe:54ff:fe6b:a44c%bond0` succeeds. When 10.25.112.55 connects to 10.25.117.11 it fails with `Connect to fe80::268a:7ff:feb7:4df2%bond0.133<19491>`, yet `ping6 fe80::268a:7ff:feb7:4df2%bond0` succeeds.

I don't understand why, when training starts, the interface name after the `%` in the IPv6 link-local address doesn't match. I set the environment variable `export NCCL_SOCKET_IFNAME=bond0`, but it has no effect.
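A possible explanation, based on NCCL's documented environment-variable semantics (not verified on this exact setup): `NCCL_SOCKET_IFNAME` matches interface names by prefix, so the value `bond0` also matches `bond0.133` and `bond0.137`; a leading `=` forces an exact match, and a leading `^` excludes interfaces. Forcing IPv4 also sidesteps the link-local `%scope` problem entirely. A sketch of settings one might try on each node:

```shell
# Pin NCCL to exactly the VLAN sub-interface that carries the IPv4
# address. '=' means exact match; a bare 'bond0' is a prefix match
# that also matches bond0.133 / bond0.137.
export NCCL_SOCKET_IFNAME="=bond0.137"     # on 10.25.117.11
# export NCCL_SOCKET_IFNAME="=bond0.133"   # on 10.25.112.55

# Alternatively, exclude the interfaces NCCL picked by mistake:
# export NCCL_SOCKET_IFNAME="^usb0,docker0"

# Prefer IPv4 so NCCL never uses fe80::...%ifname addresses:
export NCCL_SOCKET_FAMILY=AF_INET

# PyTorch's rendezvous/Gloo layer chooses its own interface, too:
export GLOO_SOCKET_IFNAME=bond0.137
```

These must be exported on every node before the launch script runs, since NCCL reads them at initialization time.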

CallmeZhangChenchen commented 2 months ago

With `export NCCL_DEBUG=INFO` enabled, I see messages similar to the error above:

```
l117-11-p-ga:3330:5736 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::6efe:54ff:fe6b:a44c%usb0<25999> failed : Network is unreachable
l112-55-p-ga:3103:5301 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::268a:7ff:feb7:4df2%bond0.133<49025> failed : Software caused connection abort
```
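To see which interfaces NCCL enumerates and why it chose them, the debug output can be narrowed to the relevant subsystems (both variables are documented NCCL settings; shown here as a sketch):

```shell
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET   # focus on bootstrap and network selection

# Lines such as
#   NCCL INFO NET/Socket : Using [0]bond0.137:10.25.117.11<0>
# then show the interface/address pairs NCCL discovered on each rank,
# which makes a wrong interface choice visible immediately.
```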

So it seems there is still something wrong with my network. I would appreciate any pointers.

The image I used is nvcr.io/nvidia/pytorch:xx.xx-py3, all defaults, nothing changed.

CallmeZhangChenchen commented 2 months ago

It turned out to be a problem with these two machines.

Tested on two machines in the same subnet, and training ran fine.

yuanzhiyong1999 commented 1 week ago

@CallmeZhangChenchen Hello, I'm using the image nvcr.io/nvidia/pytorch:24.07-py3, but model conversion fails with `no module named megatron`. Is this caused by the image? Which version did you use exactly?

yuanzhiyong1999 commented 1 week ago

I did read the docs. The point is that your reply says you used an NVIDIA image, so I asked which version. What's wrong with that? If you don't want to answer, fine, but behaving like this just makes you look like an idiot…

---- Original message ---- | From | @.> | | Date | 2024-09-27 18:37 | | To | alibaba/Pai-Megatron-Patch @.> | | Cc | Zhiyong @.>, Comment @.> | | Subject | Re: [alibaba/Pai-Megatron-Patch] Excuse me, a question about multi-node training (Issue #307) |

How interesting. Do you not read the docs?