PaddlePaddle / Paddle

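PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the 『飞桨』 (PaddlePaddle) core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)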
http://www.paddlepaddle.org/
Apache License 2.0
22.29k stars, 5.62k forks

Multi-node P800 distributed training fails in a container network environment #69220

Open yalbaba opened 2 weeks ago

yalbaba commented 2 weeks ago

Describe the Bug

Error message: we are running an XPU training job over a container network and hit the following error.

Traceback (most recent call last):
  File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/finetune.py", line 13, in <module>
  File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/train.py", line 402, in main
  File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddlenlp/trainer/argparser.py", line 233, in parse_args_into_dataclasses
  File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddlenlp/trainer/argparser.py", line 243, in common_parse
  File "<string>", line 108, in __init__
  File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddlenlp/trainer/training_args.py", line 1223, in __post_init__
  File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/fleet/fleet.py", line 340, in init
  File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/fleet/fleet.py", line 727, in _init_hybrid_parallel_env
  File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/fleet/base/topology.py", line 218, in __init__
  File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/communication/all_reduce.py", line 89, in all_reduce
  File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/communication/stream/all_reduce.py", line 157, in all_reduce
  File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/communication/stream/all_reduce.py", line 51, in _all_reduce_in_dygraph
ValueError: (InvalidArgument) TCP send error. Details: Broken pipe.
  [Hint: Expected byte_sent > 0, but received byte_sent:-1 <= 0:0.] (at /root/paddlejob/workspace/env_run/Paddle/paddle/phi/core/distributed/store/tcp_utils.h:83)

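In the container network environment, the Ethernet NIC eth0 and the RoCE NICs xgbe2, xgbe3, xgbe4, and xgbe5 all pass ping and ib_send_bw tests. Training also runs normally in host network mode, but it fails under the container network.

Additional Supplementary Information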

No response

westfish commented 2 weeks ago

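Your problem is most likely caused by the container network configuration, which is making communication between nodes fail. You may need to adjust the container's network settings so that the required ports and network interfaces are correctly mapped and configured. In particular, confirm that the container's network ports are mapped correctly, and check the firewall configuration to make sure the ports used for training are allowed through.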
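
One way to check the port mapping is a plain TCP connection test, run from inside each container against the other nodes' training endpoints. The traceback shows the failure in the TCP store layer (tcp_utils.h), so ping and ib_send_bw passing does not by itself prove those TCP ports are reachable. Below is a minimal sketch using only the Python standard library. The helper names `can_connect` and `check_endpoints` are hypothetical and not part of Paddle, and it tests TCP reachability only, not the RoCE data path.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_endpoints(endpoints, timeout: float = 3.0) -> dict:
    """Map each "host:port" string to whether it is reachable over TCP."""
    results = {}
    for endpoint in endpoints:
        # Split on the last colon so the port is always the final field.
        host, _, port = endpoint.rpartition(":")
        results[endpoint] = can_connect(host, int(port), timeout)
    return results
```

For example, calling `check_endpoints(["10.0.0.1:6170", "10.0.0.2:6170"])` from each container would show which peers are unreachable. Those addresses and the port are placeholders: substitute the endpoints your launch command actually uses. The check is only meaningful while something is listening on the target port, for instance while the job on the master node is starting up.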