Traceback (most recent call last):
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/finetune.py", line 13, in <module>
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/train.py", line 402, in main
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddlenlp/trainer/argparser.py", line 233, in parse_args_into_dataclasses
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddlenlp/trainer/argparser.py", line 243, in common_parse
File "<string>", line 108, in __init__
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddlenlp/trainer/training_args.py", line 1223, in __post_init__
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/fleet/fleet.py", line 340, in init
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/fleet/fleet.py", line 727, in _init_hybrid_parallel_env
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/fleet/base/topology.py", line 218, in __init__
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/communication/all_reduce.py", line 89, in all_reduce
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/communication/stream/all_reduce.py", line 157, in all_reduce
File "/tmp/python-task-996883138/pyenc_process_dir/ernie-bot/paddle/distributed/communication/stream/all_reduce.py", line 51, in _all_reduce_in_dygraph
ValueError: (InvalidArgument) TCP send error. Details: Broken pipe.
[Hint: Expected byte_sent > 0, but received byte_sent:-1 <= 0:0.] (at /root/paddlejob/workspace/env_run/Paddle/paddle/phi/core/distributed/store/tcp_utils.h:83)
Describe the Bug
Error message: we hit this error when training an XPU job over the container network.
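In the container network environment, the Ethernet NIC eth0 and the RoCE NICs xgbe2, xgbe3, xgbe4, and xgbe5 all pass the ping and ib_send_bw tests, and training also runs normally in host network mode. It only fails under the container network.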
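Since the failure is raised from tcp_utils.h (the TCP store), and ping and ib_send_bw do not necessarily exercise the same TCP ports, one more check that might help narrow this down is a plain TCP connect from inside each container to the trainer endpoints. Below is a minimal sketch, not part of Paddle: `tcp_reachable` and `parse_endpoints` are hypothetical helper names, and the `ip:port,ip:port` endpoint format is an assumption about how the job's endpoints are listed.

```python
import socket
import sys
from typing import List, Tuple


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def parse_endpoints(value: str) -> List[Tuple[str, int]]:
    """Split an assumed 'ip:port,ip:port' string into (host, port) pairs."""
    pairs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.rpartition(":")
        pairs.append((host, int(port)))
    return pairs


if __name__ == "__main__":
    # Endpoints are passed on the command line; they are placeholders here.
    endpoints = parse_endpoints(sys.argv[1] if len(sys.argv) > 1 else "")
    for host, port in endpoints:
        state = "reachable" if tcp_reachable(host, port) else "NOT reachable"
        print(f"{host}:{port} {state}")
```

Run it inside each container while the job's master process is listening, passing the endpoint list as the single argument. If an endpoint is reachable in host network mode but not in container network mode, that would point at port mapping or firewall rules in the container network rather than at the NICs themselves.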
Additional Supplementary Information
No response