PaddlePaddle / Paddle

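PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 PaddlePaddle: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)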
http://www.paddlepaddle.org/
Apache License 2.0

NCCL error(6) during multi-node training, although PyTorch trains across nodes normally in the same environment #63348

Open KrisZHH opened 7 months ago

KrisZHH commented 7 months ago

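Describe the Bug

Multi-node training fails with NCCL error(6), although PyTorch runs multi-node training normally in the same environment.

Environment: containers on 2 machines, with intranet IPs 223.0.15.19 and 223.0.15.22. Paddle version: 2.4.2

The multi-node script demo.py is as follows: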

import os
import paddle
import paddle.distributed as dist
# os.environ["NCCL_SOCKET_IFNAME"]="ens"  # the error occurs whether or not this environment variable is set
dist.init_parallel_env()
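
The commented-out NCCL_SOCKET_IFNAME line in the demo passes "ens", which NCCL treats by default as an interface-name prefix. As a rough sketch for checking which local interfaces such a prefix would select, the following uses only the Python standard library. The helper names matching_interfaces and local_interface_names are hypothetical and are not part of Paddle or NCCL.

```python
import socket


def matching_interfaces(names, prefix):
    """Return the interface names that start with the given prefix.

    This mirrors the default prefix matching of NCCL_SOCKET_IFNAME;
    it is an illustration only, not NCCL's own code.
    """
    return [name for name in names if name.startswith(prefix)]


def local_interface_names():
    """List the network interface names known to the operating system."""
    return [name for _, name in socket.if_nameindex()]


if __name__ == "__main__":
    try:
        names = local_interface_names()
    except (AttributeError, OSError):
        # if_nameindex() may be unavailable or fail on some platforms.
        names = []
    print("all interfaces:", names)
    print("matching 'ens':", matching_interfaces(names, "ens"))
```

If the second list is empty on either machine, the prefix selects no interface there, and a different value would be needed on that machine.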

The following command was run on the two machines, one after the other:

python3 -m paddle.distributed.launch --ips 223.0.15.19,223.0.15.22 --gpus 0,1,2,3 demo.py

The error is as follows:

Traceback (most recent call last):
  File "/root/demo.py", line 5, in <module>
    dist.init_parallel_env()
  File "/usr/local/python3916/lib/python3.9/site-packages/paddle/distributed/parallel.py", line 297, in init_parallel_env
    paddle.distributed.barrier(group=group)
  File "/usr/local/python3916/lib/python3.9/site-packages/paddle/distributed/collective.py", line 280, in barrier
    task = group.process_group.barrier()
OSError: (External) NCCL error(6), remote process exited or there was a network error.
  [Hint: Please search for the error code(6) on website (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#ncclresult-t) to get Nvidia's official solution and advice about NCCL Error.] (at /paddle/paddle/fluid/distributed/collective/ProcessGroupNCCL.cc:247)

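Since PyTorch runs multi-node training normally in the same environment with the same IPs, this is surely not a problem with the environment or with NCCL. Is there a bug in Paddle's multi-node support, or is the Paddle launch command wrong? Could the Paddle experts please help with this? Many thanks~

Additional Supplementary Information

paddle.utils.run_check() returns normal results on both machines, and single-node multi-GPU Paddle jobs also run fine on both servers.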

Franklinyung commented 3 months ago

I ran into the same problem.

KrisZHH commented 3 months ago

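> I ran into the same problem.

Check whether /etc/hosts contains the machine's real IP address. The function Paddle uses to get the host IP relies on the /etc/hosts file (this is really dumb; why not do it the way PyTorch does?), but on many servers /etc/hosts by default contains only 127.0.0.1 localhost, so Paddle cannot obtain the correct IP address (because what it gets is 127.0.0.1). Changing the entry in /etc/hosts to the machine's own IP fixes it.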
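
A quick way to test this diagnosis on each machine is to look at what the local hostname resolves to. The sketch below assumes that the lookup described above is equivalent to socket.gethostbyname(socket.gethostname()); the thread does not confirm this, and the helper names are hypothetical.

```python
import ipaddress
import socket


def is_loopback(ip: str) -> bool:
    """Return True if ip is a loopback address such as 127.0.0.1."""
    return ipaddress.ip_address(ip).is_loopback


def resolve_local_ip() -> str:
    """Resolve this machine's hostname; /etc/hosts is normally consulted first.

    Assumption: this approximates the lookup described in the comment above.
    """
    return socket.gethostbyname(socket.gethostname())


if __name__ == "__main__":
    try:
        ip = resolve_local_ip()
    except OSError as exc:
        # socket.gaierror is a subclass of OSError.
        print(f"could not resolve local hostname: {exc}")
    else:
        if is_loopback(ip):
            print(f"hostname resolves to loopback ({ip}); check /etc/hosts")
        else:
            print(f"hostname resolves to {ip}")
```

If the script reports a loopback address on either node, the other node cannot reach it at that address, which would be consistent with the NCCL error(6) above.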