Traceback (most recent call last):
  File "/root/demo.py", line 5, in <module>
    dist.init_parallel_env()
  File "/usr/local/python3916/lib/python3.9/site-packages/paddle/distributed/parallel.py", line 297, in init_parallel_env
    paddle.distributed.barrier(group=group)
  File "/usr/local/python3916/lib/python3.9/site-packages/paddle/distributed/collective.py", line 280, in barrier
    task = group.process_group.barrier()
OSError: (External) NCCL error(6), remote process exited or there was a network error.
  [Hint: Please search for the error code(6) on website (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#ncclresult-t) to get Nvidia's official solution and advice about NCCL Error.] (at /paddle/paddle/fluid/distributed/collective/ProcessGroupNCCL.cc:247)
Describe the Bug
Multi-node training fails with an NCCL error(6), while PyTorch multi-node training runs normally in the same environment.
Environment: containers on two machines, internal IPs 223.0.15.19 and 223.0.15.22; Paddle version: 2.4.2.
The multi-node code in demo.py is as follows:
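(The original listing is not reproduced here; below is a minimal sketch consistent with the traceback above, where dist.init_parallel_env() is called at demo.py line 5. The all_reduce check and the tensor value are illustrative assumptions, not necessarily the original code.)

    import paddle
    import paddle.distributed as dist

    # Initialize the multi-node / multi-GPU process group (NCCL backend on GPU).
    dist.init_parallel_env()

    # Simple collective to verify cross-node communication (illustrative only).
    x = paddle.to_tensor([float(dist.get_rank())])
    dist.all_reduce(x)
    print("rank", dist.get_rank(), "all_reduce result:", x.numpy())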
The commands were run on the two machines one after the other, as follows:
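(The exact commands are not preserved in this report; an assumed typical launch for this setup, using paddle.distributed.launch with the --ips flag and assuming four GPUs per node, would be the same command on each node:)

    # on 223.0.15.19
    python -m paddle.distributed.launch --ips="223.0.15.19,223.0.15.22" --gpus="0,1,2,3" demo.py
    # on 223.0.15.22
    python -m paddle.distributed.launch --ips="223.0.15.19,223.0.15.22" --gpus="0,1,2,3" demo.py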
The error reported is the traceback shown at the top of this report.
Since PyTorch runs multi-node training normally in the same environment with the same IPs, the environment and NCCL themselves are unlikely to be the problem. Is there a bug in Paddle's multi-node support, or is the Paddle launch command wrong? Any help from the Paddle team would be greatly appreciated.
Additional Supplementary Information
Running paddle.utils.run_check() on both machines reports that everything is normal, and single-node multi-GPU Paddle jobs also run fine on both servers.
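If more detail would help, the job can be re-run with NCCL debug logging enabled. A sketch of that setup is below; NCCL_DEBUG and NCCL_SOCKET_IFNAME are standard NCCL environment variables, while the interface name eth0 is only an assumption for the container network.

    import os

    # Standard NCCL diagnostic variables; set before dist.init_parallel_env()
    # or export them in the environment of the launch command.
    os.environ.setdefault("NCCL_DEBUG", "INFO")          # verbose NCCL logs
    os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # assumed container NIC name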