PaddlePaddle / Paddle

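PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning & machine learning)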
http://www.paddlepaddle.org/
Apache License 2.0
22.05k stars, 5.54k forks

XPU multi-node distributed training fails with AttributeError: module 'paddle.base.libpaddle' has no attribute 'ProcessGroupBKCL' #66335

Closed MiltonZheng closed 1 month ago

MiltonZheng commented 1 month ago

Please ask your question

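Setup: two Phytium S2500 servers running Kylin V10, each with one Kunlun R200 card. Paddle was compiled and installed from the release/2.6 branch.

Following the instructions on the official site, I ran:

/usr/bin/python3 -m paddle.distributed.launch --enable_gpu_log=False --ips=10.10.5.10,10.10.5.11 test.py

The Python file: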

from paddle.distributed import fleet
fleet.init(is_collective=True)

Error output:

[root@node10 ~]# /usr/bin/python3 -m paddle.distributed.launch \
> --enable_gpu_log=False \
> --ips=10.10.5.10,10.10.5.11 \
> /root/zmh/CrossModal/test.py
XPURT /usr/local/lib64/python3.7/site-packages/paddle/base/../libs/libxpurt.so.1 loaded
LAUNCH INFO 2024-07-22 15:37:00,853 -----------  Configuration  ----------------------
LAUNCH INFO 2024-07-22 15:37:00,853 auto_parallel_config: None
LAUNCH INFO 2024-07-22 15:37:00,854 auto_tuner_json: None
LAUNCH INFO 2024-07-22 15:37:00,854 devices: None
LAUNCH INFO 2024-07-22 15:37:00,854 elastic_level: -1
LAUNCH INFO 2024-07-22 15:37:00,854 elastic_timeout: 30
LAUNCH INFO 2024-07-22 15:37:00,854 enable_gpu_log: 0
LAUNCH INFO 2024-07-22 15:37:00,854 gloo_port: 6767
LAUNCH INFO 2024-07-22 15:37:00,854 host: None
LAUNCH INFO 2024-07-22 15:37:00,854 ips: 10.10.5.10,10.10.5.11
LAUNCH INFO 2024-07-22 15:37:00,854 job_id: default
LAUNCH INFO 2024-07-22 15:37:00,855 legacy: False
LAUNCH INFO 2024-07-22 15:37:00,855 log_dir: log
LAUNCH INFO 2024-07-22 15:37:00,855 log_level: INFO
LAUNCH INFO 2024-07-22 15:37:00,855 log_overwrite: False
LAUNCH INFO 2024-07-22 15:37:00,855 master: None
LAUNCH INFO 2024-07-22 15:37:00,855 max_restart: 3
LAUNCH INFO 2024-07-22 15:37:00,855 nnodes: 1
LAUNCH INFO 2024-07-22 15:37:00,855 nproc_per_node: None
LAUNCH INFO 2024-07-22 15:37:00,855 rank: -1
LAUNCH INFO 2024-07-22 15:37:00,855 run_mode: collective
LAUNCH INFO 2024-07-22 15:37:00,855 server_num: None
LAUNCH INFO 2024-07-22 15:37:00,856 servers: 
LAUNCH INFO 2024-07-22 15:37:00,856 sort_ip: False
LAUNCH INFO 2024-07-22 15:37:00,856 start_port: 6070
LAUNCH INFO 2024-07-22 15:37:00,856 trainer_num: None
LAUNCH INFO 2024-07-22 15:37:00,856 trainers: 
LAUNCH INFO 2024-07-22 15:37:00,856 training_script: /root/zmh/CrossModal/test.py
LAUNCH INFO 2024-07-22 15:37:00,856 training_script_args: []
LAUNCH INFO 2024-07-22 15:37:00,856 with_gloo: 1
LAUNCH INFO 2024-07-22 15:37:00,856 --------------------------------------------------
LAUNCH INFO 2024-07-22 15:37:00,857 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2024-07-22 15:37:00,858 Run Pod: khfagj, replicas 1, status ready
LAUNCH INFO 2024-07-22 15:37:00,869 Watching Pod: khfagj, replicas 1, status running
XPURT /usr/local/lib64/python3.7/site-packages/paddle/base/../libs/libxpurt.so.1 loaded
[2024-07-22 15:37:05,067] [    INFO] distributed_strategy.py:214 - distributed strategy initialized
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_xpus', current_value='0', default_value='')
=======================================================================
I0722 15:37:05.070785 564269 tcp_utils.cc:181] The server starts to listen on IP_ANY:6070
I0722 15:37:05.071185 564269 tcp_utils.cc:130] Successfully connected to 10.10.5.10:6070
Traceback (most recent call last):
  File "/root/zmh/CrossModal/test.py", line 3, in <module>
    fleet.init(is_collective=True)
  File "/usr/local/lib64/python3.7/site-packages/paddle/distributed/fleet/fleet.py", line 287, in init
    paddle.distributed.init_parallel_env()
  File "/usr/local/lib64/python3.7/site-packages/paddle/distributed/parallel.py", line 1107, in init_parallel_env
    pg_options=None,
  File "/usr/local/lib64/python3.7/site-packages/paddle/distributed/collective.py", line 165, in _new_process_group_impl
    pg = core.ProcessGroupBKCL.create(store, rank, world_size, group_id)
AttributeError: module 'paddle.base.libpaddle' has no attribute 'ProcessGroupBKCL'
I0722 15:37:13.944754 564290 tcp_store.cc:289] receive shutdown event and so quit from MasterDaemon run loop
No XPU Memory Leak
LAUNCH INFO 2024-07-22 15:37:14,887 Pod failed
LAUNCH ERROR 2024-07-22 15:37:14,887 Container failed !!!

MiltonZheng commented 1 month ago

Resolved. The cause was that -DWITH_XPU_BKCL=on was not set when compiling. After recompiling with it enabled, everything works normally.
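
For anyone hitting the same error, it may help to confirm whether the installed build exposes ProcessGroupBKCL before launching a multi-node job. The following is a minimal sketch, not an official Paddle utility: the module path paddle.base.libpaddle and the attribute name are taken from the traceback above, while the helper names has_bkcl_process_group and check_installed_paddle are hypothetical and chosen only for illustration.

```python
import importlib


def has_bkcl_process_group(core_module) -> bool:
    """Return True if the given core module exposes ProcessGroupBKCL."""
    return hasattr(core_module, "ProcessGroupBKCL")


def check_installed_paddle() -> str:
    """Inspect the locally installed Paddle build and describe what was found.

    Assumption: the core extension module lives at paddle.base.libpaddle,
    as shown in the traceback; other Paddle versions may use another path.
    """
    try:
        core = importlib.import_module("paddle.base.libpaddle")
    except ImportError:
        return "paddle.base.libpaddle could not be imported in this environment"
    if has_bkcl_process_group(core):
        return "ProcessGroupBKCL is available in this build"
    return "ProcessGroupBKCL is missing; the build was likely made without -DWITH_XPU_BKCL=on"


if __name__ == "__main__":
    print(check_installed_paddle())
```

Running this on each node with the same interpreter used for the launch (here /usr/bin/python3) should show whether a rebuild is needed on that node.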