PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle (飞桨) core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Distributed CPU training hangs when Gloo is enabled #68308

Open welsonzhang opened 1 week ago

welsonzhang commented 1 week ago

Please ask your question

In distributed multi-node CPU training, the job hangs when Gloo is enabled (without Gloo it does not hang). What is Gloo actually used for here? The relevant code is as follows:

```python
# Imports reconstructed for context (the original snippet omitted them);
# module paths follow the Paddle 2.4 fleet API.
import os
import logging

import paddle.distributed.fleet as fleet
import paddle.distributed.fleet.base.role_maker as role_maker

logger = logging.getLogger(__name__)


class Main(object):
    def __init__(self, config):
        self.config = config

    def run(self):
        self.init_fleet()
        self.init_network()
        if fleet.is_server():
            self.run_server()
        elif fleet.is_worker():
            self.run_online_worker()
        logger.info("Run Success, Exit.")

    def init_fleet(self):
        # fleet.init()
        os.environ["PADDLE_WITH_GLOO"] = "1"
        role = role_maker.PaddleCloudRoleMaker()
        fleet.init(role)
```

Worker log: server not ready, wait 3 sec to retry... not ready endpoints:['10.60.174.62:43747']

Server log: fl-ps > coordinator address is null! Gloo init with HTTP: need_init_all: False, args: {'http.host': '10.60.174.62', 'http.port': '52503', 'store.prefix': '', 'start_http_server': True, 'http_server_d': <DictProxy object, typeid 'dict' at 0x7fda1adf2160>} to start http_server worker_key:_worker, size: {'_worker': 10} start http_server: 52503, {'_worker': 10}
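The server log shows Gloo initializing through an HTTP key-value store: one process starts a small HTTP server, and every rank registers itself and polls until all expected ranks appear. The sketch below is a simplified, hypothetical stand-in for that rendezvous (it is not Paddle's actual Gloo code); it illustrates why a host/port mismatch shows up as a hang: the poll never sees the missing ranks.

```python
# Simplified, hypothetical HTTP-store rendezvous (not Paddle's real Gloo
# code): each rank PUTs a key to a shared HTTP server, then polls with
# GET until every expected rank is present.
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

store = {}                  # key -> value, shared by all requests
lock = threading.Lock()

class StoreHandler(BaseHTTPRequestHandler):
    def do_PUT(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        with lock:
            store[self.path] = body
        self.send_response(200)
        self.end_headers()

    def do_GET(self):
        with lock:
            val = store.get(self.path)
        self.send_response(200 if val is not None else 404)
        self.end_headers()
        if val is not None:
            self.wfile.write(val)

    def log_message(self, *args):  # keep the demo output quiet
        pass

def rendezvous(rank, world_size, port):
    base = "http://127.0.0.1:%d" % port
    req = urllib.request.Request("%s/rank%d" % (base, rank),
                                 data=b"ready", method="PUT")
    urllib.request.urlopen(req)
    # Poll until all ranks have registered. If one rank had used a
    # different host/port, this loop would spin forever, i.e. a "hang".
    while True:
        try:
            for r in range(world_size):
                urllib.request.urlopen("%s/rank%d" % (base, r))
            return
        except urllib.error.HTTPError:
            time.sleep(0.01)

server = HTTPServer(("127.0.0.1", 0), StoreHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

threads = [threading.Thread(target=rendezvous, args=(r, 3, port))
           for r in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("ranks registered:", sorted(store))
```

Once every rank reaches the same store, all polls succeed and the joins return; in the reported setup the PS and workers disagreed on the endpoint, so this step never completed.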

welsonzhang commented 5 days ago

Launch command: /data/miniconda3/envs/py36/bin/python3.6 -m paddle.distributed.launch --server_num=1 --worker_num=10 --servers=10.62.88.103:4425 --workers=10.62.86.154:4426,10.62.70.134:4426,10.60.150.6:4426,10.62.64.49:4426,10.60.172.0:4426,10.60.147.70:4426,10.62.81.110:4426,10.62.98.74:4426,10.60.163.132:4426,10.62.77.233:4426 /usr/local/train.py

welsonzhang commented 5 days ago

Paddle version: 2.4

welsonzhang commented 5 days ago

Solved. The key is to set these variables manually on every node so the PS and workers agree; otherwise each process picks a random port and the PS and workers end up misaligned:

export PADDLE_WITH_GLOO=1
export FLAGS_START_PORT=5678
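The same fix can also be applied from Python before `fleet.init()` runs, matching the style of the snippet in the question. A minimal sketch (the helper name is hypothetical, and 5678 is just the example port from above; the value must match on every node):

```python
# Sketch: pin the Gloo flag and the starting port via environment
# variables, so the PS and every worker derive the same endpoints
# instead of each process picking a random free port.
import os

def set_gloo_env(start_port=5678):
    # Hypothetical helper; call it on every process before fleet.init().
    os.environ["PADDLE_WITH_GLOO"] = "1"
    os.environ["FLAGS_START_PORT"] = str(start_port)
    return dict(PADDLE_WITH_GLOO=os.environ["PADDLE_WITH_GLOO"],
                FLAGS_START_PORT=os.environ["FLAGS_START_PORT"])

print(set_gloo_env())
# → {'PADDLE_WITH_GLOO': '1', 'FLAGS_START_PORT': '5678'}
```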