Open welsonzhang opened 1 week ago
Launch command: /data/miniconda3/envs/py36/bin/python3.6 -m paddle.distributed.launch --server_num=1 --worker_num=10 --servers=10.62.88.103:4425 --workers=10.62.86.154:4426,10.62.70.134:4426,10.60.150.6:4426,10.62.64.49:4426,10.60.172.0:4426,10.60.147.70:4426,10.62.81.110:4426,10.62.98.74:4426,10.60.163.132:4426,10.62.77.233:4426 /usr/local/train.py
Version: 2.4
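Resolved. The variables below have to be set manually so that the two sides agree. Otherwise the values are picked at random, and the PS and the workers fail to line up.
export PADDLE_WITH_GLOO=1
export FLAGS_START_PORT=5678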
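For anyone launching from a wrapper script, the fix above can be sketched in Python as follows. This is only an illustration: the helper names `build_launch_env` and `build_launch_cmd` are hypothetical, the two variable values are the ones from this issue, and it assumes the same environment has to be applied on the server and on every worker.

```python
import os
import sys

# Values taken from the resolution in this issue. Assumption: the server
# and all workers need identical settings so they agree on the gloo ports.
GLOO_ENV = {
    "PADDLE_WITH_GLOO": "1",
    "FLAGS_START_PORT": "5678",
}


def build_launch_env(base_env=None):
    """Return a copy of the environment with the gloo settings pinned."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(GLOO_ENV)
    return env


def build_launch_cmd(servers, workers, script):
    """Assemble the same launch command line that this issue uses."""
    return [
        sys.executable, "-m", "paddle.distributed.launch",
        f"--server_num={len(servers)}",
        f"--worker_num={len(workers)}",
        "--servers=" + ",".join(servers),
        "--workers=" + ",".join(workers),
        script,
    ]
```

The result can be handed to `subprocess.run(cmd, env=env)` on each machine. Building the environment explicitly, rather than relying on the calling shell, makes it harder for one node to start without the pinned values.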
Please ask your question
Distributed multi-machine CPU training hangs at startup when gloo is enabled (it does not hang when gloo is disabled). What is gloo actually used for here? Details below:
Worker log: server not ready, wait 3 sec to retry... not ready endpoints:['10.60.174.62:43747']
Server log: fl-ps > coordinator address is null! Gloo init with HTTP: need_init_all: False, args: {'http.host': '10.60.174.62', 'http.port': '52503', 'store.prefix': '', 'start_http_server': True, 'http_server_d': <DictProxy object, typeid 'dict' at 0x7fda1adf2160>} to start http_server worker_key:_worker, size: {'_worker': 10} start http_server: 52503, {'_worker': 10}
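When a worker keeps printing "server not ready", one quick check is whether the endpoint named in the log accepts TCP connections from the worker host at all. The sketch below uses only the Python standard library; `parse_endpoint` and `is_endpoint_reachable` are hypothetical helper names, not part of Paddle, and a successful connection only shows that something is listening on that port, not that the parameter server has finished initializing.

```python
import socket


def parse_endpoint(endpoint):
    """Split a 'host:port' string, as printed in the worker log."""
    host, _, port = endpoint.rpartition(":")
    return host, int(port)


def is_endpoint_reachable(endpoint, timeout=3.0):
    """Return True if a TCP connection to the endpoint can be opened."""
    host, port = parse_endpoint(endpoint)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

For example, running `is_endpoint_reachable("10.60.174.62:43747")` on a worker would report whether the endpoint from the log above is reachable from that machine.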