alibaba / FastNN

FastNN provides distributed training examples that use EPL.
Apache License 2.0

Distributed run of resnet_split.py across 2 servers hangs in an infinite wait #14

Open alphabewitch opened 1 year ago

alphabewitch commented 1 year ago

Environment: container from the nvcr.io/nvidia/tensorflow:21.12-tf1-py3 image
Code: FastNN/resnet/resnet_split.py
Launch commands:

```
# Server 1:
TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":0}}' bash scripts/train_split.sh
# Server 2:
TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":1}}' bash scripts/train_split.sh
```
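To rule out a malformed `TF_CONFIG` as the cause, a small sanity check like the one below (my own sketch, not part of FastNN) parses the variable the same way TensorFlow's cluster resolver would and prints what each process believes its role and peers are. Run it on both servers with their respective `TF_CONFIG` values; here the value is hard-coded for illustration.

```python
import json
import os

# Hypothetical sanity check: set TF_CONFIG as it is passed to server 1
# in the launch command above, then parse it and report this process's
# role and peer list.
os.environ["TF_CONFIG"] = (
    '{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},'
    '"task":{"type":"worker","index":0}}'
)

tf_config = json.loads(os.environ["TF_CONFIG"])
workers = tf_config["cluster"]["worker"]
task = tf_config["task"]

# Each process should see the SAME worker list and a UNIQUE index.
print("workers:", workers)
print("this process:", task["type"], task["index"], "->", workers[task["index"]])
```

If either server prints an unexpected index or a worker list that differs between the two machines, the hang would be expected, since both sides would wait for a peer that never matches.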

Output on server 1: [screenshot] Output on server 2: [screenshot]

As the screenshots show, server 1 printed only two "still waiting" messages and then stopped, which suggests it did receive server 2's response but never proceeded past that point. Additional note: in the same environment, BERT runs distributed training without issue, so the servers can connect to each other and run distributed jobs normally.
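Since BERT trains fine but the ResNet split job stalls after the "still waiting" messages, one quick way to confirm whether each worker's gRPC port is actually accepting connections is a plain TCP probe. The sketch below is my own diagnostic, unrelated to FastNN's code; it probes a local throwaway listener so the snippet is self-contained, but in practice you would point it at the real worker addresses (e.g. 172.20.21.189:55376 from server 1).

```python
import socket

def probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical local stand-in for a worker's gRPC server, so the
# example runs anywhere: bind an ephemeral port and listen on it.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
host, port = server.getsockname()

print(probe(host, port))        # port is listening -> True
server.close()
print(probe("127.0.0.1", 1))    # nothing listening -> False
```

If the probe succeeds in both directions between the two servers, the hang is more likely in the training script's synchronization logic than in basic network reachability, which would point toward a code or configuration issue rather than the launch commands.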

Is this a problem with how I'm launching the jobs, or does the code need to be modified?