Closed · annyhou closed this issue 2 years ago
Code:
Look at `time` and `data_time` in the log: the data-loading time is essentially unchanged, so the extra time must be coming from the network. In data-parallel training, NCCL communication happens only during gradient synchronization, so it looks like the inter-node network is too slow. Check which network interfaces the machines use to talk to each other and how fast that link is.
NIC settings:
Server 1: `export GLOO_SOCKET_IFNAME=eno1` `export NCCL_SOCKET_IFNAME=eno1`
Server 2: `export GLOO_SOCKET_IFNAME=eno2` `export NCCL_SOCKET_IFNAME=eno2`
The communication backend is NCCL:
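As a sanity check on the NIC setup: both machines need to export the interface that actually carries inter-node traffic. A minimal sketch (the interface name `eno1` is an assumption; list yours with `ip addr`):

```shell
# On BOTH machines, pick the interface that routes to the other node
# (the name below is an assumption; list yours with `ip addr`).
export GLOO_SOCKET_IFNAME=eno1
export NCCL_SOCKET_IFNAME=eno1

# Ask NCCL to log which NIC and transport it actually selects.
export NCCL_DEBUG=INFO
```

With `NCCL_DEBUG=INFO` set, the training log will include lines naming the interface NCCL chose, which confirms whether traffic really goes over the intended NIC rather than a slower one.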
The data sampler `TrainingSampler` uses GLOO here:
So is all of the communication going over GLOO?
Network communication speed:
The machines are on the same LAN, so it shouldn't be a bandwidth problem, right? Gradient communication uses NCCL, while data sampling and log synchronization use GLOO. So what is making multi-machine training this slow?
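"Same LAN" does not rule out bandwidth: a back-of-the-envelope estimate (a sketch; the parameter count and link speeds below are assumptions, not measurements from this setup) shows how easily a 1 Gb/s Ethernet link can dominate `run_step()`:

```python
def allreduce_time_s(param_count, bandwidth_gbps, world_size=2, bytes_per_param=4):
    """Estimate ring all-reduce time per iteration.

    Ring all-reduce moves about 2*(N-1)/N of the gradient bytes
    through each rank's link, where N is the world size.
    """
    grad_bytes = param_count * bytes_per_param
    traffic_bytes = 2 * (world_size - 1) / world_size * grad_bytes
    link_bytes_per_s = bandwidth_gbps * 1e9 / 8
    return traffic_bytes / link_bytes_per_s

# A ~25M-parameter backbone (fp32 gradients) synced over 1 Gb/s Ethernet:
t_1g = allreduce_time_s(25_000_000, 1.0)    # ~0.8 s of communication per step
# The same sync over a 10 Gb/s link:
t_10g = allreduce_time_s(25_000_000, 10.0)  # ~0.08 s per step
```

With per-step compute around 0.13 s (as in the single-machine log), even ~0.8 s of gradient traffic per iteration would make the multi-machine run many times slower, which is consistent with the symptom here; measuring the actual link speed (e.g. with `ethtool` or `iperf3`) would confirm it.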
Do you have NVLink?
Multi-machine training of yolov5/MMCV on the same servers is nowhere near this slow.
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Why does testing on a single GPU fail after training on multiple GPUs?
Hi! Problem description: under identical conditions, multi-machine single-GPU training is 66x slower than single-machine single-GPU training. I measured where the train function spends its time; below are the timing logs for one epoch.

Single machine, single GPU:

```
@@@@@self.before_train=2.8321053832769394e-05
=======self.before_epoch()=4.8844958655536175e-05
self.before_step()=3.951898543164134e-05
self.run_step()=1.4909071250003763
self.after_step()=0.0007456150487996638
=======self.after_step()=1.4916812470182776
self.before_step()=0.00012545095523819327
self.run_step()=0.12989151704823598
self.after_step()=0.001608320977538824
=======self.after_step()=1.6233065359992906
....
self.before_step()=6.265495903789997e-05
self.run_step()=0.12781515199458227
****self.after_step()=0.0010411710245534778
=======self.after_step()=4.622910676000174
[04/26 09:58:40 fastreid.utils.events]: eta: 0:01:01  epoch/iter: 0/24  total_loss: 0.6677  time: 0.1291  data_time: 0.0008  lr: 1.22e-03  max_mem: 5871M
=======self.after_epoch()=0.005098460998851806
@@@@@self.after_epoch()=4.628031279018614
```
Multiple machines, one GPU each:

```
@@@@@self.before_train=5.099800182506442e-05
=======self.before_epoch()=6.922002648934722e-05
self.before_step()=5.4769974667578936e-05
self.run_step()=10.933831950998865
self.after_step()=0.0009852820076048374
=======self.after_step()=10.934859095024876
self.before_step()=6.68829889036715e-05
self.run_step()=8.259477029030677
self.after_step()=0.0025403889594599605
=======self.after_step()=19.196943396003917
....
self.before_step()=8.85130139067769e-05
self.run_step()=8.541965569020249
****self.after_step()=0.0023366750101558864
=======self.after_step()=210.79022687504767
[04/26 09:48:33 fastreid.utils.events]: eta: 1:05:25  epoch/iter: 0/24  total_loss: 0.6671  time: 8.3279  data_time: 0.0019  lr: 1.22e-03  max_mem: 5440M
=======self.after_epoch()=0.0274219709681347
@@@@@self.after_epoch()=210.81768603197997
```
Question: nearly all of the time is spent in `run_step()`. How can the slow multi-machine training be fixed? Thanks 🙏
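The per-phase timings in the logs above can be reproduced with a small stdlib-only timer; the sketch below uses hypothetical stand-in functions, not fastreid's actual hook API:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, totals):
    """Accumulate wall-clock time spent under `label` into `totals`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[label] = totals.get(label, 0.0) + time.perf_counter() - start

def run_step():
    # Stand-in for forward/backward plus gradient synchronization.
    time.sleep(0.01)

totals = {}
for _ in range(5):
    with timed("run_step", totals):
        run_step()

print(f"run_step total: {totals['run_step']:.3f}s")
```

If `run_step()` alone accounts for nearly all the wall-clock time while `data_time` stays flat, as in the logs above, the gradient all-reduce (and therefore the inter-node network) is the main suspect.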