JDAI-CV / fast-reid

SOTA Re-identification Methods and Toolbox
Apache License 2.0

Multi-machine training is slow #656

Closed · annyhou closed this issue 2 years ago

annyhou commented 2 years ago

Hi! Problem description: under identical conditions, multi-machine single-GPU training is 66× slower than single-machine single-GPU training. I timed each stage of the train function; below is the timing log for one epoch.

Single machine, single GPU:

```
@@@@@self.before_train=2.8321053832769394e-05
=======self.before_epoch()=4.8844958655536175e-05
self.before_step()=3.951898543164134e-05
self.run_step()=1.4909071250003763
self.after_step()=0.0007456150487996638
=======self.after_step()=1.4916812470182776
self.before_step()=0.00012545095523819327
self.run_step()=0.12989151704823598
self.after_step()=0.001608320977538824
=======self.after_step()=1.6233065359992906
....
self.before_step()=6.265495903789997e-05
self.run_step()=0.12781515199458227
****self.after_step()=0.0010411710245534778
=======self.after_step()=4.622910676000174
[04/26 09:58:40 fastreid.utils.events]: eta: 0:01:01  epoch/iter: 0/24  total_loss: 0.6677  time: 0.1291  data_time: 0.0008  lr: 1.22e-03  max_mem: 5871M
=======self.after_epoch()=0.005098460998851806
@@@@@self.after_epoch()=4.628031279018614
```

Multiple machines, one GPU per machine:

```
@@@@@self.before_train=5.099800182506442e-05
=======self.before_epoch()=6.922002648934722e-05
self.before_step()=5.4769974667578936e-05
self.run_step()=10.933831950998865
self.after_step()=0.0009852820076048374
=======self.after_step()=10.934859095024876
self.before_step()=6.68829889036715e-05
self.run_step()=8.259477029030677
self.after_step()=0.0025403889594599605
=======self.after_step()=19.196943396003917
....
self.before_step()=8.85130139067769e-05
self.run_step()=8.541965569020249
****self.after_step()=0.0023366750101558864
=======self.after_step()=210.79022687504767
[04/26 09:48:33 fastreid.utils.events]: eta: 1:05:25  epoch/iter: 0/24  total_loss: 0.6671  time: 8.3279  data_time: 0.0019  lr: 1.22e-03  max_mem: 5440M
=======self.after_epoch()=0.0274219709681347
@@@@@self.after_epoch()=210.81768603197997
```

Question: nearly all of the time is spent in run_step(). How can I fix the slow multi-machine training? Thanks 🙏

annyhou commented 2 years ago

Code:

[screenshot: timing code]
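
Since the original code was posted only as a screenshot, here is a hedged reconstruction of what the timing instrumentation likely looked like. Method names are taken from the printed log and follow fastreid's TrainerBase-style epoch loop; the details are assumptions:

```python
# Sketch: wall-clock timing around each phase of the training loop.
import time

def train(self, start_epoch, max_epoch, iters_per_epoch):
    t = time.perf_counter()
    self.before_train()
    print(f"@@@@@self.before_train={time.perf_counter() - t}")
    for self.epoch in range(start_epoch, max_epoch):
        t_epoch = time.perf_counter()
        self.before_epoch()
        print(f"=======self.before_epoch()={time.perf_counter() - t_epoch}")
        for _ in range(iters_per_epoch):
            t = time.perf_counter()
            self.before_step()
            print(f"self.before_step()={time.perf_counter() - t}")
            t = time.perf_counter()
            self.run_step()  # forward + backward + gradient synchronization
            print(f"self.run_step()={time.perf_counter() - t}")
            t = time.perf_counter()
            self.after_step()
            print(f"****self.after_step()={time.perf_counter() - t}")
            # cumulative time since the epoch started
            print(f"=======self.after_step()={time.perf_counter() - t_epoch}")
        t = time.perf_counter()
        self.after_epoch()
        print(f"=======self.after_epoch()={time.perf_counter() - t}")
        print(f"@@@@@self.after_epoch()={time.perf_counter() - t_epoch}")
```
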
L1aoXingyu commented 2 years ago

Look at time and data_time in the log: the data-loading time is basically unchanged, so it must be the per-step network time that changed. In data-parallel multi-GPU training, NCCL communication happens only during gradient synchronization, so it looks like the network link between the machines is too slow. Check which network interfaces the machines use to communicate and how fast that link actually is.
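
One way to check this directly is to time a bare NCCL all-reduce between the two machines, which isolates the gradient-sync path from the rest of training. A minimal sketch, not fastreid code; the master address, port, and tensor size are illustrative assumptions:

```python
# NCCL all-reduce bandwidth probe; run once per host with --rank 0 / --rank 1.
import argparse
import os
import time

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--rank", type=int, required=True)
parser.add_argument("--world-size", type=int, default=2)
args = parser.parse_args()

os.environ.setdefault("MASTER_ADDR", "192.168.1.10")  # placeholder: server 1's LAN IP
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=args.rank, world_size=args.world_size)

# ~100 MB of fp32, roughly one ResNet-50's worth of gradients
tensor = torch.ones(25 * 1024 * 1024, device="cuda")
dist.all_reduce(tensor)  # warm-up
torch.cuda.synchronize()

iters = 10
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(tensor)
torch.cuda.synchronize()
per_call = (time.perf_counter() - start) / iters

if args.rank == 0:
    mb = tensor.numel() * 4 / 1e6
    print(f"all_reduce of {mb:.0f} MB: {per_call:.3f}s (~{mb / per_call / 1e3:.2f} GB/s)")
dist.destroy_process_group()
```

If this probe already takes several seconds per call, the slow steps are explained by the link itself rather than by anything in the training code.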

annyhou commented 2 years ago

Network interface settings:

```bash
# Server 1
export GLOO_SOCKET_IFNAME=eno1
export NCCL_SOCKET_IFNAME=eno1

# Server 2
export GLOO_SOCKET_IFNAME=eno2
export NCCL_SOCKET_IFNAME=eno2
```

The backend protocol is NCCL:

[screenshot: NCCL backend in init_process_group]

Data sampling with TrainingSampler, on the other hand, uses GLOO here:

[screenshot: TrainingSampler using the gloo group]

So is the whole communication path actually going through GLOO?
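
For reference, fastreid inherits detectron2-style comm utilities in which the two backends coexist: the main NCCL group carries the GPU gradient all-reduce, while a small auxiliary gloo group handles CPU-side collectives such as broadcasting the shared seed for TrainingSampler. A hedged sketch of that split (address, rank, and seed values are placeholders):

```python
# NCCL for heavy GPU traffic, gloo only for light CPU-side synchronization.
import torch
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",                          # gradients travel over this group
    init_method="tcp://192.168.1.10:29500",  # placeholder master address
    rank=0,
    world_size=2,
)
# Auxiliary gloo group for small CPU tensors (e.g. the shared sampling seed);
# it does not carry gradients.
gloo_group = dist.new_group(backend="gloo")

seed = torch.tensor([12345])                 # placeholder seed value
dist.broadcast(seed, src=0, group=gloo_group)
```

Since the gloo group only moves a few bytes per epoch, slow steps still point at the NCCL gradient all-reduce rather than at the sampler's gloo traffic.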

annyhou commented 2 years ago

Network communication speed:

[screenshot: measured network speed]

Both machines are on the same LAN, so it shouldn't be a bandwidth problem, right? NCCL is used for communication; gloo is used for data sampling and for synchronizing logs. So what is making multi-machine training this slow?
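
As a sanity check, a rough estimate of per-step gradient traffic shows that even a same-LAN link can dominate the step time if it is only 1 Gbit/s (all numbers below are illustrative assumptions, not measurements):

```python
# Back-of-envelope: gradient sync time over a 1 Gbit/s link.
params = 25e6                    # assume a ResNet-50-sized model
grad_bytes = params * 4          # fp32 gradients
link_bytes_per_s = 1e9 / 8       # 1 Gbit/s Ethernet
# A ring all-reduce moves roughly 2x the gradient volume over the link.
t_comm = 2 * grad_bytes / link_bytes_per_s
print(f"gradient sync per step ~ {t_comm:.2f} s")  # ~ 1.60 s
```

That alone would turn a 0.13 s single-machine step into a multi-second one, so it is still worth measuring the actual link speed and checking whether the NICs have negotiated below their rated rate.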

L1aoXingyu commented 2 years ago

Do you have NVLink?

annyhou commented 2 years ago

Multi-machine training with yolov5 / MMCV on the same servers is nowhere near this slow.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 2 years ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

yu9s commented 1 year ago

Why does a model trained with multiple GPUs report an error when tested on a single GPU?
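
One common cause, offered here as an assumption since no traceback is given: checkpoints saved under DistributedDataParallel prefix every state_dict key with "module.", which a plain single-GPU model then refuses to load. A minimal sketch of the usual workaround (the function name and checkpoint path are placeholders):

```python
# Strip the DDP "module." prefix before loading a checkpoint on a single GPU.
import torch
from torch import nn

def load_for_single_gpu(model: nn.Module, path: str) -> nn.Module:
    ckpt = torch.load(path, map_location="cpu")
    state_dict = ckpt.get("model", ckpt)  # checkpoints may nest weights under "model"
    state_dict = {
        (k[len("module."):] if k.startswith("module.") else k): v
        for k, v in state_dict.items()
    }
    model.load_state_dict(state_dict)
    return model
```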