NVIDIA / flownet2-pytorch

PyTorch implementation of FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks

About the Super Huge Training Cost #25

Closed PkuRainBow closed 6 years ago

PkuRainBow commented 6 years ago

My setup: Ubuntu 16.04, an 850 EVO SSD, and 2 × TITAN X (Pascal) GPUs.

The forward+backward cost per iteration is 1.5 s. It seems the default settings train for 10,000 epochs (each epoch is 2,859 iterations), so the estimated training cost is 10,000 × 2,859 × 1.5 s ≈ 4.3 × 10^7 s ≈ 12,000 hours ≈ 497 days.

Roughly speaking, each epoch still takes about an hour, even with an SSD and 2 × TITAN X (Pascal) GPUs.

That makes it almost impossible for me to retrain FlowNet2 from scratch!

It seems the original FlowNet trains for 1.7 × 10^6 iterations with the S_long schedule, while with your default settings it would train for 2.9 × 10^7 iterations.
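
For concreteness, the arithmetic (using the numbers above):

```python
# Back-of-the-envelope cost of the default schedule (numbers quoted above).
epochs = 10_000          # default total_num_epochs
iters_per_epoch = 2_859  # mini-batches per epoch on my setup
sec_per_iter = 1.5       # measured forward+backward time per iteration

total_sec = epochs * iters_per_epoch * sec_per_iter
print(f"{total_sec:.1e} s = {total_sec / 3600:,.0f} h = {total_sec / 86400:,.0f} days")
# -> 4.3e+07 s = 11,912 h = 496 days
```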

Could you help me check whether I am missing something? Also, could you give me advice on how to accelerate training?

Below are my training settings:

[screenshot of training settings]

PkuRainBow commented 6 years ago

@fitsumreda

fitsumreda commented 6 years ago

@PkuRainBow the total_num_epochs you are seeing is just the default placeholder, not the schedule you need to run.

To reproduce the paper results, you only need to run 600K (Short), 1.2M (Long), and 1.7M (Fine) mini-batch iterations (not epochs). So, with the 1.5 s/mini-batch you are getting, it would take about 10 days (1.5 × 6e5 / 3600 ≈ 250 hours) on a single GPU to finish Short.
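
As a quick sanity check (a rough estimate, assuming the same 1.5 s per mini-batch throughout):

```python
# Rough wall-clock estimates for the three schedules at 1.5 s per mini-batch.
sec_per_iter = 1.5
schedules = {"Short": 600_000, "Long": 1_200_000, "Fine": 1_700_000}

for name, iters in schedules.items():
    print(f"{name}: {iters * sec_per_iter / 86400:.1f} days")
# Short: 10.4 days, Long: 20.8 days, Fine: 29.5 days
```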

FlowNet-S, however, is much faster; it would take only about 24 hours on a single GPU.

If you want to reproduce FlowNet2 results, FlowNet-C needs to be trained only once. The remaining networks (FlowNet-Fusion and FlowNet-SD) don't use the correlation layer and can be trained efficiently.

There is still room for further optimization of the correlation kernel (which is the bottleneck), for example by using optimized linear algebra libraries.
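
For intuition about where the time goes: the correlation layer computes a channel-wise dot product for every displacement in a (2d+1) × (2d+1) window, so it can be sketched with standard tensor ops. This is only an illustrative pure-PyTorch version (function name, signature, and default max_disp are assumptions), not the CUDA kernel shipped in this repo:

```python
import torch
import torch.nn.functional as F

def correlation(f1, f2, max_disp=4):
    """Naive correlation layer over two feature maps of shape (B, C, H, W).

    Returns (B, (2*max_disp + 1)**2, H, W); out[:, k, y, x] is the mean
    channel-wise product of f1 at (y, x) and f2 at (y + dy, x + dx).
    """
    B, C, H, W = f1.shape
    # Zero-pad f2 so every displacement stays in bounds.
    f2p = F.pad(f2, [max_disp] * 4)
    out = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = f2p[:, :, dy:dy + H, dx:dx + W]
            out.append((f1 * shifted).mean(dim=1, keepdim=True))
    # The (2*max_disp + 1)**2 shifted products are what make this layer costly.
    return torch.cat(out, dim=1)
```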


PkuRainBow commented 6 years ago

@fitsumreda So, following your advice, 600K mini-batches means about 200 epochs over FlyingChairs, since every epoch takes about 3,000 mini-batches.

fitsumreda commented 6 years ago

right.

PkuRainBow commented 6 years ago

@fitsumreda Thanks for your advice! FlowNetS is really fast. Besides, it would be great if you could share whether your implementation can reproduce the FlowNetS numbers reported in the FlowNet (v1) paper.

According to my current estimate, it will take at most 4 hours to train FlowNetS for 200 epochs.

fitsumreda commented 6 years ago

It gets close on FlowNet-S, but it is not quite able to reproduce the paper results since the data augmentation part is missing; the networks are most likely overfitting.
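
For anyone adding the missing augmentation themselves: each transform must be applied consistently to both input images and the ground-truth flow. A minimal sketch of just a horizontal flip (function name and tensor layout are assumptions; the paper's pipeline also includes translation, rotation, scaling, and color changes):

```python
import torch

def random_horizontal_flip(img1, img2, flow, p=0.5):
    """Flip an image pair and its ground-truth flow together.

    img1, img2: (C, H, W) tensors; flow: (2, H, W) with channel 0 = u (horizontal).
    """
    if torch.rand(1).item() < p:
        img1 = torch.flip(img1, dims=[-1])
        img2 = torch.flip(img2, dims=[-1])
        flow = torch.flip(flow, dims=[-1])
        flow[0] = -flow[0]  # mirroring the frame reverses horizontal motion
    return img1, img2, flow
```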

MatthewD1993 commented 6 years ago

@fitsumreda Could you share how you train FlowNet2C? I trained it for about 10 epochs, but the results look poor and the EPE shows no decreasing trend, while FlowNet2S gives reasonable results after just one epoch. Thanks!