Closed: zhiyangxu-VT closed this issue 4 years ago
Hi, thanks for your interest! We set the batch size to 24 in the paper. I think the main reason is the smaller batch size you adopted. DataParallel or distributed shouldn't affect the final performance, only the time consumed per epoch. Hope this solves your problem :)
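For reference, here is a minimal sketch of the two multi-GPU options being compared. This is not the repo's actual train.py; `build_model()` is a hypothetical stand-in for the real Zooming Slow-Mo network, and only the wrapping of the model differs between the two paths.

```python
# Minimal sketch (assumptions: CUDA available, generic placeholder model).
import argparse
import torch
import torch.nn as nn


def build_model():
    return nn.Conv2d(3, 64, 3, padding=1)  # placeholder for the real network


parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

if args.local_rank >= 0:
    # One process per GPU, launched via:
    #   python -m torch.distributed.launch --nproc_per_node=2 this_script.py
    # Usually faster wall-clock time per epoch than DataParallel.
    torch.distributed.init_process_group(backend="nccl")
    torch.cuda.set_device(args.local_rank)
    model = nn.parallel.DistributedDataParallel(
        build_model().cuda(args.local_rank), device_ids=[args.local_rank]
    )
else:
    # Single-process DataParallel: same final accuracy in principle,
    # just slower per epoch.
    model = nn.DataParallel(build_model().cuda())
```

Either way, the effective batch size (per-iteration samples summed over all GPUs) is what matters for matching the reported results.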
Thank you for the information! One more question: after how many iterations does your model converge? I saw in the yaml file that the number of iterations is 600k, but 600k will take a long time to run. In your paper, you mentioned it took a day and a half to train the model.
Hi, the 600k iterations are split into 4 periods. The model converges within each period and achieves better results after finishing the next one, so in practice you can choose where to stop to balance training time against model performance. I didn't mention anything like "a day and a half" in the paper; it usually takes more than 2 days to finish each period.
Thanks for your patience and explanation, it really helps. I must have remembered the training time wrong, sorry about that. When transitioning from one period to the next, what do I need to change? Do I need to change the batch size, learning rate, or other params?
It's fine. I just edited train.yml, so you don't need to change anything if you follow the default settings given on GitHub. The training-period settings can be found here: https://github.com/Mukosame/Zooming-Slow-Mo-CVPR-2020/blob/master/codes/options/train/train_zsm.yml#L56
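As a rough illustration of why nothing needs to be changed by hand between periods: a restart-style learning-rate schedule covers all 600k iterations automatically, restarting at each period boundary. The sketch below uses PyTorch's built-in warm-restart scheduler rather than the repo's own scheduler, and the period length, learning rate, and eta_min are assumptions for illustration only; the authoritative values are in train_zsm.yml.

```python
# Illustrative sketch only; NOT the repo's actual scheduler or hyperparameters.
import torch

model = torch.nn.Conv2d(3, 64, 3, padding=1)            # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)  # assumed base LR

# Four restart periods of 150k iterations each = 600k total: the LR decays
# within each period, then restarts for the next one.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=150_000, T_mult=1, eta_min=1e-7
)

for it in range(600_000):
    # ... forward pass, loss.backward(), optimizer.step() would go here ...
    scheduler.step()
```

You can stop at the end of any period (every 150k iterations in this sketch) for a usable checkpoint, trading training time for quality.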
Thanks for your reply. This project is great work.
I'm pretty impressed by your work. I have some questions about the training phase. When I trained the model on 2 Nvidia 1080Ti GPUs with batch size 8 (following the training details in the paper), it seems to require more time to reach your results. I am wondering which way (DataParallel or torch.distributed.launch) you recommend using in the experiments.