Closed: zhiyangxu-VT closed this issue 4 years ago
Hi, thanks for your interest! We set the batch size to 24 in the paper. I think the main reason is the smaller batch size you adopted. DataParallel or distributed shouldn't affect the final performance, only the time consumed per epoch. Hope this solves your problem :)
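For reference, here is a minimal sketch of the two multi-GPU options being compared. This is not the repo's actual train.py; `build_model()` is a hypothetical stand-in for the real Zooming Slow-Mo network, and only the wrapping of the model differs between the two paths.

```python
# Minimal sketch (assumptions: CUDA available, generic placeholder model).
import argparse
import torch
import torch.nn as nn


def build_model():
    return nn.Conv2d(3, 64, 3, padding=1)  # placeholder for the real network


parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

if args.local_rank >= 0:
    # One process per GPU, launched via:
    #   python -m torch.distributed.launch --nproc_per_node=2 this_script.py
    # Usually faster wall-clock time per epoch than DataParallel.
    torch.distributed.init_process_group(backend="nccl")
    torch.cuda.set_device(args.local_rank)
    model = nn.parallel.DistributedDataParallel(
        build_model().cuda(args.local_rank), device_ids=[args.local_rank]
    )
else:
    # Single-process DataParallel: same final accuracy in principle,
    # just slower per epoch.
    model = nn.DataParallel(build_model().cuda())
```

Either way, the effective batch size (per-iteration samples summed over all GPUs) is what matters for matching the reported results.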
Thank you for the information! One more question: after how many iterations does your model converge? I saw in the yaml file that the number of iterations is 600k, but 600k will take a long time to run. In your paper, you mentioned it took a day and a half to train the model.
Hi, the 600k iterations are split into 4 periods. The model converges within each period and achieves better results after finishing the next one, so in practice you can choose where to stop to balance training time against model performance. I didn't mention anything like "a day and a half" in the paper; it usually takes more than 2 days to finish each period.
Thanks for your patience and explanation, it really helps. I must have remembered the training time wrong, sorry about that. When transitioning from one period to the next, what do I need to change? Do I need to change the batch size, learning rate, or other params?
It's fine. I just edited train.yml, so you don't need to change anything if you follow the default settings given on GitHub. The training-period settings can be found here: https://github.com/Mukosame/Zooming-Slow-Mo-CVPR-2020/blob/master/codes/options/train/train_zsm.yml#L56
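As a rough illustration of why nothing needs to be changed by hand between periods: a restart-style learning-rate schedule covers all 600k iterations automatically, restarting at each period boundary. The sketch below uses PyTorch's built-in warm-restart scheduler rather than the repo's own scheduler, and the period length, learning rate, and eta_min are assumptions for illustration only; the authoritative values are in train_zsm.yml.

```python
# Illustrative sketch only; NOT the repo's actual scheduler or hyperparameters.
import torch

model = torch.nn.Conv2d(3, 64, 3, padding=1)            # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)  # assumed base LR

# Four restart periods of 150k iterations each = 600k total: the LR decays
# within each period, then restarts for the next one.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=150_000, T_mult=1, eta_min=1e-7
)

for it in range(600_000):
    # ... forward pass, loss.backward(), optimizer.step() would go here ...
    scheduler.step()
```

You can stop at the end of any period (every 150k iterations in this sketch) for a usable checkpoint, trading training time for quality.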
Thanks for your reply. This project is great work.
I'm pretty impressed by your work. I have some questions about the training phase. When I trained the model on 2 Nvidia 1080Ti GPUs with batch size 8 (following the training details in the paper), it seems to require more time to reach your results. I am wondering which way (DataParallel or torch.distributed.launch) you recommend using in the experiments.