Hello! Batch normalization should certainly help accelerate training. However, I'm not sure whether ADAM will work better than SGD. If I recall correctly, getting the network to converge also requires a fairly large learning rate (around 0.001), and the weights should be initialized with Xavier initialization rather than Gaussian initialization. With these settings, you should expect convergence after a few thousand iterations. For the implementation in this repository, the network counts as converged when the loss oscillates between 0.04 and 0.07 (training to convergence typically takes one night).
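In case it helps, a rough PyTorch sketch of those settings might look like the following (the tiny placeholder network and the momentum value are only illustrative assumptions, not code from this repository):

```python
import torch
import torch.nn as nn

def init_weights(m):
    # Xavier instead of Gaussian initialization, as suggested above
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_normal_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# tiny placeholder network -- substitute your own architecture here
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 1, 3, padding=1),
)
model.apply(init_weights)

# plain SGD with a fairly large learning rate (~0.001); the momentum
# value is my own assumption, not something taken from this repo
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```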
Just as an extra FYI: although in this repository we have a pre-trained model trained for 137k iterations (the most mature model we had before the CVPR deadline ^^), I've been told by other people running the code that equivalent testing performance can be achieved with a converged model trained for only a few thousand iterations. So using models that have already converged but haven't been trained for as long as ours should still work fine.
Hi! Thanks very much for the quick answer.
I looked at marvin.hpp to see how Xavier initialization is implemented. The implementation is a little different from the original paper: in Marvin the scale is divided only by fan_in, while in the original paper it is divided by fan_in + fan_out. Is there any reason to use only fan_in, or is it just for computational simplicity?
Marvin's implementation of Xavier initialization was inspired by Caffe's original implementation. A blog post here offers a possible explanation for why there is no fan_out term. My guess is that it was simply easier to implement at the time, but it seems to work well.
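Roughly speaking, the difference comes down to the bound used for the uniform draw; as far as I remember Caffe's default behavior, it is something like this (the layer sizes below are made up purely for illustration):

```python
import math

fan_in, fan_out = 256, 512   # illustrative layer sizes, not from the repo

# Caffe-style "xavier" (which Marvin follows, as far as I recall):
# uniform in [-limit, limit] with the limit based on fan_in only
limit_caffe = math.sqrt(3.0 / fan_in)

# Glorot & Bengio's original formulation: limit based on fan_in + fan_out
limit_paper = math.sqrt(6.0 / (fan_in + fan_out))

print(limit_caffe, limit_paper)  # the two bounds differ only modestly
```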
Thanks a lot! The network has now converged. I am using PyTorch's implementation of Xavier initialization (which includes fan_out), and it seems to work.
@YifeiAI I have the same problem. Could you please share your PyTorch code? Thanks, my email: qinziwen@emails.bjut.edu.cn
Hi, thanks for publishing the code!
I am following your training procedure and trying to port your code to PyTorch, but I find that the network doesn't converge even after a long time. Could you share your learning-curve figure, or briefly describe after how many iterations I should expect convergence?
I also tried adding batch normalization layers after each conv layer and using ADAM instead of SGD. Do you think that helps accelerate training?
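Concretely, what I tried looks roughly like this (a simplified sketch; the channel sizes and learning rate are just placeholders, not values from your repo):

```python
import torch
import torch.nn as nn

# one block of the kind I added: conv -> batch norm -> ReLU
block = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)

# and ADAM instead of SGD (learning rate here is just PyTorch's default)
optimizer = torch.optim.Adam(block.parameters(), lr=1e-3)
```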
Thank you very much!