lukeyeager opened this issue 9 years ago (status: Open)
There is also discussion about this in https://github.com/BVLC/caffe/issues/430.
In theory, when you change the batch_size by a factor of X you should scale the base_lr by a factor of sqrt(X), but Alex used a linear factor of X in practice (see http://arxiv.org/abs/1404.5997).
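For concreteness, here is a minimal sketch of the two rescaling rules being compared (the helper name and example values are illustrative, not part of DIGITS or Caffe):

```python
def rescale_lr(base_lr, old_batch_size, new_batch_size, rule="linear"):
    """Rescale a learning rate when the batch size changes.

    rule="sqrt"   -> multiply base_lr by sqrt(new/old), the "theory" rule
                     meant to keep gradient variance roughly constant
    rule="linear" -> multiply base_lr by new/old, the factor Krizhevsky
                     used in practice (arXiv:1404.5997)
    """
    factor = new_batch_size / old_batch_size
    if rule == "sqrt":
        return base_lr * factor ** 0.5
    return base_lr * factor

# Halving the batch size (X = 2):
print(rescale_lr(0.01, 256, 128, rule="sqrt"))    # ~0.00707
print(rescale_lr(0.01, 256, 128, rule="linear"))  # 0.005
```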
@lukeyeager @mrgloom Is this still relevant given the more recent paper https://arxiv.org/abs/1706.02677, which says we should use linear scaling, i.e. multiply base_lr by X when batch_size changes by a factor of X?
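Goyal et al. (arXiv:1706.02677) pair the linear scaling rule with a gradual warmup of the learning rate over the first few epochs. A hedged sketch of that schedule follows; the function name and the reference batch size of 256 are assumptions for illustration, though the five-epoch warmup matches the paper:

```python
def lr_at_epoch(epoch, base_lr, batch_size, ref_batch_size=256, warmup_epochs=5):
    """Linear scaling rule with gradual warmup (after Goyal et al. 2017).

    The target learning rate is base_lr * batch_size / ref_batch_size;
    during the first warmup_epochs it ramps linearly from base_lr up to
    that target, then stays there (before any later decay schedule).
    """
    target_lr = base_lr * batch_size / ref_batch_size
    if epoch < warmup_epochs:
        return base_lr + (target_lr - base_lr) * epoch / warmup_epochs
    return target_lr

# 8x larger batch -> 8x larger target lr, reached after 5 epochs of warmup.
for e in range(7):
    print(e, round(lr_at_epoch(e, 0.1, 2048), 3))
```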
See discussion in #44.
As Alex Krizhevsky explains in his paper One weird trick for parallelizing convolutional neural networks, the learning rate, momentum and weight decay all depend on the batch size (see section 5, page 5). It would be nice if DIGITS handled these calculations for you automatically so that you don't have to worry about them.
The issue is that different networks have different default learning rates and batch sizes. Is there a standard equation that fits all networks?
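One possible answer is that the equation only needs to be relative: treat each network's shipped (base_lr, batch_size) pair as the reference point and apply the chosen scaling rule to the user's batch size. The sketch below is hypothetical (the network names, default values, and dictionary are illustrative; real defaults would come from each network's solver definition), not an existing DIGITS feature:

```python
# Hypothetical per-network defaults; real values would be read from the
# solver/prototxt files that ship with each standard network.
NETWORK_DEFAULTS = {
    "alexnet":   {"base_lr": 0.01,  "batch_size": 128},
    "googlenet": {"base_lr": 0.005, "batch_size": 32},
}

def suggested_base_lr(network, user_batch_size, rule="linear"):
    """Suggest a base_lr for a user-chosen batch size by scaling the
    network's default (base_lr, batch_size) pair with the chosen rule."""
    defaults = NETWORK_DEFAULTS[network]
    factor = user_batch_size / defaults["batch_size"]
    if rule == "sqrt":
        factor = factor ** 0.5
    return defaults["base_lr"] * factor

print(suggested_base_lr("alexnet", 256))          # 0.02  (linear rule)
print(suggested_base_lr("alexnet", 256, "sqrt"))  # ~0.0141
```

This sidesteps the "different defaults" problem because the rule never assumes a universal starting learning rate, only a per-network one; how momentum and weight decay should be adjusted alongside it is not covered here.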