SGD can use two simple improvements:

Minibatch sampling

Minibatching can help control the tradeoff between convergence and training time (I may be wrong about the convergence claim).
I'm not certain what an optimized implementation would look like; however, here's a sketch of a naive implementation given a matrix of feature vectors `X` and a vector of labels `y`.
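A minimal sketch in Python/NumPy. The gradient callable `grad`, the default hyperparameters, and the squared-error example below are assumptions for illustration, not part of the original notes.

```python
import numpy as np

def sample_without_replacement(n, batch_size, rng):
    # Naive sampler: draw a random subset of row indices,
    # without replacement within a single minibatch.
    return rng.choice(n, size=batch_size, replace=False)

def minibatch_sgd(X, y, grad, alpha=0.1, batch_size=32, epochs=10, seed=0):
    """Naive minibatch SGD.

    X    : (n_samples, n_features) matrix of feature vectors
    y    : (n_samples,) vector of labels
    grad : callable(w, X_batch, y_batch) -> gradient of the loss w.r.t. w
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for _ in range(n // batch_size):
            idx = sample_without_replacement(n, batch_size, rng)
            w -= alpha * grad(w, X[idx], y[idx])
    return w
```

For example, with a mean squared-error loss the gradient would be `lambda w, Xb, yb: Xb.T @ (Xb @ w - yb) / len(yb)` (again, an assumed loss, purely for illustration).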
Some changes to consider:

- A fixed minibatch size vs. a minibatch sized as a fraction of the dataset?
- The minibatch sampling algorithm `sample_without_replacement`: could this be implemented more efficiently? Perhaps a sliding window over the dataset would be cheaper (see the sketch after this list). However, the minibatches will not vary per epoch in that scenario.
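A sketch of the sliding-window alternative. I've assumed the window advances by `batch_size` each step and drops a ragged final batch; both details are my choice, not from the original notes.

```python
def sliding_window_batches(X, y, batch_size):
    # Deterministic sliding window: every epoch yields the same
    # contiguous slices, so minibatches do not vary per epoch.
    n = X.shape[0]
    for start in range(0, n - batch_size + 1, batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]
```

One possible compromise is to shuffle the row order once per epoch and then slide the window over the shuffled rows: the minibatches vary per epoch again, at the cost of a single permutation.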
Scheduling
Set the learning rate `alpha` to decay in proportion to the inverse square root of the current iteration, i.e. `alpha / sqrt(iteration)`. To prevent division by zero, count iterations starting from 1 so that `iteration > 0` always holds.
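As a one-line sketch (the base rate `alpha0` and the 1-based iteration count are assumptions):

```python
def scheduled_alpha(alpha0, iteration):
    # Decay the learning rate as alpha0 / sqrt(iteration).
    # iteration is 1-based, so the divisor is never zero.
    assert iteration > 0
    return alpha0 / (iteration ** 0.5)
```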