SGD can use two simple improvements:

Minibatch sampling

Minibatching can help control the tradeoff between convergence and training time (I may be wrong about the convergence claim).
I'm not certain what an optimized implementation would look like; however, here's a sketch of a naive implementation given a matrix of feature vectors `X` and a vector of labels `y`.
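A minimal sketch in Python/NumPy. The gradient callable `grad`, the default hyperparameters, and the squared-error example below are assumptions for illustration, not part of the original notes.

```python
import numpy as np

def sample_without_replacement(n, batch_size, rng):
    # Naive sampler: draw a random subset of row indices,
    # without replacement within a single minibatch.
    return rng.choice(n, size=batch_size, replace=False)

def minibatch_sgd(X, y, grad, alpha=0.1, batch_size=32, epochs=10, seed=0):
    """Naive minibatch SGD.

    X    : (n_samples, n_features) matrix of feature vectors
    y    : (n_samples,) vector of labels
    grad : callable(w, X_batch, y_batch) -> gradient of the loss w.r.t. w
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for _ in range(n // batch_size):
            idx = sample_without_replacement(n, batch_size, rng)
            w -= alpha * grad(w, X[idx], y[idx])
    return w
```

For example, with a mean squared-error loss the gradient would be `lambda w, Xb, yb: Xb.T @ (Xb @ w - yb) / len(yb)` (again, an assumed loss, purely for illustration).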
Some changes to consider:

- A fixed minibatch size vs. a minibatch sized as a fraction of the dataset?
- The minibatch sampling algorithm `sample_without_replacement`: could this be implemented more efficiently? Perhaps a sliding window over the dataset would be cheaper (see the sketch after this list). However, the minibatches will not vary per epoch in that scenario.
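A sketch of the sliding-window alternative. I've assumed the window advances by `batch_size` each step and drops a ragged final batch; both details are my choice, not from the original notes.

```python
def sliding_window_batches(X, y, batch_size):
    # Deterministic sliding window: every epoch yields the same
    # contiguous slices, so minibatches do not vary per epoch.
    n = X.shape[0]
    for start in range(0, n - batch_size + 1, batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]
```

One possible compromise is to shuffle the row order once per epoch and then slide the window over the shuffled rows: the minibatches vary per epoch again, at the cost of a single permutation.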
Scheduling
Set the learning rate `alpha` to decay in proportion to the inverse square root of the current iteration, i.e. `alpha / sqrt(iteration)`. To prevent division by zero, count iterations starting from 1 so that `iteration > 0` always holds.
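As a one-line sketch (the base rate `alpha0` and the 1-based iteration count are assumptions):

```python
def scheduled_alpha(alpha0, iteration):
    # Decay the learning rate as alpha0 / sqrt(iteration).
    # iteration is 1-based, so the divisor is never zero.
    assert iteration > 0
    return alpha0 / (iteration ** 0.5)
```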