Closed by danijar 8 years ago
Randall & Martinez, 2013 imply that both are common. While the sum is mathematically more correct, the average is more practical: averaging is equivalent to summing with the learning rate divided by the batch size, which makes batch gradient descent more stable.
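To make the equivalence concrete, here is a minimal sketch in plain Python. It is not the implementation discussed in this issue: the names `per_example_gradients`, `sgd_step`, and `reduction` are hypothetical, and the model is an assumed one-parameter least-squares fit chosen only to keep the arithmetic easy to follow.

```python
def per_example_gradients(w, xs, ys):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w, one per example."""
    return [(w * x - y) * x for x, y in zip(xs, ys)]


def sgd_step(w, xs, ys, learning_rate, reduction="mean"):
    """Take one gradient descent step on a batch.

    reduction: "mean" averages the per-example gradients, "sum" adds them up.
    With "sum", the effective step grows with the batch size, so the learning
    rate has to shrink accordingly (hypothetical parameter, for illustration).
    """
    grads = per_example_gradients(w, xs, ys)
    total = sum(grads)
    if reduction == "mean":
        total /= len(grads)
    elif reduction != "sum":
        raise ValueError("reduction must be 'mean' or 'sum'")
    return w - learning_rate * total


# Toy batch generated by y = 2 * x, so the optimum is w = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w0 = 0.5
lr = 0.1

w_mean = sgd_step(w0, xs, ys, lr, reduction="mean")
w_sum_scaled = sgd_step(w0, xs, ys, lr / len(xs), reduction="sum")
w_sum = sgd_step(w0, xs, ys, lr, reduction="sum")

print(w_mean, w_sum_scaled, w_sum)
```

In this toy batch the averaged step and the summed step with the learning rate divided by the batch size land on the same weight, while the summed step at the unscaled learning rate overshoots the optimum. Read the other way around, an averaged gradient needs a learning rate that is larger by a factor of the batch size to match the step of a summed one.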
Just keep the implementation as it is and add a docstring describing what's used.
Currently, large batch sizes require a large learning rate. Maybe that's because I average the gradients. Find out whether gradients over a batch are usually summed or averaged.