Open stqc opened 4 years ago
I think this helps stabilize the weight updates. Stochastic gradient descent can sometimes step in the wrong optimization direction, which makes training noisy. Averaging with the recent weights can alleviate this problem.
But I cannot find any paper to support this idea.
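To make the intuition concrete, here is a minimal toy sketch (my own illustration, not taken from this repository or any paper). It assumes a 1-D quadratic loss `0.5 * w**2` with Gaussian noise added to the gradient, and the names `ema_update` and `run_sgd_with_ema`, as well as the `lr` and `decay` values, are hypothetical choices for the demo. It compares how far the raw weight and its exponential moving average stay from the optimum at 0.

```python
import random


def ema_update(avg, new, decay):
    """Blend the running average with the newest weight value."""
    return decay * avg + (1.0 - decay) * new


def run_sgd_with_ema(steps=5000, lr=0.1, decay=0.99, seed=0, tail=2000):
    """Run noisy SGD on 0.5 * w**2 and track an EMA of the weight.

    Returns the mean squared distance from the optimum (w = 0) over the
    last `tail` steps, for the raw weight and for the averaged weight.
    """
    rng = random.Random(seed)
    w = 5.0        # arbitrary starting point (assumption for the demo)
    w_avg = w      # the moving average starts at the initial weight
    raw_sq, ema_sq = 0.0, 0.0
    for step in range(steps):
        # True gradient is w; the Gaussian term mimics mini-batch noise.
        grad = w + rng.gauss(0.0, 1.0)
        w -= lr * grad
        w_avg = ema_update(w_avg, w, decay)
        # Only measure after the initial transient has died out.
        if step >= steps - tail:
            raw_sq += w * w
            ema_sq += w_avg * w_avg
    return raw_sq / tail, ema_sq / tail


raw_mse, ema_mse = run_sgd_with_ema()
print(f"raw weight MSE: {raw_mse:.4f}, averaged weight MSE: {ema_mse:.4f}")
```

In this toy setting the averaged weight should sit noticeably closer to the optimum than the raw weight, because the average smooths out the noise from individual steps. Whether the same holds for a real network depends on the model and on the decay value.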
I found a comment under this issue that could be the answer:
Can you please explain how computing moving averages helps?