johann-petrak opened 6 years ago
With RNNs/LSTMs there seems to be no ready-made built-in way to do this correctly, because the clipping needs to happen at each time step during BPTT. This should probably be done using a hook on the relevant weight variable(s), so that the gradient is clipped whenever it is computed during BPTT.
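A minimal sketch of the hook idea, assuming PyTorch: `register_hook` on a parameter runs the given function on the gradient as soon as it is computed during the backward pass, so the stored gradient is already clamped. (Note that for a fused LSTM the hook fires once on the accumulated gradient, not separately per time step; the network shape and clamp range below are arbitrary choices for illustration.)

```python
import torch

# small LSTM purely for illustration
rnn = torch.nn.LSTM(input_size=4, hidden_size=8)

for p in rnn.parameters():
    # clamp each gradient component to [-1, 1] when the gradient is computed
    p.register_hook(lambda grad: grad.clamp(-1.0, 1.0))

x = torch.randn(5, 3, 4)  # (seq_len, batch, input_size)
out, _ = rnn(x)
out.sum().backward()      # all parameter gradients are now within [-1, 1]
```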
For all other weights, gradient clipping can be done either by clamping each component of the gradient to some fixed min/max value, or, better, by rescaling the gradient when a norm (L2, infinity norm) of the gradient vector exceeds a threshold value: g_new = (th / ||g||) * g_old if ||g|| > th.
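The rescaling rule above can be written out directly; this is a plain-Python sketch (function name and L2 norm choice are mine) that leaves the gradient untouched when its norm is at or below the threshold:

```python
import math

def clip_by_norm(grad, th):
    """Rescale grad so its L2 norm does not exceed th:
    g_new = (th / ||g||) * g_old  if ||g|| > th, else g unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > th:
        return [th / norm * g for g in grad]
    return grad

g = clip_by_norm([3.0, 4.0], 1.0)  # ||g|| = 5 > 1, so g is rescaled to norm 1
```

Rescaling by the norm preserves the gradient's direction, whereas per-component clamping can change it.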
For the vanishing gradient problem, Pascanu et al. 2012 suggest an approach based on a regularization term (https://arxiv.org/pdf/1211.5063.pdf).
This is especially important with RNNs such as LSTMs.