johann-petrak opened 6 years ago
With RNNs/LSTMs there seems to be no ready-made built-in way to do this correctly, because the clipping needs to happen at each time step during BPTT. This should probably be done using a hook on the relevant weight variable(s), so that the gradient is clipped whenever it is computed during BPTT.
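A minimal sketch of the hook idea, assuming PyTorch: `register_hook` on a parameter runs the given function on the gradient as soon as it is computed during the backward pass, so the stored gradient is already clamped. (Note that for a fused LSTM the hook fires once on the accumulated gradient, not separately per time step; the network shape and clamp range below are arbitrary choices for illustration.)

```python
import torch

# small LSTM purely for illustration
rnn = torch.nn.LSTM(input_size=4, hidden_size=8)

for p in rnn.parameters():
    # clamp each gradient component to [-1, 1] when the gradient is computed
    p.register_hook(lambda grad: grad.clamp(-1.0, 1.0))

x = torch.randn(5, 3, 4)  # (seq_len, batch, input_size)
out, _ = rnn(x)
out.sum().backward()      # all parameter gradients are now within [-1, 1]
```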
For all other weights, gradient clipping can be done either by clamping each component of the gradient to some fixed min/max value, or, better, by rescaling the gradient when a norm (L2, infinity norm) of the gradient vector exceeds a threshold value: g_new = (th / ||g||) * g_old if ||g|| > th.
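The rescaling rule above can be written out directly; this is a plain-Python sketch (function name and L2 norm choice are mine) that leaves the gradient untouched when its norm is at or below the threshold:

```python
import math

def clip_by_norm(grad, th):
    """Rescale grad so its L2 norm does not exceed th:
    g_new = (th / ||g||) * g_old  if ||g|| > th, else g unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > th:
        return [th / norm * g for g in grad]
    return grad

g = clip_by_norm([3.0, 4.0], 1.0)  # ||g|| = 5 > 1, so g is rescaled to norm 1
```

Rescaling by the norm preserves the gradient's direction, whereas per-component clamping can change it.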
For the vanishing gradient problem, Pascanu et al. 2012 suggest an approach based on a regularization term (https://arxiv.org/pdf/1211.5063.pdf).
This is especially important with RNNs such as LSTMs.