OverLordGoldDragon / see-rnn

RNN and general weights, gradients, & activations visualization in Keras & TensorFlow
MIT License

An idea/suggestion on the gradient used #57

Open DagonArises opened 2 years ago

DagonArises commented 2 years ago

I would like to inquire about the form of the gradients computed by the get_gradients function. It seems that dL_t/dW is computed directly, where W is a parameter. While the loss gradient with respect to a weight is simple enough in feedforward NNs, in RNNs the same weight is shared across all time steps, so each dL_t/dW is actually a sum of partial derivative products of lengths 1, 2, ..., t. Please see this tutorial for the exact form, in particular results (5) and (6).
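For concreteness, the decomposition I have in mind is the standard BPTT one for a vanilla RNN (my notation here may not match the tutorial's results (5) and (6) exactly):

$$
\frac{\partial L_t}{\partial W} \;=\; \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t}\,\frac{\partial h_t}{\partial h_k}\,\frac{\partial h_k}{\partial W},
\qquad
\frac{\partial h_t}{\partial h_k} \;=\; \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}},
$$

so each term in the sum contains a product of step-to-step Jacobians whose length grows with t − k; these are the "partial derivative products" I refer to below.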

The longer partial derivative products correspond to signals backpropagated over longer temporal dependencies. If those longer products vanish (and they are prone to vanishing), then the weights are updated in a way that 'cannot' retain earlier information.

It therefore occurs to me that even if dL_t/dW stays away from 0, that does not guarantee that vanishing gradients did not take place: it might be the shorter, more vanishing-resistant partial derivative products that keep the magnitude of dL_t/dW away from 0. A more direct indicator could be dh_t/dh_1 or dh_t/dh_0, where h_t is the hidden state at step t; both are products of result (6) taken over the time steps. If such a quantity vanishes starting from, say, t = 100, then we can claim the model is unable to retain information for more than 100 steps.
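To illustrate what I mean, here is a rough sketch (not using see-rnn's API; layer choice, shapes, and horizons are arbitrary examples) of how the norm of dh_T/dh_k could be measured with tf.GradientTape by unrolling a SimpleRNNCell manually:

```python
# Rough sketch: measure ||dh_T / dh_k|| for a few horizons k by unrolling a
# SimpleRNNCell inside a persistent GradientTape. All sizes are arbitrary.
import tensorflow as tf

T, batch, features, units = 120, 4, 8, 16
cell = tf.keras.layers.SimpleRNNCell(units)
x = tf.random.normal((batch, T, features))
h = tf.zeros((batch, units))           # h_0

states = []                            # states[k] will hold h_k
with tf.GradientTape(persistent=True) as tape:
    for t in range(T):
        tape.watch(h)                  # needed for h_0; harmless for later h_t
        states.append(h)
        _, new_states = cell(x[:, t], [h])
        h = new_states[0]
    h_T = h

# Frobenius norm of the batched Jacobian dh_T/dh_k, shape (batch, units, units);
# a value near 0 for small k suggests step-k information does not reach step T
for k in (0, 20, 100):
    jac = tape.batch_jacobian(h_T, states[k])
    print(f"k={k:3d}  ||dh_T/dh_k|| = {tf.norm(jac).numpy():.3e}")
del tape
```

For a vanilla tanh RNN I would expect this norm to decay as T − k grows, which is exactly the behaviour I think would be worth visualizing; an actual implementation in see-rnn might of course look quite different.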

That being said, I am not really an expert on RNNs and am just raising an idea here. I would really appreciate it if you could take a look at whether my understanding is correct, and whether the statistic dh_t/dh_1 (or dh_t/dh_0) could be implemented. Thanks in advance!

OverLordGoldDragon commented 2 years ago

Too rusty on RNNs to validate any of this, I fear. SE might help. I'm also no longer developing this repository, but I'm open to reviewing merge-ready contributions.