Sunnydreamrain opened this issue 6 years ago
Hi @Sunnydreamrain

No, `grad_last` is not necessarily always zero. In some cases the model passes the last cell state `c_t` into subsequent model components. For example, in a sequence-to-sequence task (machine translation), the last cell state of the source sentence is provided to the decoder as its initial cell state. As a result, during gradient backpropagation the gradient (`grad_last`) of `c_t` must be passed back, and so must the gradients of `c_{t-1}`, `c_{t-2}`, etc.

Of course, when `c_t` is never used in subsequent computation, PyTorch provides a `grad_last` that is all zeros.
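To illustrate the two cases, here is a small PyTorch sketch (not SRU's code; the loss expressions are made up for demonstration). When the final cell state feeds later computation, the gradient reaching it is nonzero; when it is unused, no gradient flows to it at all:

```python
import torch

torch.manual_seed(0)
rnn = torch.nn.LSTM(input_size=4, hidden_size=3)
x = torch.randn(5, 1, 4)  # (seq_len, batch, input_size)
out, (h_n, c_n) = rnn(x)

# Case 1: c_n seeds further computation (a stand-in for a decoder),
# so the gradient flowing back into it -- the role of grad_last -- is nonzero.
loss1 = out.sum() + (2.0 * c_n).sum()
(g1,) = torch.autograd.grad(loss1, c_n, retain_graph=True)
print(g1)  # every entry is 2.0: a nonzero grad_last

# Case 2: the loss never touches c_n, so no gradient reaches it;
# the backward kernel would then be handed an all-zero grad_last.
loss2 = out.sum()
(g2,) = torch.autograd.grad(loss2, c_n, allow_unused=True)
print(g2)  # None -- materialized as zeros before the kernel runs
```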
Hi, at the following line for gradient backpropagation,

```c
float cur = *(grad_last + col);
```

shouldn't `cur` be 0? `grad_last` is the gradient of the last cell state `c_t`. So when calculating the gradient of `c_t` in the lines below, it should be initialized to 0 (there is no gradient from `c_{t+1}`), and the value `gc` should then equal `grad_last`. Is this the case?

```c
const float tmp = g2*calc_grad_activation(activation_type, c_val);
const float gc = gh_val*mask*tmp + cur;
```

https://github.com/taolei87/sru/blob/master/cuda_functional.py#L131

Thanks.
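For readers following along: the role of that load can be seen in a simplified scalar sketch of the backward recurrence (hypothetical names, not the actual CUDA kernel; the forward step assumed here is `c_t = f_t * c_{t-1} + (1 - f_t) * x_t`). `cur` starts from `grad_last` and then carries the gradient from `c_{t+1}` back to `c_t` at every step:

```python
def backward_cell_grads(f, grad_c_direct, grad_last):
    """Scalar sketch of the cell-state backward pass.

    f             -- forget gates f_1..f_T from the forward pass
    grad_c_direct -- dL/dc_t contributed directly at each step
    grad_last     -- dL/dc_T from downstream use of the final state
    """
    T = len(f)
    grads = [0.0] * T
    cur = grad_last  # what `float cur = *(grad_last + col);` loads
    for t in range(T - 1, -1, -1):
        gc = grad_c_direct[t] + cur   # total gradient reaching c_t
        grads[t] = gc
        cur = gc * f[t]               # portion flowing back to c_{t-1}
    return grads

# With grad_last = 1.0 and no direct per-step gradient, the final step
# receives 1.0 and earlier steps receive it scaled by the forget gates.
print(backward_cell_grads([0.5, 0.5], [0.0, 0.0], 1.0))  # [0.5, 1.0]
```

Initializing `cur` to 0 would correspond to Case 2 only (final state unused); seeding it from `grad_last` handles both cases uniformly.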