Sunnydreamrain opened this issue 6 years ago
Hi @Sunnydreamrain

No, `grad_last` is not necessarily always zero. In some cases the model passes the last cell state `c_t` into subsequent model components. For example, in a sequence-to-sequence task (machine translation), the last cell state of the source sentence is provided to the decoder as its initial cell state. As a result, during gradient backpropagation the gradient (`grad_last`) of `c_t` must be passed back, and so must the gradients of `c_{t-1}`, `c_{t-2}`, etc.

Of course, when `c_t` is never used in subsequent computation, PyTorch provides a `grad_last` that is all zeros.
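To illustrate the two cases, here is a small PyTorch sketch (not SRU's code; the loss expressions are made up for demonstration). When the final cell state feeds later computation, the gradient reaching it is nonzero; when it is unused, no gradient flows to it at all:

```python
import torch

torch.manual_seed(0)
rnn = torch.nn.LSTM(input_size=4, hidden_size=3)
x = torch.randn(5, 1, 4)  # (seq_len, batch, input_size)
out, (h_n, c_n) = rnn(x)

# Case 1: c_n seeds further computation (a stand-in for a decoder),
# so the gradient flowing back into it -- the role of grad_last -- is nonzero.
loss1 = out.sum() + (2.0 * c_n).sum()
(g1,) = torch.autograd.grad(loss1, c_n, retain_graph=True)
print(g1)  # every entry is 2.0: a nonzero grad_last

# Case 2: the loss never touches c_n, so no gradient reaches it;
# the backward kernel would then be handed an all-zero grad_last.
loss2 = out.sum()
(g2,) = torch.autograd.grad(loss2, c_n, allow_unused=True)
print(g2)  # None -- materialized as zeros before the kernel runs
```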
Hi, at the following line for gradient backpropagation,

```c
float cur = *(grad_last + col);
```

shouldn't `cur` be 0? `grad_last` is the gradient of the last cell state `c_t`. So when calculating the gradient of `c_t` in the lines below, it should be initialized to 0 (there is no gradient from `c_{t+1}`), and the value `gc` should then equal `grad_last`. Is this the case?

```c
const float tmp = g2*calc_grad_activation(activation_type, c_val);
const float gc = gh_val*mask*tmp + cur;
```

https://github.com/taolei87/sru/blob/master/cuda_functional.py#L131

Thanks.
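For readers following along: the role of that load can be seen in a simplified scalar sketch of the backward recurrence (hypothetical names, not the actual CUDA kernel; the forward step assumed here is `c_t = f_t * c_{t-1} + (1 - f_t) * x_t`). `cur` starts from `grad_last` and then carries the gradient from `c_{t+1}` back to `c_t` at every step:

```python
def backward_cell_grads(f, grad_c_direct, grad_last):
    """Scalar sketch of the cell-state backward pass.

    f             -- forget gates f_1..f_T from the forward pass
    grad_c_direct -- dL/dc_t contributed directly at each step
    grad_last     -- dL/dc_T from downstream use of the final state
    """
    T = len(f)
    grads = [0.0] * T
    cur = grad_last  # what `float cur = *(grad_last + col);` loads
    for t in range(T - 1, -1, -1):
        gc = grad_c_direct[t] + cur   # total gradient reaching c_t
        grads[t] = gc
        cur = gc * f[t]               # portion flowing back to c_{t-1}
    return grads

# With grad_last = 1.0 and no direct per-step gradient, the final step
# receives 1.0 and earlier steps receive it scaled by the forget gates.
print(backward_cell_grads([0.5, 0.5], [0.0, 0.0], 1.0))  # [0.5, 1.0]
```

Initializing `cur` to 0 would correspond to Case 2 only (final state unused); seeding it from `grad_last` handles both cases uniformly.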