Pozimek opened 5 years ago
I've come to think that changing L33, L34 and L36 to use `c.detach()` should fix this issue, but I'm not very confident about it:

```python
ci = torch.sigmoid(self.Wxi(x) + self.Whi(h) + c.detach() * self.Wci)
```
IMO, the gradient should flow through `c` only via the operations in L35 and L37.
The LSTM paper defines a specific update rule for the gradients of the 'peephole' connections. Specifically:
Based on my understanding of the code, the way these three variables are initialized (as asked in Issue 17) is an attempt at implementing this update rule, but I don't see how initializing them as `Variable`s helps. From my reading of the quoted part of the LSTM paper, the peephole connections should still be updated, but the gradient that updates them should stop there and not flow any further back. If that is the case, then this implementation is incorrect, though it may be that PyTorch cannot express such an operation if `.detach()` is not suitable for the job.
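For what it's worth, a minimal standalone sketch (tensor names loosely mirror the issue's `c` and `Wci`; the shapes and values here are made up) suggests `.detach()` does behave the way the truncated update rule wants: the peephole weight still receives a gradient through `c.detach() * Wci`, but no gradient flows back into `c` via that term.

```python
import torch

torch.manual_seed(0)
c = torch.randn(4, requires_grad=True)    # stand-in for the cell state
Wci = torch.randn(4, requires_grad=True)  # stand-in for the peephole weight

# Peephole gate with truncated gradient: c is detached inside the gate,
# analogous to the proposed change on L33.
i = torch.sigmoid(c.detach() * Wci)

# c is still used elsewhere (as on L35/L37), so it should receive a
# gradient only through this product, not through the gate's peephole term.
out = (i * c).sum()
out.backward()

# Wci.grad is nonzero: the peephole connection is still updated.
# c.grad equals i exactly, because the detached peephole path
# contributed nothing to d(out)/dc.
print(torch.allclose(c.grad, i.detach()))
print(Wci.grad.abs().sum() > 0)
```

If this is right, the fix would keep the paper's behavior (peephole weights learn, but their gradient stops at the gate) without needing any special support from PyTorch.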