Hi,
As far as I can tell, the LSTM module in PyTorch initializes the hidden state to zero. Some blog posts (including one by Hinton) recommend initializing it as a learnable parameter instead. I believe that when the initial hidden state is learned, the network suffers less from covariate shift between the initial hidden state and subsequent hidden states. It would be great if this could be implemented; it may boost accuracy.
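In the meantime, this can be approximated by wrapping `nn.LSTM` and passing learnable initial states explicitly. A minimal sketch (the class name `LSTMWithLearnedInit` is just illustrative, not an existing API):

```python
import torch
import torch.nn as nn

class LSTMWithLearnedInit(nn.Module):
    """Sketch of an LSTM whose initial hidden/cell states are learnable parameters."""
    def __init__(self, input_size, hidden_size, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        # Learnable initial states with shape (num_layers, 1, hidden_size);
        # the batch dimension is expanded at forward time.
        self.h0 = nn.Parameter(torch.zeros(num_layers, 1, hidden_size))
        self.c0 = nn.Parameter(torch.zeros(num_layers, 1, hidden_size))

    def forward(self, x):
        batch = x.size(0)
        h0 = self.h0.expand(-1, batch, -1).contiguous()
        c0 = self.c0.expand(-1, batch, -1).contiguous()
        # Gradients flow back into h0/c0, so they are trained with the rest.
        return self.lstm(x, (h0, c0))

model = LSTMWithLearnedInit(input_size=8, hidden_size=16, num_layers=2)
out, (hn, cn) = model(torch.randn(4, 5, 8))  # batch=4, seq_len=5
```

Since `h0` and `c0` are registered as `nn.Parameter`, the optimizer updates them alongside the LSTM weights.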
Thanks.