cooijmanstim / recurrent-batch-normalization


Weight initialization scheme, identity matrix for hidden-to-hidden, why? #3

Closed by carlthome 7 years ago

carlthome commented 7 years ago

First, I love the empirical reasoning behind initializing the gamma values below 1 rather than at unit variance. Good stuff! :dancer:

However, I'm curious why you didn't just use orthogonal initialization throughout your experiments. Your paper states that for particular tasks you got better generalization with the batch normalized LSTM when you let the hidden-to-hidden weights start from the identity matrix. Do you think this has more to do with the specific tasks than the model? It struck me as a fairly exotic initialization and thus I worry that it might be especially important with your particular model (that I'm trying to reimplement). Could you elaborate a little, please?

Also, does the identity apply only to the hidden-to-hidden weights for the cell input (g in the paper, eq. 6), or also to those for the gates (e.g. is this right?)?

cooijmanstim commented 7 years ago

Thanks for your interest! I also feel that the orthogonal vs. identity initialization is a task-dependent distraction. We don't conclude anything about these initializations in our paper, and I think both will yield very similar results. However, we happened to try both on some of the tasks and reported the best results.

To be precise about our initialization: we apply the Lasagne orthogonal initialization (https://github.com/cooijmanstim/recurrent-batch-normalization/blob/master/penntreebank.py#L39) to the rectangular matrices W_h and W_x (both of shape 4n by n, where n is the number of hidden units). This is what the paper refers to as "orthogonal" initialization. It's a bit nonsensical to apply an "orthogonal" initialization to rectangular matrices, but that's what we did.

"Identity" initialization is the same, except that the hidden-to-hidden submatrix (the weights for g) are set to the identity.

So the code you link to does not quite match what we did; however, it is what we should have done. If you're trying to implement it, don't fret too much about the initialization scheme. :-)
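
For concreteness, here is a minimal NumPy sketch of the scheme described above; the helper names and the gate ordering are assumptions for illustration, not the repo's exact code:

```python
import numpy as np

def orthogonal(shape, rng=np.random):
    # Lasagne-style "orthogonal" init: SVD of a Gaussian draw, keeping the
    # factor whose shape matches (also applied to rectangular matrices).
    a = rng.normal(0.0, 1.0, shape)
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    return u if u.shape == shape else vt

def identity_recurrent_init(n, gate_order=("i", "f", "g", "o")):
    # "Identity" init: orthogonal init of the full 4n-by-n recurrent matrix,
    # then overwrite the submatrix feeding the cell candidate g with I.
    # The gate ordering here is an assumption; match it to your implementation.
    W_h = orthogonal((4 * n, n))
    start = gate_order.index("g") * n
    W_h[start:start + n, :] = np.eye(n)
    return W_h
```

W_x would get the plain orthogonal call with the same 4n-by-n shape.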

carlthome commented 7 years ago

Thanks! That helped a lot. Yeah, I'm also lazily doing SVD on the rectangular W_h and W_x to determine an orthogonal basis. :panda_face:

I've added your recurrent batch normalization to a ConvLSTM for TensorFlow here. I think I got it right. Care to look at it / test it out?

cooijmanstim commented 7 years ago

I'm not familiar with tensorflow.contrib.layers.batch_norm, but if I understand your code correctly, I don't think you're keeping separate population (aka "moving average") statistics for separate timesteps.
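
In case it helps, a minimal framework-agnostic sketch (names are hypothetical, not the paper's code) of what "separate population statistics per timestep" means in practice:

```python
import numpy as np

class PerTimestepBatchNormStats:
    # One moving mean/variance pair *per timestep*, instead of a single
    # pair shared across the whole sequence.
    def __init__(self, num_features, max_steps, momentum=0.99):
        self.momentum = momentum
        self.pop_mean = np.zeros((max_steps, num_features))
        self.pop_var = np.ones((max_steps, num_features))

    def update(self, t, batch_mean, batch_var):
        # called during training with the minibatch statistics at step t
        m = self.momentum
        self.pop_mean[t] = m * self.pop_mean[t] + (1.0 - m) * batch_mean
        self.pop_var[t] = m * self.pop_var[t] + (1.0 - m) * batch_var

    def normalize(self, t, x, eps=1e-5):
        # at inference time, normalize with the statistics stored for step t
        return (x - self.pop_mean[t]) / np.sqrt(self.pop_var[t] + eps)
```

Roughly speaking, that means not reusing one set of moving-average variables across every timestep of the unrolled LSTM.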

carlthome commented 7 years ago

Thanks! Update coming.