We need to make sure that the random weight initialisation parameters we use (irange for uniform initialisation of the convolutional layers and istdev for Gaussian initialisation of the fully connected and softmax layers) do not force the initial weight kernel / weight matrix column norms above the max_kernel_norm and max_column_norm constraints in the model specification. If they do, the norm constraint will severely distort the updates applied to the weights, making learning very hard and potentially unstable.
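For reference, a max-norm constraint of this kind is typically enforced by rescaling any column (or kernel) whose norm exceeds the limit back onto the constraint surface after each update. The sketch below is a paraphrase of that behaviour in numpy, not the exact pylearn2 code, so treat the details (function name, eps value) as assumptions:

    import numpy as np

    def apply_max_col_norm(W, max_col_norm, eps=1e-7):
        # Rescale any column of W whose L2 norm exceeds max_col_norm so that
        # its norm equals max_col_norm; columns already inside the constraint
        # are left (essentially) unchanged. Sketch of a generic max-norm
        # projection, not the exact pylearn2 implementation.
        col_norms = np.sqrt(np.sum(W ** 2, axis=0))
        desired = np.clip(col_norms, 0.0, max_col_norm)
        return W * (desired / (eps + col_norms))

If the initial norms already violate the constraint, this rescaling fires on every update and keeps shrinking the weights back down, which is what distorts learning.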
This appears to have been a particular issue for the fully connected layer after the last convolutional layer: the input space to this layer, and hence the column dimension of its weight matrix, is very large, so even with very small istdev values the column norms of the randomly initialised weight matrix appear to exceed the maximum constraint. We think this might be at least part of the reason we are sometimes getting strange learning curves where the NLL suddenly jumps and/or gets stuck at some high value.
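To see why, note that for a column of d weights drawn i.i.d. from N(0, istdev^2) the expected squared column norm is d * istdev^2, so the column norm concentrates around istdev * sqrt(d). As an illustration (the numbers here are placeholders, not taken from our model): with a flattened convolutional output of d = 3200 and istdev = 0.05, the initial column norms will sit around 0.05 * sqrt(3200) ≈ 2.83, which would already exceed a max_column_norm of, say, 1.9.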
A simple way to set these parameters is to find the weight kernel / matrix dimensions by running the print_model.py script on a model pickle (this gives the sizes of the input/output spaces of each layer, from which the relevant weight kernel / matrix dimensions can be inferred), and then to find the irange / istdev value that gives an expected initial kernel / column norm equal to some scale factor in [0, 1] times the maximum constraint. Expected kernel / column norms for a given dimensionality and initialisation distribution can either be computed analytically or estimated with a simple Monte Carlo simulation. The kernel norms are calculated by pylearn2 as sqrt(sum(W**2, axis=(1, 2, 3))) and the column norms as sqrt(sum(W**2, axis=0)).
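A minimal sketch of the Monte Carlo route in numpy. The dimensions, constraint value and scale factor below are placeholders, not the ones from our model; they should be read off the print_model.py output and the model specification:

    import numpy as np

    rng = np.random.RandomState(1234)

    def expected_col_norm_gaussian(fan_in, istdev, n_samples=1000):
        # Average norm of a fan_in-dimensional column with i.i.d.
        # N(0, istdev^2) entries, matching sqrt(sum(W**2, axis=0)).
        W = rng.normal(0.0, istdev, size=(n_samples, fan_in))
        return np.sqrt(np.sum(W ** 2, axis=1)).mean()

    def expected_kernel_norm_uniform(kernel_shape, irange, n_samples=1000):
        # Average norm of a (channels, rows, cols) kernel with i.i.d.
        # U(-irange, irange) entries, matching sqrt(sum(W**2, axis=(1, 2, 3))).
        W = rng.uniform(-irange, irange, size=(n_samples,) + kernel_shape)
        return np.sqrt(np.sum(W ** 2, axis=(1, 2, 3))).mean()

    # Example: choose istdev so the expected initial column norm is
    # scale * max_column_norm for a fully connected layer with fan_in inputs.
    fan_in, max_column_norm, scale = 3200, 1.9, 0.5   # placeholder values
    # For Gaussian init the norm scales linearly with istdev (and for uniform
    # init linearly with irange), so estimate the norm at 1.0 and rescale:
    unit_norm = expected_col_norm_gaussian(fan_in, 1.0)
    istdev = scale * max_column_norm / unit_norm
    print(istdev, expected_col_norm_gaussian(fan_in, istdev))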