I assumed sane defaults in PyTorch. As it turns out, PyTorch still doesn't have good default initialisations: https://github.com/pytorch/pytorch/issues/18182.
Consequently, we should have our own initialisations adapted to Wavenet.
The paper Convolutional Sequence to Sequence Learning shows a principled approach to initialising a convolutional network that uses Gated Linear Units; a sketch of that recipe is included below.
Several other key papers look at different initialisation strategies.
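As a reference point, here is a minimal sketch of what a ConvS2S-style initialisation adapted to the Wavenet convolutions could look like in PyTorch. The function name, the `feeds_gate` flag, and zeroing the biases are assumptions for illustration, not a settled design; the 4x variance factor for layers feeding a gated unit follows the ConvS2S paper.

```python
import math
import torch.nn as nn

def convs2s_init_(conv: nn.Conv1d, feeds_gate: bool = True, dropout: float = 0.0) -> None:
    # Scale the weight std by fan-in, following the ConvS2S recipe:
    # layers whose output feeds a gated unit get 4x the variance to
    # compensate for the variance the gate removes; dropout (if any)
    # scales the variance by the retain probability.
    fan_in = conv.in_channels * conv.kernel_size[0]
    retain = 1.0 - dropout
    gain = 4.0 if feeds_gate else 1.0
    nn.init.normal_(conv.weight, mean=0.0, std=math.sqrt(gain * retain / fan_in))
    if conv.bias is not None:
        nn.init.zeros_(conv.bias)
```

Which of the Wavenet convolutions (residual, skip, 1x1 output) should count as "feeding a gate" is exactly the kind of question this issue needs to settle.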
What
After a couple of days of training on Maestro, the model collapsed. See the run here. Gradients are reaching values of up to 20k, and in some cases biases have values of 1000.
To improve stability, initialisation has to be looked at carefully.
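Independently of which initialisation we land on, it would help to log per-parameter gradient norms each step so a blow-up like this is caught early rather than after days of training. A rough sketch; the threshold and the plain print are placeholders for whatever logging the runs already use.

```python
import torch

def report_grad_norms(model: torch.nn.Module, step: int, threshold: float = 1_000.0) -> None:
    # Call after loss.backward() and before optimizer.step().
    norms = {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }
    # Report only the worst offenders above the threshold.
    for name, norm in sorted(norms.items(), key=lambda kv: kv[1], reverse=True)[:5]:
        if norm > threshold:
            print(f"step {step}: {name} grad norm {norm:.1f}")
```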
Hypothesis
Exploding gradients prevent further training on large datasets. This is due to incorrect initialisation of the network. Introducing the correct initialisation will resolve this problem, and a full run on Maestro will complete without issue. Convergence of the network should also speed up.
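One cheap way to test this before committing to another multi-day Maestro run is to push random input through the freshly initialised network and check that per-layer activation statistics neither vanish nor blow up. A sketch, assuming the model exposes its dilated convolutions as `nn.Conv1d` modules:

```python
import torch

@torch.no_grad()
def activation_stds(model: torch.nn.Module, x: torch.Tensor) -> dict:
    # One forward pass on random input; record the output std of every Conv1d.
    stats, handles = {}, []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Conv1d):
            handles.append(module.register_forward_hook(
                lambda m, inp, out, name=name: stats.__setitem__(name, out.std().item())
            ))
    model(x)
    for handle in handles:
        handle.remove()
    return stats  # stds drifting towards 0 or exploding point to a bad init
```

If the hypothesis holds, the stds should stay roughly constant across the stack with the new initialisation, and drift with the current defaults.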
Results
Write up the results of the experiment once it has completed and been analysed. Include links to the treatment run, and to the baseline if appropriate.
Acceptance Criteria
Also: