feldberlin / wavenet

An unconditioned Wavenet implementation with fast generation.

Improve training stability with better initialisation #11

Open purzelrakete opened 3 years ago

purzelrakete commented 3 years ago

What

After a couple of days of training on Maestro, the model collapsed. See the run here. Gradients are reaching values of 20k, and in some cases biases have reached values of 1000.

To improve stability, initialisation will have to be looked at carefully.
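As a first diagnostic, something like the sketch below could be hooked into the training loop to log the largest gradient and bias magnitudes each step, so a collapse like the one above is visible as soon as it starts. The helper name and logging style are placeholders, not code from this repo.

```python
import torch

def log_extremes(model: torch.nn.Module, step: int) -> None:
    """Log the largest absolute gradient and bias magnitude this step."""
    max_grad, max_bias = 0.0, 0.0
    for name, p in model.named_parameters():
        if p.grad is not None:
            max_grad = max(max_grad, p.grad.abs().max().item())
        if name.endswith("bias"):
            max_bias = max(max_bias, p.detach().abs().max().item())
    print(f"step {step}: max |grad| = {max_grad:.1f}, max |bias| = {max_bias:.1f}")

# call log_extremes(model, step) right after loss.backward()
```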

Hypothesis

Exploding gradients prevent further training on large datasets. This is due to incorrect initialisation of the network. Introducing a correct initialisation will resolve this problem, and a full run on Maestro will complete without issue. Convergence should also speed up.

Results

Write up the results of your experiment once it has completed and has been analysed. Include links to the treatment run, and also to the baseline if appropriate.

Acceptance Criteria

Also:

purzelrakete commented 3 years ago

Pytorch initialisation defaults

I assumed sane defaults in pytorch. As it turns out, pytorch still doesn't have good default initialisations: https://github.com/pytorch/pytorch/issues/18182.

Consequently we should have our own initialisations adapted to Wavenet.
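A minimal sketch of what that could look like, assuming a standard Kaiming scheme for the convolutions as a starting point. The function name and exact scheme are placeholders rather than the repo's code, and the gated activations will need an adjusted scale (see the next comment).

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    """Explicit initialisation instead of PyTorch's defaults."""
    if isinstance(module, (nn.Conv1d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# model.apply(init_weights)  # run once, right after constructing the Wavenet
```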

Inspiration

The paper Convolutional Sequence to Sequence Learning shows a principled approach to initialising a convolutional network which uses Gated Linear Units.
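The rule in that paper scales layers whose output feeds a GLU so that the gate's variance reduction is compensated. A hedged sketch of how it might carry over to this model follows; Wavenet's tanh·sigmoid gate is only approximately a GLU, so the exact constant may need tuning.

```python
import math
import torch.nn as nn

def glu_aware_init(conv: nn.Conv1d, dropout: float = 0.0) -> None:
    """Conv Seq2Seq-style init for a conv feeding a gated unit:
    weights ~ N(0, sqrt(4 * (1 - dropout) / fan_in)), biases zero."""
    fan_in = conv.in_channels * conv.kernel_size[0]
    std = math.sqrt(4.0 * (1.0 - dropout) / fan_in)
    nn.init.normal_(conv.weight, mean=0.0, std=std)
    if conv.bias is not None:
        nn.init.zeros_(conv.bias)
```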

purzelrakete commented 3 years ago

Papers

Some key papers looking at different initialisation strategies.

Basics

Other ideas