I assumed sane defaults in PyTorch. As it turns out, PyTorch still doesn't have good default initialisations: https://github.com/pytorch/pytorch/issues/18182.
Consequently, we should have our own initialisations adapted to Wavenet.
The paper Convolutional Sequence to Sequence Learning shows a principled approach to initialising a convolutional network that uses Gated Linear Units; a sketch of that recipe is included below.
Several other key papers look at different initialisation strategies.
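As a reference point, here is a minimal sketch of what a ConvS2S-style initialisation adapted to the Wavenet convolutions could look like in PyTorch. The function name, the `feeds_gate` flag, and zeroing the biases are assumptions for illustration, not a settled design; the 4x variance factor for layers feeding a gated unit follows the ConvS2S paper.

```python
import math
import torch.nn as nn

def convs2s_init_(conv: nn.Conv1d, feeds_gate: bool = True, dropout: float = 0.0) -> None:
    # Scale the weight std by fan-in, following the ConvS2S recipe:
    # layers whose output feeds a gated unit get 4x the variance to
    # compensate for the variance the gate removes; dropout (if any)
    # scales the variance by the retain probability.
    fan_in = conv.in_channels * conv.kernel_size[0]
    retain = 1.0 - dropout
    gain = 4.0 if feeds_gate else 1.0
    nn.init.normal_(conv.weight, mean=0.0, std=math.sqrt(gain * retain / fan_in))
    if conv.bias is not None:
        nn.init.zeros_(conv.bias)
```

Which of the Wavenet convolutions (residual, skip, 1x1 output) should count as "feeding a gate" is exactly the kind of question this issue needs to settle.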
What
After a couple of days of training on Maestro, the model collapsed. See the run here. Gradients are reaching values of up to 20k, and in some cases biases have values of 1000.
To improve stability, initialisation has to be looked at carefully.
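Independently of which initialisation we land on, it would help to log per-parameter gradient norms each step so a blow-up like this is caught early rather than after days of training. A rough sketch; the threshold and the plain print are placeholders for whatever logging the runs already use.

```python
import torch

def report_grad_norms(model: torch.nn.Module, step: int, threshold: float = 1_000.0) -> None:
    # Call after loss.backward() and before optimizer.step().
    norms = {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }
    # Report only the worst offenders above the threshold.
    for name, norm in sorted(norms.items(), key=lambda kv: kv[1], reverse=True)[:5]:
        if norm > threshold:
            print(f"step {step}: {name} grad norm {norm:.1f}")
```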
Hypothesis
Exploding gradients prevent further training on large datasets. This is due to incorrect initialisation of the network. Introducing the correct initialisation will resolve this problem, and a full run on Maestro will complete without issue. Convergence of the network should also speed up.
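One cheap way to test this before committing to another multi-day Maestro run is to push random input through the freshly initialised network and check that per-layer activation statistics neither vanish nor blow up. A sketch, assuming the model exposes its dilated convolutions as `nn.Conv1d` modules:

```python
import torch

@torch.no_grad()
def activation_stds(model: torch.nn.Module, x: torch.Tensor) -> dict:
    # One forward pass on random input; record the output std of every Conv1d.
    stats, handles = {}, []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Conv1d):
            handles.append(module.register_forward_hook(
                lambda m, inp, out, name=name: stats.__setitem__(name, out.std().item())
            ))
    model(x)
    for handle in handles:
        handle.remove()
    return stats  # stds drifting towards 0 or exploding point to a bad init
```

If the hypothesis holds, the stds should stay roughly constant across the stack with the new initialisation, and drift with the current defaults.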
Results
Write up the results of the experiment once it has completed and been analysed. Include links to the treatment run, and to the baseline if appropriate.
Acceptance Criteria
Also: