feldberlin / wavenet

An unconditioned Wavenet implementation with fast generation.

Batchnorm #18

Closed purzelrakete closed 3 years ago

purzelrakete commented 3 years ago

What

Test the effect of Batch Normalisation on test / train likelihood and training times.

Hypothesis

The current network is having some stability issues. This is likely due to a poor initialisation strategy, which is investigated in #11.

A different strategy is to batch normalise the network. Batch normed networks are able to overcome poor initialisation, and often reach better accuracy and shorter training times than careful initialisation alone.

Consequently, Batch norm should stabilise training here, and may also improve test / train likelihood and training time.

Note that none of the other public Wavenet implementations use Batch norm. I don't immediately see a reason why Batch norm should be inappropriate for a Wavenet, but need to think this through properly.
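
For reference (a minimal sketch, not code from this repo): nn.BatchNorm1d on a (batch, channels, samples) tensor normalises each channel over both the batch and time axes, so any placement inside the network would inherit that behaviour.

import torch
from torch import nn

# BatchNorm1d over (batch, channels, samples): each channel is normalised
# across the batch and time axes, then scaled and shifted per channel.
bn = nn.BatchNorm1d(num_features=256)
x = torch.randn(8, 256, 1024)           # (batch, channels, samples)
y = bn(x)

print(y.mean(dim=(0, 2)).abs().max())   # per-channel means close to 0
print(y.var(dim=(0, 2)).mean())         # per-channel variances close to 1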

Results

Write up the results of your experiment once it has completed and has been analysed. Include links to the treatment run, and also to the baseline if appropriate.

Acceptance Criteria

purzelrakete commented 3 years ago

Experiments

Baseline experiment notebook. Launched with:

bin/train maestro -p batch_size 24 -p learning_rate 0.005 -p max_epochs 2

I could not get a longer experiment to run without hitting NaN loss values, so I fixed it at 2 epochs, with the batch size and learning rate maxed out for the network config we are using. As a result we can't really compare treatment and baseline in this experiment.

Treatment experiment notebook. Launched with:

bin/train maestro -p batch_size 12 -p batch_norm True -p learning_rate 0.05 -p max_epochs 12

I added BatchNorm as a feature flag to the config. Running BatchNorm on Tiny and Sines shows a somewhat worse outcome when using a random decoder. I don't know why this should be the case.

There's a possibility that I introduced an error into the version which runs when batch norm is turned off. It's hard to say since I'm just eyeballing results in Tiny and Sines. It would be helpful to have some metrics for those datasets beyond just likelihood scores.
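
For what it's worth, one way to wire a flag like this (an illustrative sketch, not necessarily how the config here does it) is to swap in nn.Identity when the flag is off, which guarantees the batch_norm False path is exactly the old network:

from torch import nn

def maybe_batch_norm(n_features: int, batch_norm: bool) -> nn.Module:
    # Hypothetical helper: nn.Identity when the flag is off keeps the
    # un-normalised code path identical to the previous network.
    return nn.BatchNorm1d(n_features) if batch_norm else nn.Identity()

# e.g. inside a block (hypothetical wiring):
#   self.bn = maybe_batch_norm(out_chans, cfg.batch_norm)
#   y = self.bn(self.conv(x))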

purzelrakete commented 3 years ago

Results

This experiment doesn't have readily interpretable results, since the baseline ran for 2 epochs, while the treatment ran for 12. Here are the metrics at the last step:

            train nll   test nll
baseline    1.575       1.453
treatment   1.052       1.322

We can say that the treatment ends at a lower train and test NLL than the baseline, but with six times as many epochs this tells us little about the effect of Batch norm itself.

Insights

Reducing n_chans to 128 and n_chans_res to 128 or 96 cuts the parameter count by a factor of ~4x, from around 13MM (the previous full model, which set all channels to 256) to around 3MM, while improving performance on a 2 epoch run.
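
Most of the ~4x comes from conv parameters scaling roughly quadratically with channel width. A back-of-the-envelope per-block count, assuming every block has the gated conv plus res / skip 1x1 shape printed below:

def conv1d_params(in_chans, out_chans, kernel_size):
    # weights plus biases of a Conv1d
    return in_chans * out_chans * kernel_size + out_chans

def res_block_params(n_chans, n_chans_res, n_chans_skip=256):
    return (conv1d_params(n_chans, 2 * n_chans_res, 2)      # gated conv
            + conv1d_params(n_chans_res, n_chans, 1)         # res1x1
            + conv1d_params(n_chans_res, n_chans_skip, 1))   # skip1x1

print(res_block_params(256, 256))  # ~394k per block with all channels at 256
print(res_block_params(128, 96))   # ~87k per block in the reduced config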

The reduced network also trains something like 10x faster per epoch on a 2 epoch run, and its resulting test loss is better than before. These settings introduce a bottleneck into the residual blocks:

(0): ResBlock(
    (conv): Conv1d(128, 192, kernel_size=(2,), stride=(1,))
    (res1x1): Conv1d(96, 128, kernel_size=(1,), stride=(1,))
    (skip1x1): Conv1d(96, 256, kernel_size=(1,), stride=(1,))
  )

This suggests that it might be a good idea to look more into this bottleneck.
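
To make the bottleneck concrete, here is a minimal PyTorch sketch of a gated residual block with exactly the printed shapes (the gating is an assumption inferred from the 192 = 2 x 96 conv width, this is not the repo's code): the conv widens 128 to 192, the gate halves that to 96, and the res / skip 1x1s fan back out to 128 and 256.

import torch
from torch import nn
import torch.nn.functional as F

class BottleneckedResBlock(nn.Module):
    # Sketch of a gated residual block matching the printed shapes:
    # stream width 128, gated width 96, skip width 256.
    def __init__(self, n_chans=128, n_chans_res=96, n_chans_skip=256,
                 kernel_size=2, dilation=1):
        super().__init__()
        self.causal_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(n_chans, 2 * n_chans_res, kernel_size,
                              dilation=dilation)
        self.res1x1 = nn.Conv1d(n_chans_res, n_chans, 1)
        self.skip1x1 = nn.Conv1d(n_chans_res, n_chans_skip, 1)

    def forward(self, x):
        y = F.pad(x, (self.causal_pad, 0))           # left pad to stay causal
        gate, value = self.conv(y).chunk(2, dim=1)   # 192 channels -> 2 x 96
        y = torch.tanh(value) * torch.sigmoid(gate)  # gated activation
        return self.res1x1(y) + x, self.skip1x1(y)   # residual, skip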

Some more insights:

Open Questions