feldberlin / wavenet

An unconditioned Wavenet implementation with fast generation.

Batchnorm #18

Closed purzelrakete closed 3 years ago

purzelrakete commented 3 years ago

What

Test the effect of Batch Normalisation on test / train likelihood and training times.

Hypothesis

The current network is having some stability issues. This is likely due to a poor initialisation strategy, which is investigated in #11.

A different strategy is to batch normalise the network. Batch normed networks are able to overcome poor initialisation, and often reach better accuracy and shorter training times than careful initialisation alone.

Consequently, Batch norm should stabilise training here, and may also improve test / train likelihood and training time.

Note that none of the other public Wavenet implementations use Batch norm. I don't immediately see a reason why Batch norm should be inappropriate for a Wavenet, but need to think this through properly.
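
For reference (a minimal sketch, not code from this repo): nn.BatchNorm1d on a (batch, channels, samples) tensor normalises each channel over both the batch and time axes, so any placement inside the network would inherit that behaviour.

import torch
from torch import nn

# BatchNorm1d over (batch, channels, samples): each channel is normalised
# across the batch and time axes, then scaled and shifted per channel.
bn = nn.BatchNorm1d(num_features=256)
x = torch.randn(8, 256, 1024)           # (batch, channels, samples)
y = bn(x)

print(y.mean(dim=(0, 2)).abs().max())   # per-channel means close to 0
print(y.var(dim=(0, 2)).mean())         # per-channel variances close to 1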

Results

Write up the results of your experiment once it has completed and has been analysed. Include links to the treatment run, and also to the baseline if appropriate.

Acceptance Criteria

purzelrakete commented 3 years ago

Experiments

Baseline experiment notebook. Launched with:

bin/train maestro -p batch_size 24 -p learning_rate 0.005 -p max_epochs 2

I could not get a longer experiment to run without hitting NaN loss values, so I fixed it at 2 epochs, with the batch size and learning rate maxed out for the network config we are using. As a result we can't really compare treatment and baseline in this experiment.

Treatment experiment notebook. Launched with:

bin/train maestro -p batch_size 12 -p batch_norm True -p learning_rate 0.05 -p max_epochs 12

I added BatchNorm as a feature flag to the config. Running BatchNorm on Tiny and Sines shows a somewhat worse outcome when using a random decoder. I don't know why this should be the case.

There's a possibility that I introduced an error into the version which runs when batch norm is turned off. It's hard to say since I'm just eyeballing results in Tiny and Sines. It would be helpful to have some metrics for those datasets beyond just likelihood scores.
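
For what it's worth, one way to wire a flag like this (an illustrative sketch, not necessarily how the config here does it) is to swap in nn.Identity when the flag is off, which guarantees the batch_norm False path is exactly the old network:

from torch import nn

def maybe_batch_norm(n_features: int, batch_norm: bool) -> nn.Module:
    # Hypothetical helper: nn.Identity when the flag is off keeps the
    # un-normalised code path identical to the previous network.
    return nn.BatchNorm1d(n_features) if batch_norm else nn.Identity()

# e.g. inside a block (hypothetical wiring):
#   self.bn = maybe_batch_norm(out_chans, cfg.batch_norm)
#   y = self.bn(self.conv(x))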

purzelrakete commented 3 years ago

Results

This experiment doesn't have readily interpretable results, since the baseline ran for 2 epochs, while the treatment ran for 12. Here are the metrics at the last step:

            train nll   test nll
baseline    1.575       1.453
treatment   1.052       1.322

We can say that the treatment ends at a lower train and test NLL than the baseline, but with six times as many epochs this tells us little about the effect of Batch norm itself.

Insights

Reducing n_chans to 128 and n_chans_res to 128 or 96 cuts the parameter count by a factor of ~4x, from around 13MM (the previous full model, which set all channels to 256) to around 3MM, while improving performance on a 2 epoch run.
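
Most of the ~4x comes from conv parameters scaling roughly quadratically with channel width. A back-of-the-envelope per-block count, assuming every block has the gated conv plus res / skip 1x1 shape printed below:

def conv1d_params(in_chans, out_chans, kernel_size):
    # weights plus biases of a Conv1d
    return in_chans * out_chans * kernel_size + out_chans

def res_block_params(n_chans, n_chans_res, n_chans_skip=256):
    return (conv1d_params(n_chans, 2 * n_chans_res, 2)      # gated conv
            + conv1d_params(n_chans_res, n_chans, 1)         # res1x1
            + conv1d_params(n_chans_res, n_chans_skip, 1))   # skip1x1

print(res_block_params(256, 256))  # ~394k per block with all channels at 256
print(res_block_params(128, 96))   # ~87k per block in the reduced config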

The reduced network also trains something like 10x faster per epoch on a 2 epoch run, and its resulting test loss is better than before. These settings introduce a bottleneck into the residual blocks:

(0): ResBlock(
    (conv): Conv1d(128, 192, kernel_size=(2,), stride=(1,))
    (res1x1): Conv1d(96, 128, kernel_size=(1,), stride=(1,))
    (skip1x1): Conv1d(96, 256, kernel_size=(1,), stride=(1,))
  )

This suggests that it might be a good idea to look more into this bottleneck.
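
To make the bottleneck concrete, here is a minimal PyTorch sketch of a gated residual block with exactly the printed shapes (the gating is an assumption inferred from the 192 = 2 x 96 conv width, this is not the repo's code): the conv widens 128 to 192, the gate halves that to 96, and the res / skip 1x1s fan back out to 128 and 256.

import torch
from torch import nn
import torch.nn.functional as F

class BottleneckedResBlock(nn.Module):
    # Sketch of a gated residual block matching the printed shapes:
    # stream width 128, gated width 96, skip width 256.
    def __init__(self, n_chans=128, n_chans_res=96, n_chans_skip=256,
                 kernel_size=2, dilation=1):
        super().__init__()
        self.causal_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(n_chans, 2 * n_chans_res, kernel_size,
                              dilation=dilation)
        self.res1x1 = nn.Conv1d(n_chans_res, n_chans, 1)
        self.skip1x1 = nn.Conv1d(n_chans_res, n_chans_skip, 1)

    def forward(self, x):
        y = F.pad(x, (self.causal_pad, 0))           # left pad to stay causal
        gate, value = self.conv(y).chunk(2, dim=1)   # 192 channels -> 2 x 96
        y = torch.tanh(value) * torch.sigmoid(gate)  # gated activation
        return self.res1x1(y) + x, self.skip1x1(y)   # residual, skip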

Some more insights:

Open Questions