Overfit Goldberg Variations

purzelrakete commented 3 years ago

What

Try to overfit the the Goldberg Variations using the Tracks dataset. We can already overfit fine on a single track, the Goldberg Variations Aria, but not on 2017 maestro. Let's try to use the same track, but read using the full datasets pipeline. The single track dataset is using a different code path, since it's held in memory.

Keep in mind that Track is heavily overlapping examples. Aria itself is 299.52 seconds long. In Track, 80% of this (239.62 seconds) is assigned to the trainset. This is turned into 466 seconds in Track, and truncated to 299 seconds in Tracks. Testing on Tracks vs Track should be made comparable in terms of what is shown to the network.

Need to check, but I think this works out to ~100 * ~240 unique seconds shown to the model in Track across 50 epochs.

Hypothesis

I expect to be able to overfit the Aria as before, using no batch norm. I also expect to be able to overfit a handful of tracks from the same Goldberg Variations recording.

Results

Write up the results of your experiment once it has completed and has been analysed. Include links to the treatment run, and also to the baseline if appropriate.

Acceptance Criteria

[x] Train on full aria using Track codepath
[x] Train on aria using the Tracks codepath
[x] Train on the first 4 variations

purzelrakete commented 3 years ago

Experiment

The baseline here is the Track experiment which was able to successfully overfit the Aria in #27. Things are not really comparable here though, since in Track, the Aria is cut into 50% overlapping segments, whereas in Tracks, we have no overlapping of audio. To make it more comparable, I took out the overlapping, removed the 20% train split, and ran Track again on 100% of the Aria.

Modified tracks experiment track-like-tracks-1631387182:

bin/train track-like-tracks \
  -p batch_norm False 
  -p max_epochs 75

I had to increase from 50 to 75 epochs to obtain comparable samples. This can be motivated by the 50% overlap, which added a factor of 1.5x more passes over the same data. So this is comparable to running 1.5x as many epochs.

Aria only experiment fromdir-1631391580:

bin/train fromdir  \
  -p root_dir /srv/datasets/goldberg/aria \
  -p cache_dir /srv/datasets-ssd/goldberg/aria \
  -p learning_rate 0.0026 \
  -p with_all_chans 256  \
  -p max_epochs 75

Four tracks experiment fromdir-1631393924:

bin/train fromdir  \
  -p root_dir /srv/datasets/goldberg/four \
  -p cache_dir /srv/datasets-ssd/goldberg/four \
  -p learning_rate 0.0026 \
  -p with_all_chans 256  \
  -p max_epochs 100

I did run a further experiment on the whole recording, but that crashed with an exploding loss after a few hours of training. Instead of trying with batch norm, I will defer this to a later experiment, since I've already learned a bit from the other ones.

purzelrakete commented 3 years ago

Results

Training likelihoods only, since we're only interested in overfitting.

	training nll
baseline	0.00065
aria	0.00228
four	0.00976

It seems that you really need to hit around 0.0005 (99.95% likelihood) or less to get it to repeat back the training examples verbatim. In fact the baseline, which is modified as described above, is also not 100% perfectly repeating the audio, which I attribute to the fact that overlapping improves the problems discussed in #7. The beginning of the sample sounds great, but then it trails off a bit.

Both the aria and four experiments manage to overfit down to a loss below 0.01 (99% likelihood), which I would consider almost perfect overfitting, even if you can't exactly retrieve the input.

We can say that

We managed to reproduce the baseline tracks result with more or less perfect sounding audio generated
Results when reading aria via the Tracks dataset, instead of via the Track dataset, are worse by a factor of 3.5x in log space. The generated audio is also audibly worse. But in likelihoods, the baseline is at 99.935%, and the aria at 99.772%. So almost the same in likelihood space.

Insights

I initially trained on the currently standard channels setup, which is

n_chans: 128
n_chans_embed: 256
n_chans_end: 256
n_chans_res: 96
n_chans_skip: 256

This results in a model with ~3MM parameters. I could not get this model to overfit the aria, although maybe with many more epochs I might have managed (can't remember if I tried that or not). But with all 256 chans, I can overfit, and these are the results we see here. That's with a model that has ~13MM parameters.

So maybe not being able to overfit maestro is simply because I have a high bias model unless I take it up to 256 across the board. Note that in the original Wavenet, and also in the Maestro papers, they have 512. I initially turned this way down, because it was taking me 2 days to even train on 1 epoch of 2017 maestro. I have since doubled the speed of the network, but that's still not really viable when you have 2 gpus.

I have also realised something critical. Being able to overfit on larger datasets is a question of capacity. But being able to actually generate random piano, like we can hear in the google samples, is probably super hard. The reason is that you have to model

the instrument
the room acoustics
the recording setup
extraneous noises – breathing in the audience, creaking chairs
the performance. rhythmic touch, dynamics, phrasing
harmonic sound production, notes interacting in chords
the composer. let's model J.S. Bach, by listening to one recording

On the other hand, the model can start to learn all of those things rather naively, e.g. the composer can be imitated by learning only single chords and stringing them together, like in a unigram language model. Nevertheless – it's a hard task. I presume that we can only start to actually draw sounds after giving the model 512 channels of capacity, and all of the maestro dataset. Would have to check the Maestro paper again to see what they did to generate the unconditional model.

Open questions

Is there a problem with the Tracks dataset? It might be good to do a float by float comparison of the two dataset, to see if the results are really the same.
Is it possible to train a conditioned model, and then get it to play unconditioned output? This might be the path to getting pianist-like performances out of the model.

feldberlin / wavenet