The baseline here is the `Track` experiment, which was able to successfully overfit the Aria in #27. Things are not really comparable here though, since in `Track` the Aria is cut into 50% overlapping segments, whereas in `Tracks` we have no overlapping of audio. To make it more comparable, I took out the overlapping, removed the 80/20 train/test split, and ran `Track` again on 100% of the Aria.
Modified tracks experiment `track-like-tracks-1631387182`:

```sh
bin/train track-like-tracks \
-p batch_norm False \
-p max_epochs 75
```
I had to increase from 50 to 75 epochs to obtain comparable samples. This can be motivated by the 50% overlap in the baseline, which gave roughly 1.5x more passes over the same data, so running 1.5x as many epochs is comparable.
Aria only experiment `fromdir-1631391580`:

```sh
bin/train fromdir \
-p root_dir /srv/datasets/goldberg/aria \
-p cache_dir /srv/datasets-ssd/goldberg/aria \
-p learning_rate 0.0026 \
-p with_all_chans 256 \
-p max_epochs 75
```
Four tracks experiment `fromdir-1631393924`:

```sh
bin/train fromdir \
-p root_dir /srv/datasets/goldberg/four \
-p cache_dir /srv/datasets-ssd/goldberg/four \
-p learning_rate 0.0026 \
-p with_all_chans 256 \
-p max_epochs 100
```
I did run a further experiment on the whole recording, but that crashed with an exploding loss after a few hours of training. Instead of trying with batch norm, I will defer this to a later experiment, since I've already learned a bit from the other ones.
Training likelihoods only, since we're only interested in overfitting.
| experiment | training nll |
|---|---|
| baseline | 0.00065 |
| aria | 0.00228 |
| four | 0.00976 |
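For reference, the likelihood percentages quoted below are just exp(-nll) of the per-sample training NLL. A quick sketch of the conversion, using the table values above:

```python
import math

# Training NLLs from the table above; per-sample likelihood is exp(-nll).
nlls = {"baseline": 0.00065, "aria": 0.00228, "four": 0.00976}
for name, nll in nlls.items():
    print(f"{name}: {math.exp(-nll):.3%}")
# baseline: 99.935%, aria: 99.772%, four: 99.029%
```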
It seems that you really need to hit around 0.0005 (99.95% likelihood) or less to get it to repeat back the training examples verbatim. In fact the baseline, modified as described above, also does not repeat the audio 100% perfectly, which I attribute to removing the overlapping, since overlapping mitigates the problems discussed in #7. The beginning of the sample sounds great, but then it trails off a bit.
Both the aria and four experiments manage to overfit down to a loss below 0.01 (99% likelihood), which I would consider almost perfect overfitting, even if you can't exactly retrieve the input.
We can say that results obtained via the `Tracks` dataset, instead of via the `Track` dataset, are worse by a factor of ~3.5x in log space. The generated audio is also audibly worse. But in likelihoods, the baseline is at 99.935% and the aria at 99.772%, so they are almost the same in likelihood space. I initially trained with the currently standard channels setup, which is:
```yaml
n_chans: 128
n_chans_embed: 256
n_chans_end: 256
n_chans_res: 96
n_chans_skip: 256
```
This results in a model with ~3MM parameters. I could not get this model to overfit the Aria, although maybe with many more epochs I might have managed (I can't remember if I tried that or not). But with all channels at 256, I can overfit, and those are the results we see here. That model has ~13MM parameters.
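For a rough sense of where those counts come from, here is a back-of-envelope estimate for a gated WaveNet stack. The layer count, kernel size, class count, and exact wiring below are assumptions for illustration, not this repo's actual architecture, so treat the totals as order-of-magnitude only:

```python
def wavenet_param_estimate(n_res, n_skip, n_end, n_embed,
                           n_layers=30, kernel=2, n_classes=256):
    # Per residual layer: gated dilated conv plus 1x1 residual and skip projections.
    per_layer = (kernel * n_res * 2 * n_res  # dilated conv, tanh/sigmoid gates
                 + n_res * n_res             # 1x1 residual projection
                 + n_res * n_skip)           # 1x1 skip projection
    head = n_skip * n_end + n_end * n_classes      # post-skip 1x1 convs to logits
    embed = n_classes * n_embed + n_embed * n_res  # input embedding + projection
    return n_layers * per_layer + head + embed

print(wavenet_param_estimate(n_res=96, n_skip=256, n_end=256, n_embed=256))   # ~2.3M
print(wavenet_param_estimate(n_res=256, n_skip=256, n_end=256, n_embed=256))  # ~12.1M
```

With 30 layers assumed, this lands in the same ballpark as the ~3MM and ~13MM figures above; the point is that the dilated convs scale quadratically in the residual width, which is why pushing everything to 256 costs several times more parameters.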
So maybe not being able to overfit maestro is simply because I have a high-bias model unless I take it up to 256 across the board. Note that the original WaveNet, and also the Maestro paper, use 512. I initially turned this way down because it was taking me 2 days to train even 1 epoch of 2017 maestro. I have since doubled the speed of the network, but that's still not really viable with 2 GPUs.
I have also realised something critical. Being able to overfit on larger datasets is a question of capacity. But being able to actually generate random piano, like we can hear in the google samples, is probably super hard. The reason is how much you have to model: the composition itself, the sound of the piano, and so on.
On the other hand, the model can start to learn all of those things rather naively, e.g. the composer can be imitated by learning only single chords and stringing them together, like in a unigram language model. Nevertheless, it's a hard task. I presume that we can only start to actually draw sounds after giving the model 512 channels of capacity, and all of the maestro dataset. I would have to check the Maestro paper again to see what they did to generate the unconditional model.
Why are the results worse via the `Tracks` dataset? It might be good to do a float by float comparison of the two datasets, to see if the results are really the same.
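A minimal sketch of what that comparison could look like, assuming both pipelines can be iterated to yield float arrays over the same audio; the helpers and the way the datasets are constructed are illustrative, not the repo's actual API:

```python
import numpy as np

def flatten(dataset):
    """Materialise an iterable of audio examples as one flat float32 array."""
    return np.concatenate([np.asarray(x, dtype=np.float32).ravel() for x in dataset])

def compare(track_ds, tracks_ds, atol=0.0):
    """Compare the floats produced by two dataset pipelines over their shared prefix."""
    a, b = flatten(track_ds), flatten(tracks_ds)
    n = min(len(a), len(b))
    diff = np.abs(a[:n] - b[:n])
    print(f"lengths: {len(a)} vs {len(b)}, max abs diff: {diff.max()}, "
          f"values differing by more than {atol}: {(diff > atol).sum()}")
```

Usage would be something like `compare(track_dataset, tracks_dataset)` with both datasets pointed at the Aria; any systematic offset, truncation, or dtype difference would show up immediately.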
What
Try to overfit the Goldberg Variations using the `Tracks` dataset. We can already overfit fine on a single track, the Goldberg Variations Aria, but not on 2017 maestro. Let's try to use the same track, but read it using the full datasets pipeline. The single track dataset uses a different code path, since it's held in memory.

Keep in mind that `Track` consists of heavily overlapping examples. The Aria itself is 299.52 seconds long. In `Track`, 80% of this (239.62 seconds) is assigned to the trainset. This is turned into 466 seconds in `Track`, and truncated to 299 seconds in `Tracks`. Testing on `Tracks` vs `Track` should be made comparable in terms of what is shown to the network. Need to check, but I think this works out to roughly 100 passes over ~240 unique seconds shown to the model in `Track` across 50 epochs.

Hypothesis
I expect to be able to overfit the Aria as before, using no batch norm. I also expect to be able to overfit a handful of tracks from the same Goldberg Variations recording.
Results
Write up the results of your experiment once it has completed and has been analysed. Include links to the treatment run, and also to the baseline if appropriate.
Acceptance Criteria
- Overfit the Aria via the `Track` codepath
- Overfit the Aria via the `Tracks` codepath