feldberlin / wavenet

An unconditioned Wavenet implementation with fast generation.

Embed inputs #4

Closed purzelrakete closed 3 years ago

purzelrakete commented 3 years ago

What

Reproduce the train.track baseline, then change embed_inputs=True and compare against it.

Hypothesis

I would like to confirm that embed_inputs works properly, and that performance on a real-world audio snippet improves substantially over the baseline.

Anecdotally, modelling the inputs as categorical variables yields significantly better results. It's not clear to me why this should be true. The inputs of a digital sound file are ordinal in nature, since PCM data is quantised, but the underlying signal is continuous. Consequently it seems more natural to model it with continuous inputs. This is also how it's often done in the image domain, where images are normalised before being passed into networks; I don't recall seeing any vision work where images are one-hot encoded before being passed into the network.

On the other hand, the two main OSS WaveNet implementations both seem to have started with continuous inputs and then moved to categorical encodings, since this massively improved performance. It even seems that the Google WaveNet implementation did something similar, as the original authors stated privately. This is consistent with their previous papers, e.g. PixelCNN, where everything was treated as a categorical variable.
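For context, the categorical treatment follows from how the WaveNet paper quantises audio: mu-law companding maps each continuous sample in [-1, 1] to one of 256 classes, which can then be one-hot encoded or embedded. A minimal sketch (the function name is mine, not this repo's API):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    # Compand a float signal in [-1, 1], then quantise into mu + 1 classes.
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)

samples = np.linspace(-1.0, 1.0, 5)
classes = mu_law_encode(samples)  # silence (0.0) lands on class 128
```

The companding step spends more of the 256 classes near zero amplitude, which is where most audio energy sits.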

Results

Write up the results of your experiment once it has completed and has been analysed. Include links to the treatment run, and also to the baseline if appropriate.

Acceptance Criteria

purzelrakete commented 3 years ago

Experiments

Treatment experiment metrics here.

Run with embedded inputs using a single track (aria.wav). Squashed to mono.
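Squashing to mono here just means averaging the two channels; a toy sketch with a synthetic buffer standing in for the decoded track:

```python
import numpy as np

# Synthetic (channels, time) float PCM buffer standing in for aria.wav.
t = np.linspace(0.0, 8.0, 16)
stereo = np.stack([np.sin(t), np.cos(t)])

mono = stereo.mean(axis=0)  # average the two channels into one
```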

I was initially not able to generate a track from this run. I thought that this might be happening due to a problem in folding stereo. Input embeddings worked really well for the mono sinusoids, and also for the tiny dataset. The main difference I could see with the track dataset is that it's stereo.

🥇 Training with a mono track. Loss @6: 3.0. With embedded inputs loss @6: 2.2. Looks like a stereo bug.

I then wrote a test which fed mono and stereo inputs directly into the InputEmbedding layer and checked the results. I found a problem and fixed it.
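The actual test isn't shown here, but the shape contract it checks can be sketched standalone. Summing over channels is my assumption about how an embedding layer might combine mono and stereo inputs, not the repo's actual InputEmbedding code:

```python
import numpy as np

def embed(x, table):
    # x: (batch, channels, time) int class ids; table: (n_classes, dim).
    # Sum over channels so mono and stereo both yield (batch, time, dim).
    return table[x].sum(axis=1)

rng = np.random.default_rng(0)
table = rng.normal(size=(256, 8))
mono = rng.integers(0, 256, size=(4, 1, 100))
stereo = rng.integers(0, 256, size=(4, 2, 100))
```

A test along these lines pins down that the output width is independent of the channel count, which is exactly the property a stereo bug would break.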

purzelrakete commented 3 years ago

Results

The treatment run is overfitting heavily, which is expected with a single track. Keep in mind that stereo has twice as many logits to compute the loss over, so the numbers can't be compared directly. Still, it seems clear that the model converges massively faster. The training loss is now down to almost zero, which is exactly what I would want to see when trying to overfit a single track.
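One way to put mono and stereo runs on the same scale is to average the cross-entropy over every predicted position rather than summing, so the number reads as loss per logit. A numpy sketch, not the repo's loss code:

```python
import numpy as np

def mean_nll(logits, targets):
    # logits: (..., n_classes), targets: (...) int class ids.
    # Averaging over all positions makes the result per-logit, so runs
    # with different channel counts are directly comparable.
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -np.take_along_axis(logp, targets[..., None], axis=-1).mean()

# Uniform logits: the loss is log(256) regardless of mono vs stereo shape.
mono_loss = mean_nll(np.zeros((2, 1, 10, 256)), np.zeros((2, 1, 10), np.int64))
stereo_loss = mean_nll(np.zeros((2, 2, 10, 256)), np.zeros((2, 2, 10), np.int64))
```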

Insights