feldberlin / wavenet

An unconditioned Wavenet implementation with fast generation.

Improve batching #7

Open purzelrakete opened 3 years ago

purzelrakete commented 3 years ago

What

Batching currently takes e.g. 2-second snippets of audio from the source. This makes it highly likely that samples begin in the middle of a note. Let's think about what we'd like to show our algorithm, versus what we're actually showing it.

We'd like the algorithm to logically listen to an entire album front to back. Ignoring randomisation for now, let's think about what would happen if we cut one 4-minute piece into 2-second snippets and fed them into the training loop one sample at a time. At the beginning of the piece, the wavenet starts with an empty receptive field, which is zero-padded to the left. Until the receptive field fills up, we treat the left of the track as if it were padded with 384ms of silence, the length of the receptive field. Coming up to the next snippet, the same thing happens: again we have 384ms of logical silence to the left. So as we go along, we have logically trained on the piece, but with 384ms of silence cut into it every 2 seconds. This is not really what we wanted the algorithm to see.
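To make the padding effect concrete, here's a minimal sketch of the naive scheme. The sample rate, snippet length and helper name are illustrative assumptions; only the 384ms receptive field figure comes from the issue.

```python
import numpy as np

def naive_snippets(track: np.ndarray, snippet_len: int, receptive_field: int):
    """Cut the track into fixed-length snippets; each one is conditioned
    on left zero padding, as if silence preceded it."""
    for start in range(0, len(track) - snippet_len + 1, snippet_len):
        snippet = track[start:start + snippet_len]
        # the model sees silence here even in the middle of the piece
        yield np.concatenate([np.zeros(receptive_field), snippet])

sr = 16_000                           # assumed sample rate
receptive_field = int(0.384 * sr)     # 384ms ≈ 6144 samples
snippet_len = 2 * sr                  # 2-second snippets
track = np.random.randn(4 * 60 * sr)  # stand-in for a 4-minute piece

# every 2 seconds of training signal is preceded by ~384ms of fake silence
snippets = list(naive_snippets(track, snippet_len, receptive_field))
```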

Note that there's an exception to this: audio that genuinely begins from silence, as at the beginning of a track. We definitely need to see some of these examples, because otherwise we can never learn how to start generating audio from nothing.
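As a small illustration of keeping those examples in the mix, a sampler could reserve some fraction of draws for genuine track starts, where left padding with silence is truthful. The `sample_snippet` helper and the `p_track_start` knob are hypothetical, not something in the codebase.

```python
import random
import numpy as np

def sample_snippet(tracks: list[np.ndarray], snippet_len: int,
                   p_track_start: float = 0.1) -> np.ndarray:
    """Draw one snippet, occasionally forcing it to begin at a track start
    so the model still learns to generate audio from silence."""
    track = random.choice(tracks)
    if random.random() < p_track_start or len(track) <= snippet_len:
        start = 0  # zero padding on the left is correct here
    else:
        start = random.randrange(len(track) - snippet_len)
    return track[start:start + snippet_len]
```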

There are a number of strategies to address this problem. In evaluating them, we should consider:

  1. The amount of wasted computation
  2. The correctness of the loss terms, as discussed above

Some ideas:

So: try each of these approaches and see what the impact is.
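One possible strategy, sketched here purely as an illustration rather than as one of the ideas from the issue: step through the track contiguously, prepend the real preceding audio as context instead of zeros, and mask the loss over the context region.

```python
import numpy as np

def contextual_snippets(track: np.ndarray, snippet_len: int, receptive_field: int):
    """Yield (audio, loss_mask) pairs where each snippet carries the real
    preceding receptive field as context and only the new samples are scored."""
    pos = 0
    while pos + snippet_len <= len(track):
        if pos == 0:
            # genuine track start: silence padding is truthful here
            context = np.zeros(receptive_field)
        else:
            context = track[max(0, pos - receptive_field):pos]
            context = np.pad(context, (receptive_field - len(context), 0))
        audio = np.concatenate([context, track[pos:pos + snippet_len]])
        # context samples are recomputed but excluded from the loss
        loss_mask = np.concatenate([
            np.zeros(receptive_field, dtype=bool),
            np.ones(snippet_len, dtype=bool),
        ])
        yield audio, loss_mask
        pos += snippet_len
```

A scheme along these lines trades some redundant computation over the context (criterion 1) for loss terms that are no longer conditioned on fake silence (criterion 2).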

Hypothesis

Training with an improved batching strategy should

Results

Write up the results of your experiment once it has completed and been analysed. Include links to the treatment run, and to the baseline where appropriate.

Acceptance Criteria