Batching currently takes e.g. 2-second snippets of audio from the source, which makes it highly likely that samples begin in the middle of a note. Let's think about what we'd like to show our algorithm versus what we're actually showing it.
We'd like the algorithm to logically listen to an entire album front to back. Ignoring randomisation for now, consider what happens if we cut one 4-minute piece into 2-second snippets and feed it into the training loop one sample at a time. At the beginning of the piece, the WaveNet starts with an empty receptive field, which is zero-padded to the left; as the receptive field fills up, we treat the left of the track as if it were padded with 384ms of silence (the length of the receptive field). Coming to the next snippet, the same thing happens again: another 384ms of logical silence to the left. So by the end we have logically trained on the piece, but with 384ms of silence cut into it every 2 seconds. This is not what we wanted the algorithm to see.
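To make the 384ms figure concrete: it is the receptive field of the dilated convolution stack, expressed in wall-clock time. A small sketch of how it's computed; the layer counts, stack count, and 16 kHz sample rate here are illustrative assumptions, not the actual model configuration:

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field, in samples, of a stack of dilated causal convs."""
    return sum((kernel_size - 1) * d for d in dilations) + 1

# e.g. 6 repeats of 10 layers with dilations 1, 2, 4, ..., 512
stacks = 6
dilations = [2 ** i for i in range(10)] * stacks
rf = receptive_field(dilations)   # 6139 samples
rf_ms = 1000 * rf / 16_000        # ~383.7ms at an assumed 16 kHz rate
```

With this (assumed) configuration the receptive field lands very close to the 384ms quoted above; whatever the real layer layout is, the rest of the discussion only depends on the resulting length.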
Note that there's an exception to this: audio that genuinely begins from silence, as at the start of a track. We definitely need to see some of these examples, because otherwise we can never learn how to start generating audio from nothing.
There are a number of strategies to address this problem. In evaluating them we should consider:
The amount of wasted computation
The correctness of the loss terms, as discussed above
Some ideas:
Mask the first 384ms of output in the loss. Since we are discarding this information, we should make sure that all samples overlap by 384ms, so the masked piece is always learned in another example. Caveats: 384ms of wasted computation at the beginning of every example; however, we now have correct loss terms.
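A minimal sketch of this masking strategy, assuming a receptive field `rf` in samples and an example length of 2000 samples (both illustrative values). The offsets are chosen so consecutive examples overlap by exactly `rf`, and the first example of a track keeps its leading loss terms so the model still sees genuine starts-from-silence:

```python
import numpy as np

def overlapping_starts(track_len, example_len, rf):
    """Start offsets so consecutive examples overlap by rf samples."""
    stride = example_len - rf
    return list(range(0, track_len - example_len + 1, stride))

def loss_mask(example_len, rf, first_example=False):
    """1.0 where the loss counts, 0.0 over the contaminated warm-up
    region. The first example of a track keeps its leading terms."""
    mask = np.ones(example_len)
    if not first_example:
        mask[:rf] = 0.0
    return mask

starts = overlapping_starts(track_len=10_000, example_len=2_000, rf=384)
# consecutive windows share 384 samples, so every masked region is
# covered unmasked by the example before it
```

The per-example loss would then be `(mask * per_step_loss).sum() / mask.sum()`, so the masked steps still contribute computation (the waste noted above) but not gradient signal.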
Pad the convolutions in such a way that we don't logically have 384ms of silence at the left. This means we don't have to drop the leading loss terms. We'll still need to overlap examples by 384ms. Caveats: we now have to deal with more complicated code where the output and input time steps are not aligned; this will also have to be handled in the sampling loops. Changes leak all over the code and are hard to reason about. We don't lose any of our computation, though.
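The misalignment this strategy introduces can be sketched with two small helper functions (the names and the 384-sample receptive field are hypothetical, chosen only to illustrate the index bookkeeping):

```python
def valid_output_len(input_len, rf):
    """Output length of an unpadded ("valid") causal stack whose
    receptive field is rf samples: the first rf - 1 positions have no
    full context and produce no output."""
    return input_len - rf + 1

def target_index(output_index, rf):
    """Index of the input sample that output[output_index] lines up
    with: outputs are shifted right by rf - 1 relative to inputs."""
    return output_index + rf - 1

assert valid_output_len(2_000, 384) == 1617   # 383 leading steps produce nothing
```

Every consumer of the model's output (loss computation, sampling loop, any alignment with conditioning signals) has to apply this offset, which is exactly the "changes leak all over the code" caveat above.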
Batch a single piece in order. Each forward pass also returns the hidden states of the receptive field over the whole network; subsequent forward passes receive this state and use it to fill the internal left paddings of the network. This effectively means the algorithm slides over the track exactly. Caveats: we can't randomise within a track, and batching is harder with this strategy. It may also end up as complicated code.
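The state-carrying idea can be sketched as follows. A real implementation would carry the per-layer hidden activations; here raw samples stand in for them to keep the sketch short, and the class name and 4-sample receptive field are purely illustrative:

```python
class SlidingContext:
    """Carries the tail of each chunk forward so the next forward pass
    sees real left context instead of zero padding."""

    def __init__(self, rf):
        self.rf = rf
        self.state = [0.0] * rf   # zeros only at the very start of a track

    def next_input(self, chunk):
        """Prepend the carried context, then remember this chunk's tail."""
        padded = self.state + list(chunk)
        self.state = padded[-self.rf:]
        return padded

ctx = SlidingContext(rf=4)
first = ctx.next_input([1, 2, 3, 4, 5, 6])   # [0, 0, 0, 0, 1, 2, 3, 4, 5, 6]
second = ctx.next_input([7, 8])              # [3, 4, 5, 6, 7, 8]
```

Only the first chunk of a track ever sees zeros, so the loss terms are correct everywhere after the true start; the cost is the sequential dependency between chunks that makes shuffling and batching awkward.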
So: try each of these approaches and see what the impact is.
Hypothesis
Training with an improved batching strategy should
Avoid incorrect loss terms which encode chopped notes at the beginning of each example. This should lead to better NLL scores on test and training sets vs baseline.
Fully utilise every example in the training loop, meaning the model gets to see more data. With 2-second examples, we are losing 14.2% of data to chopped loss terms. This should lead to better NLL scores on test and training sets vs baseline.
Results
Write up the results of your experiment once it has completed and has been analysed. Include links to the treatment run, and also to the baseline if appropriate.
Acceptance Criteria
[ ] Decided which treatments to develop
[ ] Ran experiments vs baseline
[ ] Conclusion. Does the performance improve as expected?