NVIDIA / nv-wavenet

Reference implementation of real-time autoregressive wavenet inference
BSD 3-Clause "New" or "Revised" License

Running on arbitrary audio length #42

Closed · jongwook closed this 6 years ago

jongwook commented 6 years ago

Currently the PyTorch wrapper builds a single giant array for cond_input, which quickly exhausts GPU memory and makes inference on anything longer than about 10 seconds difficult.

This PR modifies the PyTorch wrapper to run inference in a streaming manner: it splits the mel spectrogram into groups of 80 frames (corresponding to 1 second of audio with the default config) and runs inference on each group in turn.

It seamlessly connects the autoregressive output of the previous split to the next, so the 80-frame split boundaries are not audible in the resulting audio.
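A minimal sketch of the chunked, streaming loop described above (the `wavenet.infer` call here is a stand-in for the actual nv-wavenet wrapper API, and the 80-frame chunk size follows the default config mentioned above):

```python
import torch

FRAMES_PER_CHUNK = 80  # roughly 1 second of audio with the default config

def streaming_infer(wavenet, cond_input):
    """Run inference one 80-frame mel chunk at a time instead of
    building one giant cond_input array.

    cond_input: mel spectrogram tensor of shape (batch, n_mels, n_frames).
    `wavenet.infer` is illustrative, not the exact wrapper signature.
    """
    outputs = []
    n_frames = cond_input.size(2)
    for start in range(0, n_frames, FRAMES_PER_CHUNK):
        chunk = cond_input[:, :, start:start + FRAMES_PER_CHUNK]
        # Each call should continue from the previous chunk's
        # autoregressive output so the boundaries stay inaudible.
        outputs.append(wavenet.infer(chunk))
    return torch.cat(outputs, dim=-1)
```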

To achieve this, this PR makes the following modifications:

I'm not sure if you're accepting PRs, but I hope you will! I'm open to suggestions on code formatting or any other issues.

julianzaidi commented 6 years ago

This is a nice implementation that makes it possible to use a bigger batch size and to synthesize longer utterances. However, I noticed three problems that could be addressed:

Thanks a lot for this PR; I wasn't able to run inference on several audio clips at once before your implementation. Now it is much easier, and I can generate 11 to 13 clips at a time. A short performance comparison on my 1080 Ti:

  1. Before: 1.64 s to generate 1 s of audio (batch size 1)

  2. Now: 0.21 s to generate 1 s of audio (batch size 13)
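For reference, taking the 0.21 s figure as the amortized per-clip cost, that is roughly a 1.64 / 0.21 ≈ 7.8× improvement in throughput per second of generated audio, on top of being able to process 13 clips at once.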

jongwook commented 6 years ago

@julianzaidi Thanks for the comments!

I was naively assuming that repeatedly calling nvWavenetInfer.run() would synthesize seamless audio, but indeed, this implementation produces clicks every second or so. (Interestingly, they weren't really audible in the bass guitar sound I used for testing.)

To properly synthesize seamless audio, the nvWavenetInfer instance should reuse the state from the last invocation of run(). Specifically, I am updating the implementation to reuse the last activation values stored in m_XtIn, which are currently ignored for the first few samples.
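Conceptually, the fix amounts to threading the persistent state through successive calls. A hypothetical sketch of the idea only — nv-wavenet does not expose `get_state`/`set_state` like this; the actual change reuses the activations kept in m_XtIn inside the nvWavenetInfer instance:

```python
def stateful_streaming_infer(wavenet, mel_chunks):
    """Hypothetical: carry autoregressive state across run() calls.

    get_state/set_state are illustrative stand-ins; in the real code
    the state lives in m_XtIn inside the nvWavenetInfer instance.
    """
    outputs = []
    state = None
    for chunk in mel_chunks:
        if state is not None:
            wavenet.set_state(state)       # hypothetical API
        outputs.append(wavenet.run(chunk))
        state = wavenet.get_state()        # hypothetical API
    return outputs
```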

In the meantime, the windowed smoothing approach should be a good workaround.
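One way to implement that windowed smoothing is to generate consecutive chunks with a small shared overlap and linearly cross-fade across it. A sketch, with the overlap length chosen arbitrarily here (it is not a value from this PR):

```python
import torch

def crossfade_concat(chunks, overlap=256):
    """Concatenate audio chunks whose ends overlap by `overlap` samples,
    linearly cross-fading the overlap region to mask boundary clicks.

    Assumes each consecutive pair of chunks was generated so that the
    last `overlap` samples of one cover the same time span as the first
    `overlap` samples of the next.
    """
    fade_in = torch.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in
    out = chunks[0]
    for nxt in chunks[1:]:
        # Blend the tail of the running output with the head of the
        # next chunk, then append the rest of the next chunk.
        blended = out[..., -overlap:] * fade_out + nxt[..., :overlap] * fade_in
        out = torch.cat([out[..., :-overlap], blended, nxt[..., overlap:]], dim=-1)
    return out
```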