Closed: jongwook closed this pull request 6 years ago
This is a nice implementation that makes it possible to use a bigger batch size and to synthesize longer sentences. However, I noticed three problems that can be addressed:
You use `torch.split()` to split your mels into smaller segments, in order to avoid putting one big tensor directly onto the GPU. However, these small segments are in fact still stored on the GPU (see line 50 of the inference.py script: `mel = utils.to_gpu(mel)`), so you still cannot generate arbitrarily long sentences. What you could do instead is keep the mel segments on the CPU (remove line 50) and transfer them to the GPU one at a time, inside your `for` loop. Concretely:
```python
for mel in splits:
    mel = utils.to_gpu(mel)
    cond_input = model.get_cond_input(mel)  # 2R x B x num_layers x samples
    del mel
    torch.cuda.empty_cache()
    audio_data.append(wavenet.infer(cond_input, impl).cpu())
    del cond_input
    torch.cuda.empty_cache()
```
To perform batch inference, the source code uses `mels = torch.cat()`. This doesn't work in this case, because `torch.cat()` requires the tensors to match in every dimension except the one being concatenated, and the mel spectrograms have different lengths (see the documentation for further explanations). Unless there is a specific PyTorch function that can handle what we want to do, we must create our own:
```python
def pad_list(batch_list):
    """
    Function to pad the values of a batch list.

    Args:
        batch_list (list): list of batches, where the shape of the i-th item is (1, n_cond_channels, T_i)
    Return:
        (tensor): padded batch with the shape (B, n_cond_channels, T_max)
    """
    batch_size = len(batch_list)
    n_cond_channels = batch_list[0].size(1)
    max_len = max([batch.size(2) for batch in batch_list])
    batch_pad = torch.zeros(batch_size, n_cond_channels, max_len, dtype=batch_list[0].dtype)
    for idx, batch in enumerate(batch_list):
        batch_pad[idx, :, :batch.size(2)] = batch
    return batch_pad
```
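For example, a hypothetical usage with three mels of different lengths (the variable names here are just for illustration):

```python
mels_list = [mel_a, mel_b, mel_c]                       # each of shape (1, n_cond_channels, T_i)
mels = pad_list(mels_list)                              # (3, n_cond_channels, T_max), zero-padded
cond_input = model.get_cond_input(utils.to_gpu(mels))   # batched conditional input
```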
My last suggestion is about the way you split the mel spectrogram into smaller segments. I tried the `torch.split()` function, but it causes audible artefacts between two consecutive segments; in other words, you can tell that the audio was generated in multiple steps. I then tried splitting the mel spectrogram with a hop smaller than the segment length, so that consecutive segments overlap:
```python
for pos in range(0, mels.size(2), 79):
    splits.append(mels[:, :, pos: pos + 81])
```
With this splitting, consecutive audio segments share a certain number of samples: the last n samples of one segment are approximately the same as the first n samples of the next, and so on. All you have to do is blend these shared samples. I used a Hann window to cross-fade the audio at the boundary between two consecutive segments, and there are no more artefacts!
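For reference, here is a minimal sketch of such a Hann-window cross-fade (not my exact code; it assumes the generated segments are 1-D tensors and that you know how many audio samples they overlap by):

```python
import torch

def crossfade_concat(segments, overlap):
    """Concatenate 1-D audio tensors that share `overlap` samples at each
    boundary, cross-fading the shared region with a Hann window so that the
    segment boundaries are no longer audible."""
    window = torch.hann_window(2 * overlap, periodic=True)
    fade_in, fade_out = window[:overlap], window[overlap:]

    audio = segments[0].clone().float()
    for seg in segments[1:]:
        seg = seg.float()
        # blend the shared samples, then append the rest of the next segment
        audio[-overlap:] = audio[-overlap:] * fade_out + seg[:overlap] * fade_in
        audio = torch.cat([audio, seg[overlap:]])
    return audio
```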
Thanks a lot for this PR, I wasn't able to perform inference on several audio clips before your implementation. Now it is much easier and I can generate 11 to 13 clips at a time. As a short performance comparison with my 1080 Ti:

- Before: 1.64 s to generate 1 s of audio (batch size of 1)
- Now: 0.21 s to generate 1 s of audio (batch size of 13)
@julianzaidi Thanks for the comments!
I was naively assuming that repeatedly calling `nvWavenetInfer.run()` would seamlessly synthesize the audio, but indeed, this implementation produces clicks every second or so. (Interestingly, it wasn't really audible for the bass guitar sound I used for testing.)
To properly synthesize seamless audio, the `nvWavenetInfer` instance should reuse the state from the last invocation of `run()`. Specifically, I am updating the implementation to reuse the last activation values stored in `m_XtIn`, which are currently ignored for the first few samples.
In the meantime, the windowed smoothing approach should be a good workaround.
Currently the PyTorch wrapper builds a single giant array for `cond_input`, with which the GPU quickly runs out of memory, making inference for anything longer than 10 seconds difficult.

This PR modifies the PyTorch wrapper to run the inference in a streaming manner, by splitting the mel spectrogram into groups of 80 frames (corresponding to 1 second with the default config) and running the inference on one group at a time. It seamlessly connects the autoregressive output of the previous split to the next, so these 80-frame splits are not audible at all in the resulting audio.
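In outline, the streaming loop looks roughly like this (a condensed sketch rather than the exact code in inference.py; the variable names follow the snippets above, and the concatenation dimension is an assumption):

```python
splits = torch.split(mels, 80, dim=2)            # groups of 80 mel frames, ~1 s each
audio_data = []
for split in splits:
    cond_input = model.get_cond_input(utils.to_gpu(split))
    audio_data.append(wavenet.infer(cond_input, impl).cpu())
audio = torch.cat(audio_data, dim=-1)            # join the splits along the time axis
```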
To achieve this, this PR made the following modifications:

- Added a `silenceInputs()` call to the constructor of `nvWavenetInfer`. This is the only change in the root directory.
- Split `infer` into a `construct`-`infer`-`destruct` lifecycle, and made the corresponding edits in the `wavenet_infer.{cu,h}` and `wavenet_infer_wrapper.{c,h}` files.
- `nv_wavenet.py` now manages the `nvWavenetInfer` instance through its lifecycle and makes the appropriate calls to the `construct`, `infer`, and `destruct` functions. It sets `m_maxBatch` and `m_maxSamples` as the size of the first `cond_input` it takes; this allows running inference with a smaller conditional input than the first, e.g. the last split of the mel spectrogram. (A rough sketch of this lifecycle follows at the end of this description.)
- `inference.py` deals with splitting the mel spectrogram and calling `nv_wavenet` split by split. I've also added a verbose option to run the loop with `tqdm`; let me know if the maintainers don't favor this.

I'm not sure if you're accepting PRs, but I hope you will! I'm open to suggestions as to code formatting or any other issues.