NVIDIA / nv-wavenet

Reference implementation of real-time autoregressive wavenet inference
BSD 3-Clause "New" or "Revised" License

Running on arbitrary audio length #42

Closed · jongwook closed this 6 years ago

jongwook commented 6 years ago

Currently the PyTorch wrapper builds a single giant array for cond_input, which quickly exhausts GPU memory and makes inference on anything longer than about 10 seconds difficult.

This PR modifies the PyTorch wrapper to run inference in a streaming manner: it splits the mel spectrogram into groups of 80 frames (corresponding to 1 second of audio with the default config) and runs inference on each group in turn.

It seamlessly connects the autoregressive output of the previous split to the next, so the 80-frame split boundaries are not audible in the resulting audio.
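A minimal sketch of the chunked, streaming loop described above (the `wavenet.infer` call here is a stand-in for the actual nv-wavenet wrapper API, and the 80-frame chunk size follows the default config mentioned above):

```python
import torch

FRAMES_PER_CHUNK = 80  # roughly 1 second of audio with the default config

def streaming_infer(wavenet, cond_input):
    """Run inference one 80-frame mel chunk at a time instead of
    building one giant cond_input array.

    cond_input: mel spectrogram tensor of shape (batch, n_mels, n_frames).
    `wavenet.infer` is illustrative, not the exact wrapper signature.
    """
    outputs = []
    n_frames = cond_input.size(2)
    for start in range(0, n_frames, FRAMES_PER_CHUNK):
        chunk = cond_input[:, :, start:start + FRAMES_PER_CHUNK]
        # Each call should continue from the previous chunk's
        # autoregressive output so the boundaries stay inaudible.
        outputs.append(wavenet.infer(chunk))
    return torch.cat(outputs, dim=-1)
```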

To achieve this, this PR makes the following modifications:

I'm not sure if you're accepting PRs, but I hope you will! I'm open to suggestions on code formatting or any other issues.

julianzaidi commented 6 years ago

This is a nice implementation that makes it possible to use a bigger batch size and to synthesize longer utterances. However, I noticed three problems that could be addressed:

Thanks a lot for this PR; I wasn't able to run inference on several audio clips at once before your implementation. Now it is much easier, and I can generate 11 to 13 clips at a time. A short performance comparison on my 1080 Ti:

  1. Before: 1.64 s to generate 1 s of audio (batch size 1)

  2. Now: 0.21 s to generate 1 s of audio (batch size 13)
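For reference, taking the 0.21 s figure as the amortized per-clip cost, that is roughly a 1.64 / 0.21 ≈ 7.8× improvement in throughput per second of generated audio, on top of being able to process 13 clips at once.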

jongwook commented 6 years ago

@julianzaidi Thanks for the comments!

I was naively assuming that repeatedly calling nvWavenetInfer.run() would synthesize seamless audio, but indeed, this implementation produces clicks every second or so. (Interestingly, they weren't really audible in the bass guitar sound I used for testing.)

To properly synthesize seamless audio, the nvWavenetInfer instance should reuse the state from the last invocation of run(). Specifically, I am updating the implementation to reuse the last activation values stored in m_XtIn, which are currently ignored for the first few samples.
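Conceptually, the fix amounts to threading the persistent state through successive calls. A hypothetical sketch of the idea only — nv-wavenet does not expose `get_state`/`set_state` like this; the actual change reuses the activations kept in m_XtIn inside the nvWavenetInfer instance:

```python
def stateful_streaming_infer(wavenet, mel_chunks):
    """Hypothetical: carry autoregressive state across run() calls.

    get_state/set_state are illustrative stand-ins; in the real code
    the state lives in m_XtIn inside the nvWavenetInfer instance.
    """
    outputs = []
    state = None
    for chunk in mel_chunks:
        if state is not None:
            wavenet.set_state(state)       # hypothetical API
        outputs.append(wavenet.run(chunk))
        state = wavenet.get_state()        # hypothetical API
    return outputs
```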

In the meantime, the windowed smoothing approach should be a good workaround.
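One way to implement that windowed smoothing is to generate consecutive chunks with a small shared overlap and linearly cross-fade across it. A sketch, with the overlap length chosen arbitrarily here (it is not a value from this PR):

```python
import torch

def crossfade_concat(chunks, overlap=256):
    """Concatenate audio chunks whose ends overlap by `overlap` samples,
    linearly cross-fading the overlap region to mask boundary clicks.

    Assumes each consecutive pair of chunks was generated so that the
    last `overlap` samples of one cover the same time span as the first
    `overlap` samples of the next.
    """
    fade_in = torch.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in
    out = chunks[0]
    for nxt in chunks[1:]:
        # Blend the tail of the running output with the head of the
        # next chunk, then append the rest of the next chunk.
        blended = out[..., -overlap:] * fade_out + nxt[..., :overlap] * fade_in
        out = torch.cat([out[..., :-overlap], blended, nxt[..., overlap:]], dim=-1)
    return out
```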