bshall / UniversalVocoding

A PyTorch implementation of "Robust Universal Neural Vocoding"
https://bshall.github.io/UniversalVocoding/
MIT License
Generating samples from generated Mel-spectrograms #13

Closed by francislata 4 years ago

francislata commented 4 years ago

@bshall - First of all, thank you for this implementation. In this issue, you mentioned that you generated an audio sample from a Mel-spectrogram produced by a VQVAE. It sounds pretty good.

My question is: how would one go about generating audio from Mel-spectrograms? Do we need to preprocess the Mel-spectrogram, if that's the only thing we're given?

bshall commented 4 years ago

Hi @francislata,

The generate.py script does generate audio from Mel-spectrograms: if you look at the code, it converts the raw audio into a Mel-spectrogram and then feeds that to the vocoder. If you want to use spectrograms created by another process (like Tacotron or something), they need to use the same parameters as I've used. You can find the parameters in config.json and the steps I used for preprocessing in preprocess.py.
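As a rough sketch of what "the same parameters" means in practice (this is not code from the repo), you could diff the two sets of settings before feeding anything to the vocoder. The helper `find_mismatches`, the key names, and every value below are illustrative assumptions; read the real ones from config.json and from your TTS system's configuration.

```python
def find_mismatches(vocoder_params, tts_params):
    """Return {name: (vocoder_value, tts_value)} for every setting that differs.

    A setting present on only one side is reported with None on the other.
    """
    mismatches = {}
    for name in sorted(set(vocoder_params) | set(tts_params)):
        vocoder_value = vocoder_params.get(name)
        tts_value = tts_params.get(name)
        if vocoder_value != tts_value:
            mismatches[name] = (vocoder_value, tts_value)
    return mismatches


# Hypothetical settings, for illustration only (not taken from config.json).
vocoder_params = {"sample_rate": 16000, "hop_length": 200, "win_length": 800, "n_mels": 80}
tts_params = {"sample_rate": 22050, "hop_length": 256, "win_length": 1024, "n_mels": 80}

for name, (vocoder_value, tts_value) in find_mismatches(vocoder_params, tts_params).items():
    print(f"{name}: vocoder={vocoder_value}, tts={tts_value}")
```

An empty result only shows that the listed numbers agree. It says nothing about steps that are not captured as parameters, such as normalisation or padding.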

francislata commented 4 years ago

@bshall - Can you be more specific about which parameters need to match?

If the Mel-spectrogram given to me was generated by some TTS system, can I not just take that and put it through the vocoder?

When I follow the padding of the Mel-spectrogram in preprocess.py, the generated audio is silent throughout. So I'm wondering how you preprocessed the Mel-spectrogram you sampled here so that it produces sound without having the reference waveform at all.

bshall commented 4 years ago

Hi @francislata, sorry about the delay.

I used librosa to generate the Mel-spectrograms, and the specific parameters (hop_length, win_length, etc.) can be found here. If you've got mels from a TTS system, the best approach would be to retrain the vocoder. To do that, you should replace the steps in preprocess.py with the exact steps used for preprocessing the mels for the TTS system (but include the padding step).

Unfortunately, different preprocessing does have a big effect, so it's very important that the preprocessing pipelines for the TTS system and the vocoder line up.
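To give a feel for why a single mismatched setting matters, here is a small arithmetic sketch. It assumes the vocoder emits hop_length samples per mel frame, which is typical for this kind of model but is an assumption here, and the sample rates and hop lengths are made-up values, not the ones from this repo.

```python
def frame_rate(sample_rate, hop_length):
    """Mel frames per second of audio."""
    return sample_rate / hop_length


def duration_seconds(num_frames, sample_rate, hop_length):
    """Approximate length of audio covered by num_frames mel frames."""
    return num_frames * hop_length / sample_rate


# Hypothetical settings, for illustration only.
vocoder = {"sample_rate": 16000, "hop_length": 200}
tts = {"sample_rate": 22050, "hop_length": 256}

num_frames = 400
intended = duration_seconds(num_frames, **tts)      # what the TTS system meant
rendered = duration_seconds(num_frames, **vocoder)  # what the vocoder would produce

print(f"TTS frame rate:     {frame_rate(**tts):.2f} frames/s")
print(f"Vocoder frame rate: {frame_rate(**vocoder):.2f} frames/s")
print(f"Intended {intended:.2f} s, rendered {rendered:.2f} s")
```

With these numbers the same 400 frames stand for different lengths of audio on each side, so the output would be time-stretched even before accounting for differences in the mel filterbank or amplitude scaling.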