auspicious3000 / SpeechSplit

Unsupervised Speech Decomposition Via Triple Information Bottleneck
http://arxiv.org/abs/2004.11284
MIT License

Using ParallelWaveGan instead of Wavenet #41

Closed · vishal16babu closed this issue 3 years ago

vishal16babu commented 3 years ago

Is it possible to use the PWG vocoder (https://github.com/kan-bayashi/ParallelWaveGAN) instead of WaveNet on the output of the decoder? Specifically, do I need to change the frame length and frame hop to make the mel spectrograms compatible with PWG?

WaveNet inference is very slow, so it would help if we were able to use other neural vocoders directly. That way we could just fine-tune the given pretrained SpeechSplit models instead of training again from scratch.
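To make the question concrete, this is the kind of compatibility check I have in mind. The SpeechSplit values below are my reading of the preprocessing code, and the config keys follow the YAML files in kan-bayashi's repo, so please treat both as assumptions:

```python
import yaml

# Feature-extraction parameters SpeechSplit appears to use
# (my reading of make_spect_f0.py / utils.py; not verified).
speechsplit_feats = {
    "sampling_rate": 16000,
    "fft_size": 1024,
    "hop_size": 256,
    "num_mels": 80,
    "fmin": 90,
    "fmax": 7600,
}

# Compare against a ParallelWaveGAN training config.
with open("parallel_wavegan.v1.yaml") as f:
    pwg_config = yaml.safe_load(f)

for key, expected in speechsplit_feats.items():
    actual = pwg_config.get(key)
    if actual != expected:
        print(f"mismatch on {key}: SpeechSplit={expected}, PWG={actual}")
```

If any of these differ, one of the two models would have to be retrained on the other's features.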

auspicious3000 commented 3 years ago

Yes, it is possible. You would at least need to retrain one of the two models so that their spectrogram features are consistent.

vishal16babu commented 3 years ago

Thanks @auspicious3000, I will give it a try.

vishal16babu commented 3 years ago

Hi @auspicious3000, I looked at the spectrogram calculation code, and it does not look like a straightforward mel spectrogram calculation. I also tried librosa.feature.inverse.mel_to_audio(spec, sr=16000, n_fft=1024) to get audio via Griffin-Lim instead of WaveNet, and it produced a garbage signal, as expected.
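For reference, this is roughly what the preprocessing looks like to me, as a sketch rather than a verbatim copy; the exact constants come from my reading of the repo's code, so treat them as assumptions:

```python
import numpy as np
import librosa
from scipy import signal

# Sketch of the pipeline as I read it:
#   high-pass filter -> magnitude STFT -> mel projection -> dB floor/offset -> [0, 1]
mel_basis = librosa.filters.mel(sr=16000, n_fft=1024, n_mels=80,
                                fmin=90, fmax=7600).T        # (513, 80)
min_level = np.exp(-100 / 20 * np.log(10))                   # -100 dB floor

def wav_to_spect(x, sr=16000):
    # 30 Hz Butterworth high-pass, applied forward-backward (zero phase)
    b, a = signal.butter(5, 30 / (sr / 2), btype="high")
    y = signal.filtfilt(b, a, x)
    # Magnitude (not power) STFT, then the mel projection
    D = np.abs(librosa.stft(y, n_fft=1024, hop_length=256)).T  # (frames, 513)
    D_mel = D @ mel_basis                                      # (frames, 80)
    # dB scale with a floor and a -16 dB offset, then squeeze into [0, 1]
    D_db = 20 * np.log10(np.maximum(min_level, D_mel)) - 16
    return np.clip((D_db + 100) / 100, 0, 1)
```

So the stored spectrograms appear to be dB-scaled and normalized, which would explain why feeding them straight into a mel inverter produces noise. With that in mind: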

  1. Is there any specific reason why you're not using direct mel spectrograms as input features to the network?
  2. How can I invert the spectrograms returned by the network using Griffin-Lim, or anything other than a WaveNet trained on these custom spectrograms?

P.S.: I am not very familiar with the common preprocessing techniques used to calculate spectrograms, so any references that help explain the motivation behind the spectrogram calculation code would be much appreciated.

auspicious3000 commented 3 years ago
  1. To make it compatible with the WaveNet vocoder.
  2. You can train other vocoders as long as the spectrograms are consistent with the ones used here.
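
For anyone who lands here later: if the features match the sketch earlier in this thread, the normalization and dB scaling have to be undone before Griffin-Lim can work. A minimal sketch under those same assumptions:

```python
import numpy as np
import librosa

def spect_to_wav(S, sr=16000):
    # S: (frames, 80) spectrogram normalized to [0, 1], per the sketch above.
    # Undo the (dB + 100) / 100 normalization and the -16 dB offset.
    db = S * 100 - 100
    mel_amp = np.power(10.0, (db + 16) / 20)       # back to linear magnitude
    # Least-squares mel inversion plus Griffin-Lim. power=1.0 because the
    # forward transform used magnitude, not power.
    return librosa.feature.inverse.mel_to_audio(
        mel_amp.T, sr=sr, n_fft=1024, hop_length=256,
        fmin=90, fmax=7600, power=1.0, n_iter=64)
```

The result will still only be Griffin-Lim quality, but it should sound like speech rather than noise, which is a quick sanity check that the de-normalization matches the preprocessing.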