bshall / ZeroSpeech

VQ-VAE for Acoustic Unit Discovery and Voice Conversion
https://bshall.github.io/ZeroSpeech/

Use as Universal Vocoder #3

Closed andrey1362010 closed 4 years ago

andrey1362010 commented 4 years ago

@bshall Thank you for this implementation. Can I use this repository as a universal vocoder? I want to train tacotron with vq-vae features. Will this work?

bshall commented 4 years ago

Hi @andrey1362010,

That's an interesting question. The wavenet-vqvae paper showed that you can get good phoneme classification accuracy by simply associating each vq-vae code with its most common phoneme. So I'd guess that it has a good shot of working, but I can't be sure. I know that there has been some work on using vq-vae features for ASR, but I haven't seen anyone try it for TTS.
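To make the "most common phoneme" idea concrete, here is a minimal sketch of a majority-vote mapping from code indices to phonemes. It assumes you already have frame-aligned code indices and phoneme labels; the function names and the toy data are hypothetical and are not taken from the paper or from this repository.

```python
from collections import Counter, defaultdict


def map_codes_to_phonemes(codes, phonemes):
    """Map each code index to the phoneme it co-occurs with most often."""
    counts = defaultdict(Counter)
    for code, phoneme in zip(codes, phonemes):
        counts[code][phoneme] += 1
    # Pick the single most frequent phoneme for every code.
    return {code: c.most_common(1)[0][0] for code, c in counts.items()}


def frame_accuracy(codes, phonemes, mapping):
    """Fraction of frames whose mapped phoneme matches the label."""
    hits = sum(mapping[c] == p for c, p in zip(codes, phonemes))
    return hits / len(codes)


# Toy, made-up data: one code index and one phoneme label per frame.
codes = [3, 3, 3, 7, 7, 1, 1, 1]
phonemes = ["a", "a", "t", "t", "t", "s", "s", "a"]

mapping = map_codes_to_phonemes(codes, phonemes)
accuracy = frame_accuracy(codes, phonemes, mapping)
print(mapping, accuracy)
```

In practice the mapping would be estimated on one split and the accuracy measured on a held-out split; the sketch uses the same toy frames for both only to keep it short.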

If it does work it'd be interesting to see if you can train the TTS system with less supervision i.e. less than an hour of transcribed speech.

Let me know if you give it a try or need any help setting things up.

andrey1362010 commented 4 years ago

Hi @bshall, I lost interest in TTS :) But I started experimenting with VC. I am now training a vq-vae model on a single-speaker emotional dataset. It will be interesting to see whether the tokens separate by style or not. I have a couple of questions:

  1. Is it possible to train the encoder and decoder separately?
  2. Have you tried using MFCC features instead of spectrograms?

bshall commented 4 years ago

Hi @andrey1362010,

Great, let me know if it works, I'm interested to hear the results.

Unfortunately, it isn't possible to train the encoder and decoder separately. You can experiment with training a decoder to reconstruct the Mel-spectrograms instead. It's much quicker to train but doesn't disentangle the speaker identity as much.

I haven't tried MFCCs, but this paper reports better ABX scores when using MFCCs. They don't provide any voice conversion samples, so I'm not sure if this would translate into better audio.
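For context on how the two feature types relate: MFCCs are conventionally obtained by applying a DCT-II to each frame of log-Mel energies and keeping the first few coefficients. The sketch below shows only that step, in plain Python, with an unnormalized DCT-II and a made-up four-band frame. The function name is hypothetical, and real toolkits add scaling and other details, so treat this as an illustration and not as this repository's preprocessing.

```python
import math


def dct2(frame, num_coeffs):
    """Unnormalized DCT-II of one frame, truncated to num_coeffs outputs."""
    n = len(frame)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(frame))
        for k in range(num_coeffs)
    ]


# Made-up log-Mel frame with four bands of equal energy.
log_mel_frame = [1.0, 1.0, 1.0, 1.0]

# Keep the first three cepstral coefficients.
coeffs = dct2(log_mel_frame, 3)
print(coeffs)
```

A flat frame puts all of its energy into the zeroth coefficient, which is why the higher coefficients come out (numerically) zero here; they only become non-zero when the spectral envelope has shape.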

Good luck with the training (it takes a while).