Hi @andrey1362010,
That's an interesting question. The wavenet-vqvae paper showed that you can get good phoneme classification accuracy by simply associating each vq-vae code with its most common phoneme. So I'd guess that it has a good shot of working, but I can't be sure. I know there has been some work on using vq-vae features for ASR, but I haven't seen anyone try it for TTS.
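For concreteness, that mapping is just a majority vote over frame-aligned labels. Here's a minimal sketch, assuming you already have frame-aligned sequences of code indices and phoneme labels (getting that alignment is up to you; it isn't something this repo produces directly):

```python
from collections import Counter, defaultdict

def code_to_phoneme_map(codes, phonemes):
    """Map each VQ code index to its most frequent aligned phoneme."""
    counts = defaultdict(Counter)
    for code, phone in zip(codes, phonemes):
        counts[code][phone] += 1
    return {code: counter.most_common(1)[0][0] for code, counter in counts.items()}

def phoneme_accuracy(codes, phonemes, mapping):
    """Fraction of frames where the mapped phoneme matches the true label."""
    correct = sum(mapping.get(c) == p for c, p in zip(codes, phonemes))
    return correct / len(codes)
```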
If it does work, it'd be interesting to see whether you can train the TTS system with less supervision, i.e. less than an hour of transcribed speech.
Let me know if you give it a try or need any help setting things up.
Hi @bshall, I lost interest in TTS :) but started experimenting with VC instead. Now I'm training a vq-vae model on a single-speaker emotional dataset. It'll be interesting to see whether the tokens separate by style or not. I have a couple of questions:
Hi @andrey1362010,
Great, let me know if it works; I'm interested to hear the results.
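If you want a quick, informal way to check whether the codes separate by style, you could just compare code-usage histograms per style. A rough sketch (the `(style_label, code_indices)` pairing is a hypothetical data structure you'd assemble yourself, not an output of this repo):

```python
from collections import Counter

def code_usage_by_style(utterances):
    """utterances: iterable of (style_label, code_indices) pairs.
    Returns a per-style Counter of code usage, so you can eyeball whether
    some codes are used (almost) exclusively by one style."""
    usage = {}
    for style, codes in utterances:
        usage.setdefault(style, Counter()).update(codes)
    return usage

# e.g. codes shared between two styles:
# usage = code_usage_by_style(data)
# overlap = set(usage["angry"]) & set(usage["neutral"])
```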
Unfortunately, it isn't possible to train the encoder and decoder separately. You can experiment with training a decoder to reconstruct the Mel-spectrograms instead. It's much quicker to train but doesn't disentangle the speaker identity as much.
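For reference, a mel decoder along those lines can be very simple. A minimal sketch with placeholder dimensions (this is not the architecture used in this repo), trained with an L1 or MSE loss against the ground-truth mels:

```python
import torch.nn as nn

class MelDecoder(nn.Module):
    """Toy decoder: maps a sequence of quantized latents to mel frames.
    All dimensions here are placeholders."""
    def __init__(self, code_dim=64, hidden=256, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(code_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, z):          # z: (batch, time, code_dim)
        h, _ = self.rnn(z)
        return self.proj(h)        # (batch, time, n_mels)
```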
I haven't tried MFCCs, but this paper reports better ABX scores when using them. They don't provide any voice conversion samples, though, so I'm not sure whether that would translate into better audio.
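If you want to try MFCC inputs, extraction is straightforward with librosa. A sketch, where the path and the window/hop parameters are placeholders you'd want to match to the encoder's frame rate:

```python
import numpy as np
import librosa

# Load audio at 16 kHz (placeholder path)
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame; 25 ms window / 10 ms hop at 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

# Deltas and delta-deltas are commonly stacked for ABX-style evaluations
features = np.vstack([
    mfcc,
    librosa.feature.delta(mfcc),
    librosa.feature.delta(mfcc, order=2),
])
```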
Good luck with the training (it takes a while).
@bshall Thank you for this implementation. Can I use this repository as a universal vocoder? I want to train tacotron with vq-vae features. Will this work?