Rongjiehuang / GenerSpeech

PyTorch Implementation of GenerSpeech (NeurIPS'22): a text-to-speech model towards zero-shot style transfer of OOD custom voice.

Some questions after reading the paper #4

Closed: enhuiz closed this issue 1 year ago

enhuiz commented 1 year ago

Hello authors,

I have read your paper and found it very promising. I have a few questions about the pre-training stage mentioned in the paper. Specifically, how important is fine-tuning wav2vec 2.0 for the speaker representation $\mathcal{G}_s$ and the emotion representation $\mathcal{G}_e$? Is it possible to train directly on the features produced by a frozen wav2vec 2.0, or would that significantly harm the style similarity? Without the pre-training stage, how would the CSMOS compare to the results shown in Table 3?
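To make this concrete, here is a minimal sketch of the alternative I am asking about: mean-pooling a frozen wav2vec 2.0 into a global style embedding, with no fine-tuning. The checkpoint name and the pooling choice are my own assumptions, not the paper's setup.

```python
import torch
from transformers import Wav2Vec2Model

# Frozen, off-the-shelf wav2vec 2.0 (assumed checkpoint, not the paper's).
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

@torch.no_grad()
def frozen_global_embedding(wav: torch.Tensor) -> torch.Tensor:
    """wav: (batch, num_samples) raw 16 kHz audio."""
    hidden = model(wav).last_hidden_state  # (batch, frames, 768)
    return hidden.mean(dim=1)              # time-average -> (batch, 768) global embedding
```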

I also have a question about the VQ codebook. As I understand it, this work uses a single (1-way) codebook with 128 tokens. As mentioned in the paper, VQ is prone to index collapse. Have you considered using a Gaussian-based VAE or a plain autoencoder-style bottleneck instead?
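For reference, this is the single-codebook VQ bottleneck as I understand it, written in standard VQ-VAE form with straight-through gradients; the feature dimension and the commitment weight are my assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_tokens: int = 128, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_tokens, dim)

    def forward(self, z: torch.Tensor):
        """z: (batch, frames, dim) continuous style features."""
        # Nearest codebook entry per frame; index collapse means only a
        # few of the 128 indices ever win this argmin.
        dist = torch.cdist(z, self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1))
        idx = dist.argmin(dim=-1)            # (batch, frames)
        z_q = self.codebook(idx)             # (batch, frames, dim)
        # Standard VQ-VAE losses: codebook term + commitment term.
        vq_loss = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z, z_q.detach())
        # Straight-through estimator so gradients reach the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, idx, vq_loss
```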

Finally, could you explain the role of the shuffle operation? Does it work like entire-channel dropout, or does it have other advantages over dropout?
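To make the comparison concrete, here are the two perturbations as I picture them; my reading of shuffle as a random permutation along the time axis is an assumption on my part.

```python
import torch
import torch.nn as nn

def temporal_shuffle(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, channels, frames). Randomly permute frames:
    temporal order is destroyed, but every value is kept."""
    perm = torch.randperm(x.size(-1), device=x.device)
    return x[..., perm]

# Entire-channel dropout instead zeroes whole channels, losing their values.
channel_dropout = nn.Dropout1d(p=0.5)  # expects (batch, channels, frames)
```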

Thank you in advance for your time and consideration.

Rongjiehuang commented 1 year ago

Hi, sorry for the late reply; I have been busy finishing up some work over the past few weeks :)

The fine-tuned global embedding is used as the condition; directly using the raw wav2vec 2.0 features for training should harm the style similarity.

As for the VQ codebook, we did not try a Gaussian-based VQ-VAE (i.e., adding a regularization loss before quantization), but it may help stabilize model training.

The shuffle operation is similar to a style-aware dropout operation, and I expect the two to have similar performance.
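For illustration, a regularization of that kind could look roughly like the sketch below; the layer names, dimensions, and KL weighting are illustrative rather than anything we implemented.

```python
import torch
import torch.nn as nn

class GaussianPreVQ(nn.Module):
    """Reparameterized Gaussian bottleneck applied before the codebook lookup."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.to_mu = nn.Linear(dim, dim)
        self.to_logvar = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor):
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()        # sample via reparameterization
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()  # KL toward N(0, I)
        return z, kl  # feed z into the VQ layer; add beta * kl to the training loss
```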