Some questions after reading the paper

Hello authors,

I have read your paper and find it to be very promising. I have a few questions about the pretraining stage mentioned in the paper. Specifically, I am wondering how important the finetuning of wav2vec2 for the speaker representation $\mathcal{G}_s$ and the emotion representation $\mathcal{G}_e$ is. Is it possible to directly use the features generated by wav2vec2 for training, or would this significantly harm the style similarity? If we had a model without the pretraining stage, how would the CSMOS compare to the results shown in Table 3?

I also have a question about the VQ codebook. As I understand, a 1-way 128 tokens codebook is used in this work. As mentioned in the paper, VQ is prone to index collapse. Have you considered using a Gaussian-based VAE or a vanilla autoencoder-like bottleneck instead?

Finally, could you explain the function of the shuffle operation? Does it work similarly to an entire channel dropout, or does it have other advantages over dropout?

Thank you in advance for your time and consideration.

Rongjiehuang / GenerSpeech

Some questions after reading the paper #4