Questions About VITS Code Modifications and Model Performance

Hi, thanks for your great work! As a learner, I'd like to understand your thought process. May I ask why you decided to modify the original VITS code?

You mentioned 'robust,' but I'm not clear on exactly what it means here. Does it refer to a specific aspect of the model's performance, such as WER (word error rate) or speaking rate?
When you talk about 'speech quality,' are you referring to the audio quality of the generated speech? Is it comparable to what audio quality metrics like PESQ measure?
Regarding the 'expanding the receptive field of the WaveNet Flow module' modification, how did you determine that this change was needed, and in what ways does it improve the quality of the synthesized speech?
I noticed that the original VITS was implemented in PyTorch, but you chose to rewrite some of the code in TensorFlow. What motivated this decision? Were there specific advantages or requirements that led to this change in the tech stack?

NTT123 / light-speed

Questions About VITS Code Modifications and Model Performance #3