NTT123 / light-speed

A modified VITS that uses ground-truth phoneme durations for better robustness
MIT License

Questions About VITS Code Modifications and Model Performance #3

Open TinaChen95 opened 1 year ago

TinaChen95 commented 1 year ago

Hi, thanks for your great work! As a learner, I'm curious about your thought process. May I ask why you decided to modify the original VITS code?

  1. You mentioned 'robust,' but I'm not quite clear on its exact meaning. Does it refer to the model's performance on specific measures, such as WER (Word Error Rate) or speaking rate?

  2. When you talk about 'speech quality,' are you referring to the sound quality of the generated speech? Is it similar to audio quality metrics like PESQ?

  3. Regarding the 'expanding the receptive field of the Wavenet Flow module' modification, how did you analyze the need for this change, and in what ways does it enhance the quality of synthesized speech?

  4. I noticed that the original VITS was trained using PyTorch, but you chose to rewrite some code in TensorFlow. What motivated this decision? Are there specific advantages or requirements that led to this change in the tech stack?