Thanks for the awesome work! Since I can't read Chinese, I translated the README to English, and I understood the training process as follows.
It seems there is a two-stage training process, and training is quite involved, especially stage 2.
First stage: train VITS (SynthesizerTrn) with Whisper PPG features, NSF-HiFiGAN, and an external speaker encoder (d-vector).
Second stage (SynthesizerTrnEx): apply GRL and SNAC to prevent speaker information from leaking into the text encoder, and also apply the NaturalSpeech loss (bidirectional loss between prior and posterior).
Is that right? Also, I can't find any usage of SynthesizerTrnEx in this code base (maybe not yet?). Could you explain a bit more about the training process?
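For context on what I mean by GRL: my understanding is that it is a gradient reversal layer, i.e. an identity function in the forward pass whose gradient is negated in the backward pass, so a speaker classifier trained on the text-encoder output pushes that encoder to *remove* speaker information. A minimal PyTorch sketch of that idea (my own illustration, not code from this repo; the names `GradReverse`/`grad_reverse` are made up):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient Reversal Layer: identity in forward, negated (scaled) gradient in backward."""

    @staticmethod
    def forward(ctx, x, scale=1.0):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient sign so the upstream encoder is trained
        # adversarially against the speaker classifier.
        return -ctx.scale * grad_output, None

def grad_reverse(x, scale=1.0):
    return GradReverse.apply(x, scale)

# Toy check: forward is identity, backward flips the sign.
x = torch.ones(3, requires_grad=True)
y = grad_reverse(x, scale=1.0)
y.sum().backward()
print(y)       # same values as x
print(x.grad)  # all -1.0: gradient of sum() is +1, reversed to -1
```

In a stage-2 setup I would expect this to sit between the text encoder and a speaker classification head, with the classifier's cross-entropy loss added to the total loss. Is that roughly how GRL is used here?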