thanhlong1997 opened 1 year ago
Hi, thank you for the great paper and repo. I got confused while debugging your repo alongside your Interspeech 2023 paper. The paper says the speaker encoder is jointly trained with the VITS model, but when I inspect the code I cannot find where that joint training is implemented. As far as I can tell, the code uses the speaker encoder as a layer inside the VITS model. Could you clarify this for me? Thank you!

Thank you for visiting our work.
In the paper, we used the expression 'jointly trained' as the opposite of 'pre-trained': the speaker encoder is trained from scratch together with the entire model. And, as you mentioned, the speaker encoder is part of our entire model.
Hope this clarifies your question.
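To illustrate the distinction, here is a minimal PyTorch sketch (hypothetical names, not the actual repo code) of what 'jointly trained' means in this sense: the speaker encoder is an ordinary submodule, so the single TTS loss backpropagates through it and trains it from scratch, and no separate speaker loss is combined.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Hypothetical stand-in for the speaker encoder layer."""
    def __init__(self, mel_dim=80, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim, emb_dim),
            nn.ReLU(),
            nn.Linear(emb_dim, emb_dim),
        )

    def forward(self, mel):               # mel: (batch, frames, mel_dim)
        return self.net(mel).mean(dim=1)  # utterance-level embedding

class TTSModel(nn.Module):
    """Sketch of a VITS-like model that owns the speaker encoder as a layer."""
    def __init__(self):
        super().__init__()
        self.speaker_encoder = SpeakerEncoder()  # randomly initialized, not pre-trained
        self.decoder = nn.Linear(256, 80)        # stand-in for the synthesis stack

    def forward(self, ref_mel):
        spk_emb = self.speaker_encoder(ref_mel)
        return self.decoder(spk_emb)

model = TTSModel()
# One optimizer over *all* parameters and a single loss: gradients from the
# TTS loss flow into speaker_encoder, so it is trained jointly, from scratch.
opt = torch.optim.Adam(model.parameters(), lr=2e-4)

ref_mel = torch.randn(2, 100, 80)  # dummy reference mel-spectrograms
target = torch.randn(2, 80)        # dummy synthesis target
loss = nn.functional.mse_loss(model(ref_mel), target)
loss.backward()                    # reaches the speaker encoder's weights
opt.step()
```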
Oh, thank you for clarifying that. I had thought 'jointly trained' meant that, alongside training the TTS model, you also train the speaker encoder network on its own objective and combine the losses. I am training your model on Vietnamese. Can I keep the issue open while the model has not converged yet, so I can ask your advice during that process? Thank you!