cnaigithub / Auto_Tuning_Zeroshot_TTS_and_VC

Official implementation of "Automatic Tuning of Loss Trade-offs without Hyper-parameter Search in End-to-End Zero-Shot Speech Synthesis", Interspeech 2023
MIT License

Missing join train speaker encoder #1

Open thanhlong1997 opened 1 year ago

thanhlong1997 commented 1 year ago

Hi, thank you for the great paper and repo! I ran into some confusion while debugging your repo against the Interspeech 2023 paper. The paper says you jointly train the speaker encoder alongside the VITS model, but when I inspect the code I cannot find where that joint training is implemented. As far as I understand, your code uses the speaker encoder as a layer inside the VITS model. Could you clarify this for me? Thank you!

SeongYeonPark commented 1 year ago

Thank you for visiting our work.

In the paper, we used the expression 'jointly trained' as the opposite of 'pre-trained': the speaker encoder is trained from scratch together with the rest of the model, rather than being pre-trained separately. And as you mentioned, the speaker encoder is part of our entire model.
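For illustration, here is a minimal PyTorch sketch of what this means in practice. The class names (`SpeakerEncoder`, `ZeroShotTTS`) are hypothetical stand-ins, not the repo's actual code: the point is that the speaker encoder is a submodule of the full model, so a single optimizer over `model.parameters()` trains it from scratch with the synthesis losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Hypothetical stand-in: encodes reference mels into a speaker embedding."""
    def __init__(self, mel_dim=80, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(mel_dim, emb_dim, batch_first=True)

    def forward(self, ref_mels):
        # ref_mels: (batch, frames, mel_dim)
        _, (h, _) = self.lstm(ref_mels)
        return F.normalize(h[-1], dim=-1)  # (batch, emb_dim)

class ZeroShotTTS(nn.Module):
    """Hypothetical stand-in for the VITS-based model."""
    def __init__(self):
        super().__init__()
        # The speaker encoder is a layer of the model, not a pre-trained,
        # frozen component; gradients from the synthesis losses flow into it.
        self.speaker_encoder = SpeakerEncoder()
        # ... text encoder, posterior encoder, flow, decoder, etc. ...

    def forward(self, text, ref_mels):
        spk_emb = self.speaker_encoder(ref_mels)
        # condition the rest of the synthesis pipeline on spk_emb
        return spk_emb

model = ZeroShotTTS()
# One optimizer over all parameters: the speaker encoder is updated
# "jointly" with everything else, with no separate speaker-encoder
# training stage or loss-joining step required.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
```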

Hope this clarifies your question.

thanhlong1997 commented 1 year ago

Oh, thank you for clarifying that. I had thought "jointly trained" meant that, alongside training the TTS model, you also train a separate speaker encoder network and combine their losses. I am training your model on Vietnamese. Can I keep this issue open while the model has not yet converged, so I can ask for your advice during that process? Thank you!