IIEleven11 / StyleTTS2FineTune

StyleTTS2 vs XTTSv2 #16

Closed C00reNUT closed 2 months ago

C00reNUT commented 2 months ago

Hello,

I have seen that you have done some experiments finetuning Tortoise, XTTS and StyleTTS2.

May I ask: is the quality of StyleTTS2 finetunes much better than that of XTTSv2 finetunes?

I am mainly asking because I can finetune XTTSv2 on a single 3090 card, which is something I cannot do with StyleTTS2, so it would save me some time and money :)

Thank you!

IIEleven11 commented 2 months ago

XTTSv2 has more potential, but it is easier to mess up. StyleTTS2 is more expensive and cannot be as expressive, but it's also more forgiving of dataset issues.

C00reNUT commented 2 months ago

Thank you for the comparison. I will stick with XTTSv2 then; even if it's less predictable, it still offers a decent level of control with lower temperatures and some high-quality finetuning data. Thank you!

Also, I feel that many recent models sound 'less expressive', like audiobook readings, probably because most of them use Common Voice and LibriVox variations as datasets...
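
For anyone wondering why lower temperatures give more control: here is a minimal sketch, assuming XTTSv2 samples its audio tokens autoregressively from a softmax over scores, as GPT-style models usually do. The function name and the scores are made up for illustration; this is not the Coqui API, just the general idea of temperature scaling.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Dividing by a temperature below 1 widens the gaps between scores.
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens (not real model output).
logits = [2.0, 1.0, 0.5]
for t in (1.0, 0.5, 0.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, more of the probability mass lands on the top-scoring token, so sampling becomes more repeatable, at the cost of some variety in the output.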

IIEleven11 commented 2 months ago

I have made amazing voice models with XTTSv2. I also developed an XTTSv2 model that can whisper. Coqui just took the Tortoise model, sped it up, gave it multilingual capability, and called it XTTS. So in theory it could sound as good as Tortoise. But yeah, good luck!