Question: Should I use XTTS-v2?

aedocw / epub2tts

Turn an epub or text file into an audiobook

Apache License 2.0


Question: Should I use XTTS-v2? #101

Closed Zombobot1 closed 7 months ago

Zombobot1 commented 7 months ago

I find it a bit confusing to decide which model to use. At first I wanted to use XTTS-v2 because the Coqui team claims it is their best model. However, the --xtts parameter requires samples, presumably for voice cloning. I assume the default model is VITS. My question: is it possible to use XTTS without voice cloning and still get better quality than VITS? After listening to the samples provided, I think sample-p307-coquiTTS sounds better than sample-shadow-coquiXTTS.

Keep up the great work! Your efforts are making a difference!

aedocw commented 7 months ago

Coqui with the VITS model is going to be the fastest option. If you like that one, you don't need to specify the model (it's the default, since that was the first thing I started with). You can specify a different speaker; to see all the options, run: tts --model_name "tts_models/en/vctk/vits" --list_speaker_idxs. Personally, p335 (female) and p307 (male) were my favorites, after having generated and listened to all of them.
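For anyone who wants to script this, here is a minimal sketch that builds the Coqui `tts` command lines for listing the VITS speakers and for synthesizing a short test clip with one of them. The helper names (`list_speakers_cmd`, `synth_cmd`) and the output file name are made up for illustration and are not part of epub2tts; the sketch only assembles the commands and assumes the `tts` CLI from Coqui TTS is installed if you go on to run them.

```python
import shlex

# Model name taken from the command quoted above.
VITS_MODEL = "tts_models/en/vctk/vits"


def list_speakers_cmd(model=VITS_MODEL):
    """Build the command that lists the speaker IDs of a multi-speaker model."""
    return ["tts", "--model_name", model, "--list_speaker_idxs"]


def synth_cmd(text, speaker, out_path, model=VITS_MODEL):
    """Build a command that reads `text` with one speaker into a WAV file.

    Hypothetical helper: the arguments are only assembled here, not executed.
    """
    return [
        "tts",
        "--model_name", model,
        "--speaker_idx", speaker,
        "--text", text,
        "--out_path", out_path,
    ]


if __name__ == "__main__":
    # Print shell-ready commands; pass the lists to subprocess.run() to execute them.
    print(shlex.join(list_speakers_cmd()))
    for speaker in ("p307", "p335"):
        print(shlex.join(synth_cmd("This is a short test.", speaker, f"sample-{speaker}.wav")))
```

Generating the same sentence with a handful of speakers like this is a quick way to compare voices before committing to a whole book.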

XTTSv2 tends to sound more human to me, especially if you spend some time fine-tuning (see the "utils" subdirectory for more info on this). Keep in mind that XTTSv2 (--xtts) requires a GPU, and even with a GPU it is likely to run at about real-time speed, so 10 hours of audio takes around 10 hours to generate (at least for me, with a 3060 Ti). Compared to VITS, it is extremely slow.
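To make the real-time estimate concrete, here is a small sketch of the arithmetic. The function name and the `realtime_factor` parameter are hypothetical; a factor of 1.0 matches the "about real-time" figure above, and any other value is something you would have to measure on your own hardware.

```python
def synthesis_hours(audio_hours, realtime_factor=1.0):
    """Estimate wall-clock hours needed to generate `audio_hours` of audio.

    realtime_factor = hours of audio produced per hour of compute.
    1.0 means "about real-time"; larger values mean faster synthesis.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_hours / realtime_factor


if __name__ == "__main__":
    # A 10-hour audiobook at roughly real-time speed.
    print(synthesis_hours(10.0))
```

Timing a single chapter first and plugging the measured factor into this function gives a rough idea of how long a whole book will take.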

Thanks for using it, I appreciate the feedback, and feel free to ask any other questions!

martinmildner commented 7 months ago

I have created a repository that, for the time being, contains audio sample files of all available speakers from 'tts_models/en/vctk/vits' speaking the first few sentences of the 'sample.txt' file: https://github.com/martinmildner/coqui-voice-samples

Which other models and speakers should be included?

Zombobot1 commented 7 months ago

@aedocw Thanks for your detailed answer. Unfortunately, I don't have a GPU, so I guess XTTSv2 is not for me.

@martinmildner Could you please add XTTSv2 speakers too?

aedocw commented 7 months ago

I've heard that StyleTTS2 sounds amazing and is fast on CPU. I have not played with it yet (not enough time!), but I will soon, and I will be following its progress. If it sounds good, I will integrate it as an option once it seems stable and ready for use.

aedocw commented 7 months ago

Coqui speakers have been added now as well. Closing this, but please feel free to add something in "discussions" if you want to kick off a conversation :)