Zombobot1 closed 7 months ago
Coqui with the VITS model is going to be the fastest option. If you like that one, you don't need to specify the model (it's the default, since that was the first thing I started with). You can specify a different speaker; to see all the options, run: tts --model_name "tts_models/en/vctk/vits" --list_speaker_idxs. Personally, p335 (female) and p307 (male) were my favorites after generating and listening to all of them.
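If you want to compare a few speakers yourself, here is a minimal sketch that only builds the Coqui tts command lines and prints them, so you can run them by hand. The helper name build_tts_command, the audition sentence, and the output file names are my own assumptions for illustration; the model name and the two speaker IDs are the ones mentioned above.

```python
import shlex

# Model name as quoted in this thread.
MODEL = "tts_models/en/vctk/vits"


def build_tts_command(text, speaker, out_path, model=MODEL):
    """Return the argument list for one Coqui `tts` CLI call (hypothetical helper)."""
    return [
        "tts",
        "--model_name", model,
        "--speaker_idx", speaker,
        "--text", text,
        "--out_path", out_path,
    ]


if __name__ == "__main__":
    # Hypothetical audition sentence; swap in a line from your own book.
    sentence = "This is a short audition sentence."
    for speaker in ("p335", "p307"):  # the two favorites named above
        cmd = build_tts_command(sentence, speaker, f"sample-{speaker}.wav")
        # Print a shell-safe command instead of running it, so nothing is
        # downloaded or synthesized until you decide to.
        print(shlex.join(cmd))
```

Passing the printed lines to a shell (or the list to subprocess.run) should produce one WAV file per speaker, assuming the tts command is installed and on your PATH.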
XTTSv2 tends to sound more human to me, especially if you spend some time fine-tuning (see the "utils" subdirectory for more info on this). Keep in mind that XTTSv2 (--xtts) requires a GPU, and even with a GPU it's likely to run at about real time (so 10 hours of reading takes around 10 hours to generate, at least for me with a 3060 Ti). Compared to VITS it is extremely slow.
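To put the "about real-time" figure into numbers, here is a rough sketch of the arithmetic. It assumes a real-time factor defined as processing time divided by audio duration, so 1.0 means one hour of audio takes one hour to generate; the function name is hypothetical, and the only measured figure is the roughly 1.0 I see for XTTSv2 on my card.

```python
def estimate_synthesis_hours(audio_hours: float, real_time_factor: float) -> float:
    """Estimate wall-clock hours needed to synthesize `audio_hours` of speech.

    real_time_factor = processing time / audio duration (assumed definition).
    """
    if audio_hours < 0 or real_time_factor <= 0:
        raise ValueError("audio_hours must be >= 0 and real_time_factor > 0")
    return audio_hours * real_time_factor


# The case described above: a 10-hour book at roughly real-time speed.
print(estimate_synthesis_hours(10, 1.0))  # 10.0
```

Your own factor will depend on your hardware, so it is worth timing a short chapter first and plugging that ratio in.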
Thanks for using it, I appreciate the feedback, and feel free to ask any other questions!
I have created a repository that, for the time being, contains audio sample files of all available speakers from 'tts_models/en/vctk/vits' speaking the first few sentences of the 'sample.txt' file: https://github.com/martinmildner/coqui-voice-samples
Which other models and speakers should not be missing?
@aedocw Thanks for your elaborate answer. Unfortunately, I don't have a GPU, so I guess XTTSv2 is not for me.
@martinmildner Could you please add XTTSv2 speakers too?
I've heard that StyleTTS2 sounds amazing and is fast on CPU. I have not played with it yet (not enough time!) but I will soon, and I will be following its progress. If it sounds good, I will integrate it as an option once it seems stable and ready for use.
Coqui speakers have been added now as well. Closing this, but please feel free to add something in "discussions" if you want to kick off a conversation :)
I find it a bit confusing to decide which model to use. At first I wanted to use XTTS-v2 because the Coqui team claims it is their best model. However, the xtts parameter requires samples, presumably for voice cloning. I assume the default model is vits. My question: is it possible to use xtts without voice cloning to get better quality than vits? After listening to the samples provided, I think sample-p307-coquiTTS sounds better than sample-shadow-coquiXTTS.
Keep up the great work! Your efforts are making a difference!