Closed aedocw closed 1 year ago
This is probably not going to happen. Lots of playing around has yielded weird and inconsistent results. With the same sample, running the same text multiple times, the result was usually good and superior to VITS. HOWEVER, on multiple occasions the text that was read back did not exactly match what was fed in: sometimes entire words were dropped; other times only part of a word was spoken ("streambed" became "stream", for instance).
The other, larger issue is that there's no way to predict how much text you can feed it at a time. And since this is a "one-shot" model, you need to feed the sample along with the new text on every call, meaning you waste a lot of resources on that first pass of cloning the voice each time.
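One way to work around the unpredictable input-length limit would be to split each chapter into sentence-sized chunks before synthesis and feed XTTS one chunk at a time. A minimal sketch — the `chunk_text` helper and the 250-character budget are my assumptions, not anything documented by XTTS:

```python
import re


def chunk_text(text: str, max_chars: int = 250) -> list[str]:
    """Group sentences into chunks no longer than max_chars each.

    Splits on sentence-ending punctuation followed by whitespace, then
    greedily packs sentences into chunks under the character budget.
    A single sentence longer than max_chars becomes its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to `tts_to_file` separately and the resulting audio files concatenated, at the cost of re-cloning the voice per chunk.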
It may be possible to use XTTS with RVC (https://github.com/skshadan/TTS-RVC-API). When I have time, I'll experiment and see whether chaining tools like this could give an even better-sounding reading voice.
Will not be doing this. XTTS seems nice for one-shot short clone attempts, but it's currently not suitable for long material like books.
XTTS v2 is so good, and works awesome. This has been added. It's slow (as expected), but worth the wait. Support for XTTS has been merged.
Now that Coqui has opened use of XTTS for non-commercial use, add support for using that instead of VITS.
Something like:

```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v1", gpu=True)
tts.tts_to_file(
    text=chapters_to_read[i],
    file_path=outputwav,
    speaker_wav="sample.wav",
    language="en",
)
```

Note this requires a speaker sample file (`speaker_wav`) to be included.