erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI; however, it supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and WAV file maintenance. It can also be used with third-party software via JSON calls.
GNU Affero General Public License v3.0

ability to use models other than xtts? #99

Closed 0xYc0d0ne closed 7 months ago

0xYc0d0ne commented 7 months ago

I was wondering if it's possible to use another model, like StyleTTS, with AllTalk instead of the default Coqui XTTS model, since there are probably better models out there for voice cloning...

erew123 commented 7 months ago

Hi @0xYc0d0ne

Not currently, no. It's something I'm considering; however, there will be a chunk of code to rewrite to make it integrate with other models. There is no way to drop another model in place currently.

Thanks

UXVirtual commented 7 months ago

@erew123 I have an experimental fork which is designed to allow use of the English VCTK/VITS model via the API Local option in the AllTalk settings interface. It runs considerably faster on lower-end hardware when using CPU inference, and has the benefit of multiple voices running off a single model if you need a variety of English accents: https://github.com/erew123/alltalk_tts/compare/main...UXVirtual:alltalk_tts:feature/vctk-vits-support

Out of the box, AllTalk only supports single-speaker models, but my fork allows the use of multi-speaker models like VCTK/VITS.

I use this when testing and demonstrating portable offline TTS from my M1 MacBook which doesn't have GPU inference via DeepSpeed for XTTSv2. While the results aren't as good as XTTSv2, it is more stable and avoids various hallucinations in longer text.

While the VCTK/VITS model doesn't explicitly allow quick voice cloning, it does demonstrate using an alternate model that is compatible with the underlying TTS Python library. TTS will automatically download and install the model you define in the tts_model_name property of AllTalk's config instead of XTTSv2. You can try other single-voice models to see if any are suitable.

To make AllTalk use the VCTK/VITS model, you need to edit confignew.json in the AllTalk folder. Change the following property values:
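As a minimal sketch of that edit, assuming the tts_model_name property mentioned above and the standard Coqui TTS model ID for the English VCTK/VITS model; other keys in confignew.json stay as they are:

```json
{
  "tts_model_name": "tts_models/en/vctk/vits"
}
```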

If you're on macOS, you can install the espeak dependency that VCTK/VITS requires using the following brew formula:

brew install espeak

You'll need to install the equivalent package on Windows or Linux if you are using those OSes to run AllTalk (on Debian/Ubuntu, for example, "sudo apt install espeak-ng" should cover it).

When making the request via AllTalk's REST API you need to add a character_speaker request attribute and set it to the voice you want (e.g. p226). See here for the full list.
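A hypothetical sketch of that request payload, in Python. Only the character_speaker attribute (and the example voice p226) comes from the description above; the function name and the other field name are illustrative placeholders, not AllTalk's documented API:

```python
# Sketch: adding the character_speaker attribute when calling AllTalk's
# REST API with the VCTK/VITS fork. Only "character_speaker" is the
# attribute described above; "text_input" is an assumed field name.
def build_tts_payload(text, speaker="p226"):
    """Build the request data for a TTS call, selecting a VCTK speaker."""
    return {
        "text_input": text,            # assumed: the text to synthesize
        "character_speaker": speaker,  # VCTK speaker ID, e.g. "p226"
    }

payload = build_tts_payload("Hello from the VCTK/VITS model.")
print(payload["character_speaker"])  # -> p226
```

The payload would then be POSTed to the local AllTalk endpoint as usual; swapping the speaker value (e.g. "p243") selects a different voice from the same model.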

erew123 commented 7 months ago

@UXVirtual That's interesting! I'll need to have a play at some point and continue my thoughts on how this might be integrated. I've had a debate in my head for a few weeks about how to maybe separate the model loaders out from the rest of AllTalk, allowing the potential to load/use theoretically any model. What you've done, though, is a nice little addition that isn't too heavy on a recode.

I'm going to make a note of this in the Feature requests on the discussion forum... and let my head roll over it a bit more.

Give me a bit of time and I'll get back to you at some point! (if that's OK!)

Thanks

UXVirtual commented 6 months ago

Hey @erew123 no problem! The separation of model loaders sounds like a good approach - I look forward to seeing what integrations can be done there :-)