erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but it supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and wav file maintenance. It can also be used with 3rd-party software via JSON calls.
GNU Affero General Public License v3.0

Possible to run the models entirely on CPU+RAM or the 2nd GPU? #252

Closed by MotherSoraka 1 month ago

MotherSoraka commented 1 month ago

I don't think LowVRAM is quite cutting it for me. I have a 12700K and a spare GTX 1050 with 2GB of VRAM. Is it possible to run the models (XTTS 2.0.3) entirely on either my 1050 or my CPU?

And while you're here: is it possible to not hide the real-time text streaming, i.e. let the text stream normally and only attach the voice file once it's done?

Insane project btw, so much work, so much Wow.

erew123 commented 1 month ago

Hi @MotherSoraka, a 2GB GPU isn't enough to squeeze in the model plus the processing overhead without spilling over into system RAM (on Windows) or possibly failing outright and crashing (Linux).
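To see the numbers for yourself, here is a minimal sketch (assuming PyTorch with CUDA support is installed) that prints free versus total VRAM for each visible GPU; on a 2GB card, the free figure leaves little headroom for the model plus inference buffers:

```python
# Sketch: report per-GPU memory so you can judge whether the model will fit.
# Assumes PyTorch with CUDA support; indices follow torch's device ordering.
import torch

if torch.cuda.is_available():
    for idx in range(torch.cuda.device_count()):
        free_bytes, total_bytes = torch.cuda.mem_get_info(idx)
        print(f"GPU {idx} ({torch.cuda.get_device_name(idx)}): "
              f"{free_bytes / 1024**3:.2f} GiB free / "
              f"{total_bytes / 1024**3:.2f} GiB total")
else:
    print("No CUDA device visible; generation would fall back to CPU.")
```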

If you want to test CPU-only generation without removing your GPU, you can edit the XTTS model engine script, e.g. https://github.com/erew123/alltalk_tts/blob/alltalkbeta/system/tts_engines/xtts/model_engine.py

You should disable LowVRAM before doing this, or you can expect some strange happenings.

So you would change:

self.device = "cuda" if torch.cuda.is_available() else "cpu"

to

self.device = "cpu" if torch.cuda.is_available() else "cpu"

That will force it to stay on CPU no matter what, though I can't say whether it will or won't work. But you cannot use the LowVRAM setting (and possibly DeepSpeed too) with that.
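If you wanted to target the second GPU instead, here is a hedged, runnable sketch of the device-selection variants (in model_engine.py the attribute is `self.device`; the `cuda:1` index assumes the spare 1050 is enumerated as device 1 on your system, which is not guaranteed):

```python
# Sketch: device-selection variants for the line edited in model_engine.py.
# "cuda:1" assumes the spare GPU is enumerated as device 1; CUDA may order
# devices differently from nvidia-smi, so verify the index on your system.
import torch

# Original: use the first GPU if any CUDA device is visible, else CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Variant A: force CPU unconditionally (what the edit above does).
device = "cpu"

# Variant B: prefer a second GPU if one exists, otherwise fall back to CPU.
device = "cuda:1" if torch.cuda.device_count() > 1 else "cpu"

print(f"Selected device: {device}")
```

Alternatively, setting `CUDA_VISIBLE_DEVICES=1` in the environment before launching would make only the second GPU visible, so plain `"cuda"` resolves to it without editing the script.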

RE "Is it possible to not hide the real-time text streaming, Let the Text to stream normally and only attach the voice file when its done?"

I'm not sure I'm interpreting your question correctly on this one, so you may have to rephrase it. As far as TTS generation goes, the XTTS AI model needs a starting wav audio sample in order to clone/copy a voice. As for output, Coqui's scripts demand to be interacted with in the way I've interacted with them. So for example, although streaming generation requires a wav sample as input, it doesn't actually generate a wav file that is saved to disk; the output is a wav stream.
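For completeness, here is a hypothetical client-side sketch of the "only attach the file when it's done" behaviour: it buffers a streamed wav response in memory and writes the file only after the stream ends. The URL, port, and parameters are placeholders for illustration, not AllTalk's documented API.

```python
# Hypothetical sketch: consume a streaming TTS response and write the wav
# file to disk only once the stream has finished. The endpoint URL and the
# "text" parameter are placeholders, not AllTalk's actual API.
import requests

def save_stream_when_done(url: str, text: str, out_path: str) -> None:
    chunks = []
    with requests.get(url, params={"text": text}, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=8192):  # audio arrives as a stream
            if chunk:
                chunks.append(chunk)
    # Only now, after the stream has ended, is the complete file written out.
    with open(out_path, "wb") as f:
        f.write(b"".join(chunks))

save_stream_when_done("http://127.0.0.1:7851/tts-stream", "Hello world", "output.wav")
```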

Not sure if that does or doesn't answer that part of your question.

Thanks