erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI; however, it supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and wav file maintenance. It can also be used with 3rd-party software via JSON calls.

Any way to debug why AllTalk is not using GPU #235

Closed: LostRuins closed this issue 1 month ago

LostRuins commented 1 month ago

I know you previously mentioned that Colab is not an officially supported environment. However, I did get it to work on Colab and I can generate audio just fine. I'm just puzzled why the GPU does not seem to be utilized. I did follow the troubleshooting guide to clear the Pip cache and Custom Python Environment, but it still does not work, although PyTorch with CUDA is clearly installed.

[screenshot]

Here is my one-click Colab script; just paste this into a blank notebook and AllTalk will be ready to use with a Cloudflare tunnel.

# Grab AllTalk and make the setup script executable
!git clone https://github.com/erew123/alltalk_tts
!chmod +x './alltalk_tts/atsetup.sh'
# Use expect to drive the interactive setup menu unattended
!apt-get install -y expect
!expect -c 'spawn ./alltalk_tts/atsetup.sh; expect "ALLTALK LINUX SETUP UTILITY"; send "2\r"; expect "ALLTALK STANDALONE APPLICATION SETUP"; send "5\r\r"; expect "ALLTALK STANDALONE APPLICATION SETUP"; send "4\r\r"; expect "ALLTALK STANDALONE APPLICATION SETUP"; send "1\r";  expect "Press any key to continue"; send "\r9\r"; interact'
# Fetch cloudflared for the tunnel; remove the Docker marker file
# (presumably so nothing mistakes the session for a container)
!curl -L https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 -o 'cloudflared-linux-amd64'
!chmod +x 'cloudflared-linux-amd64'
!rm -f /.dockerenv
# Start the tunnel in the background and print the assigned URL
!nohup ./cloudflared-linux-amd64 tunnel --url http://localhost:7851 &
!sleep 10
!cat nohup.out
# Start AllTalk
!chmod +x './alltalk_tts/start_alltalk.sh'
!cd alltalk_tts && ./start_alltalk.sh

Let me know if you have any ideas.

erew123 commented 1 month ago

Hi

I did start testing Colab a while back, e.g.

#@markdown #####You can enable gdrive so that you can drag and drop samples directly through your Google Drive via the `drive/MyDrive` path
mount_gdrive = False #@param{type:"boolean"}

if mount_gdrive:
  from google.colab import drive
  drive.mount('/content/drive')

print("*******************************************************************")
print("** Installing server requirements. This will take 5-10 minutes ****")
print("*******************************************************************")
!apt install libasound2-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg libaio-dev > '/dev/null' 2>&1
print("************************")
print("*** Cloning AllTalk ****")
print("************************\n")
!git clone -b apirework https://github.com/erew123/alltalk_tts
print("\n********************************")
print("*** Installing Requirements ****")
print("********************************\n")
!pip install --no-deps --quiet -r /content/alltalk_tts/system/requirements/requirements_googlecolab.txt
print("\n*****************************")
print("*** Installing DeepSpeed ****")
print("*****************************\n")
!pip install deepspeed
print("\n******************************")
print("*** Installing Cloudflare ****")
print("******************************\n")
# Install cloudflare
!wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb > '/dev/null' 2>&1
!apt install ./cloudflared-linux-amd64.deb aria2 > '/dev/null' 2>&1
!rm cloudflared-linux-amd64.deb > '/dev/null' 2>&1
!python -m spacy download en_core_web_md
print("\n*************************************")
print("*** Downloading the Base TTS model ****")
print("***************************************\n")
!python /content/alltalk_tts/modeldownload.py
print("************************************")
print("** Server requirements installed ***")
print("*** Please proceed to next step ****")
print("************************************\n")

but I found there are issues with redirects over a tunnel... so even if you get it loaded (which I did), there are still issues.

I had to write some extra detection code for AllTalk's start-up, as I'm pretty sure it needs to import tensorflow to run the CUDA bits on GPU (as I loosely recall), but I don't think those are in the main bits of the AllTalk "main" code...

This is part of the reason I am working on V2 of AllTalk, for which I should have a BETA out in the next few days. See here for details: https://github.com/erew123/alltalk_tts/discussions/211

And that should resolve the issues. As I'm still deep in code, I haven't tested/set anything up on Google Colab, but if all goes well I should be able to start testing that next week. I'm in code tidy-up and documentation, with a few other bits of code to re-write... getting close though. In theory, when I can finally test on Colab, I already have the workbook built, so it should mostly be a re-point at the correct GitHub branch, a download, and probably a few minor changes, then all should be good (famous last words).

So you may want to hang back a few days and hopefully all will be resolved...

LostRuins commented 1 month ago

Sounds good. I was looking for an alternative Colab since the xtts one stopped working for some reason.

erew123 commented 1 month ago

Lots of the Coqui TTS requirements are falling behind now, to be honest. I have got new versions of the XTTS engine though, and I'm making sure the new version of AllTalk will work on Python 3.11, possibly 3.12 too. Plus obviously there will be 4x TTS engines from the word go, with more to come, so AllTalk is as lightweight or heavyweight on system requirements as you want.

The API is massively extended, but you can use it as simply or as complexly as you wish (I'll be documenting it all before sending out a beta). And of course, there's a lot more to AllTalk now, so it should be pretty damn flexible for most people's needs.

I'll drop you an update when the beta is out if you like!

LostRuins commented 1 month ago

Yeah, their main mistake was not pinning dependency versions. For example, torchaudio 18.0 requires PyTorch 2.3.0, which supports a different set of cards... dependency hell.

erew123 commented 1 month ago

Hah, no need to tell me about that... I think I spent about 11 hours yesterday ensuring the requirements files would install everything correctly across Windows & Linux for 5x TTS engines (the 4x main ones and RVC), along with DeepSpeed builds. And dealing with the annoying situation where Gradio wants one version of something and something else wants another. Painful, but resolved.

erew123 commented 1 month ago

@LostRuins I've finally posted the BETA out: AllTalk v2 BETA Download Details & Discussion

I've been at it non-stop for days now. I'm intending to just have a bit of a mental break/relax, and I'll give Colab a go in a couple of days' time.

LostRuins commented 1 month ago

Cool, looking forward to it!

erew123 commented 1 month ago

Hi @LostRuins

As I just discussed on that other ticket/issue, I've got the Google Colab 90% working. If you look in the AllTalk beta here, there is a "googlecolab.ipynb".

In short, you can upload the ipynb to your Google Colab area and run up AllTalk. On free Google Colab, it's about a 10-15 minute install time. When it starts up the 1st time, you will just have to wait the 60 seconds for it to time out and auto-download a model (in this case it will be a Piper model, and you can download other voices/TTS engine models in the interface).

You will get 2x tunnel addresses. One is for the API calls and the other is for the Gradio interface.

Typically when running locally/on a LAN, this would be IPaddress:port

[screenshot]

On Google Colab it's going to be port 443 on the tunnels:

[screenshot]

In theory, you should be able to make a relatively simple change to the current API calls to AllTalk v2 to get it working. Previously, you would get a response like this:

"output_file_url":"http://127.0.0.1:7851/audio/myoutputfile_1704141936.wav","output_cache_url":"http://127.0.0.1:7851/audiocache/myoutputfile_1704141936.wav"

but now AllTalk v2 does not prepend the protocol and port:

"output_file_url":"/audio/myoutputfile_1704141936.wav","output_cache_url":"/audiocache/myoutputfile_1704141936.wav"

So at the simplest level, Kobold would need to prepend whatever IPaddress:Port is set in the Kobold interface as the API address of AllTalk, e.g.

http://127.0.0.1:7851/audio/myoutputfile_1704141936.wav

https://douglas-brand-electronic-recommendation.trycloudflare.com/audio/myoutputfile_1704141936.wav

If you can change that behaviour in the existing Kobold code, it should just start working (I believe).
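For illustration, the change is basically this (a rough Python sketch of the idea only; Kobold itself is JavaScript, and the variable and function names here are made up):

from urllib.parse import urljoin

# Whatever AllTalk base URL the user has configured in the Kobold
# interface, e.g. a local address or a trycloudflare tunnel address.
ALLTALK_BASE = "http://127.0.0.1:7851"

def resolve_audio_url(returned_url: str) -> str:
    # urljoin prepends the base to v2's relative paths but leaves
    # v1's already-absolute URLs untouched, so it handles both formats.
    return urljoin(ALLTALK_BASE, returned_url)

print(resolve_audio_url("/audio/myoutputfile_1704141936.wav"))
# -> http://127.0.0.1:7851/audio/myoutputfile_1704141936.wav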

If you are willing to give it a go, you may want to stick with Piper as the TTS engine while testing, as it will be quite quick (comparatively).

Also, RVC voices & transcoding don't yet work on Colab, i.e. they may soft-lock or error. If you use an XTTS model and enable DeepSpeed in the engine settings, it will compile DeepSpeed on the 1st generation, which takes about 110 seconds; after that, DeepSpeed should work. I've written a keepalive loop, which I think will keep Google Colab from shutting down, but I've not had hours of testing time yet.
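(For reference, the keepalive idea is roughly the following; this is a generic Python illustration, not AllTalk's actual code, and whether polling alone is enough to stop Colab idling out is exactly the part that still needs testing.)

import threading, time, urllib.request

def keepalive(url="http://localhost:7851/", interval=300):
    # Poll the local AllTalk server every few minutes so the
    # session keeps seeing activity.
    while True:
        try:
            urllib.request.urlopen(url, timeout=10).read(64)
        except Exception:
            pass  # server busy or still starting; try again next round
        time.sleep(interval)

threading.Thread(target=keepalive, daemon=True).start()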

The API suite is documented in the Gradio interface, and I am happy to answer questions or dive in with the code on Kobold. However, I have to admit I am absolutely bogged down with people messaging at the moment. Just dealing with replying to mails is a couple of hours of my day gone before I get a chance to touch anything I want to do. Hoping this will calm down though.

I think that covers most things you may want to know, but get back to me as needed! :)

erew123 commented 4 weeks ago

Hi @LostRuins I took a quick look through klite.embd to see how much I could figure out about the current v1 implementation of AllTalk and what getting v2 to work would take. As mentioned, the code should be exactly the same, for a simple setup at least, bar needing to prepend the http://IPaddress:Port onto the reply from AllTalk, which I can see is stored in Kobold as const default_alltalk_base = "http://localhost:7851"; (lines 3669 and 3782) https://github.com/LostRuins/koboldcpp/blob/1487a4bc812d66654a6c0049bd72e227ba6e4627/klite.embd#L3669C2-L3669C55

I appreciate you are probably busy as hell with Kobold, so I can take a pop at getting this set up if you like. Would you prefer it as a duplication of the AllTalk v1 setup? That way it would be easy to extend the features available through the Kobold interface at some point (if you wanted to add other features).

Thanks

LostRuins commented 4 weeks ago

If the API is not backwards compatible, I would probably add it as a second backend to be selected from the dropdown. If it is backwards compatible (and only the URL is different), then it should already work out of the box right now, since you can select your endpoint URL.

[screenshots: Kobold endpoint URL settings]

Unless something else has changed? I don't really understand what you mean by prepending the IPaddress:port.

Also if you'd like to PR any changes, please use the Kobold Lite repo here: https://github.com/LostRuins/lite.koboldai.net for development, as that's where all the lite development happens.

erew123 commented 4 weeks ago

@LostRuins Sorry for the late reply...

In short, with v2 I send back the URL to reach the audio as /audio/myoutputfile_1704141936.wav, but in v1 I would send back http://127.0.0.1:7851/audio/myoutputfile_1704141936.wav, basically the full path to reach the AllTalk server and its audio endpoint.

I've chopped off the http://127.0.0.1:7851 or http://IPaddress:port part of the URL returned.

There is an option in the v2 settings page to drop back to the v1 standard:

[screenshot of the v2 settings page]

That should work for the most part; however, you would have to set the IP address in there each time you load it up on a tunnel setup, e.g. Google Colab, because the http://IPaddress:port will change to whatever tunnel address Colab gives you.

I had to change the underlying code of AllTalk so that it will bind and respond to any IP address on the machine it's running on. The problem with that is that I can't say which IP or URL you may be connecting to AllTalk on, so I cannot add it to the return path.

The simplest solution in Kobold would be to just add this:

[screenshot]

to the start of the /audio/myoutputfile_1704141936.wav, giving you the full path (I assume default_alltalk_base or saved_alltalk_url + /audio/myoutputfile_1704141936.wav).

Does that make sense? Apologies if not.

LostRuins commented 4 weeks ago

I don't think I am using the /audio endpoint? Kobold sends a request to /api/tts-generate and simply uses the decoded response back directly.

erew123 commented 4 weeks ago

Hi @LostRuins

Ok, so I've run it through, had a test, and figured out exactly what it is and isn't doing. The generation is going to the standard generation endpoint with the "streaming" flag set on, so yes, it's getting back a streaming audio response. This is perfectly fine with XTTS, as it supports streaming.

Other TTS engines installed in AllTalk may or may not support streaming, so in the scenario where they don't support streaming, you get back nothing.

So what I probably should do is add an AllTalk console warning/error message for when someone selects streaming and the TTS engine doesn't support it, explaining the situation to them.

Secondly, I do actually pass out this information about what the running TTS engine does/doesn't support via another API call, but obviously using that would require a re-work of the Kobold setup.

So let me do the warning message on the current v2, which will tell users when they request streaming but aren't using a streaming-capable TTS engine. That aside, I can look to do a full v2 integration at some point.
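(Roughly, the client-side idea would be the sketch below; the capability endpoint, field names, and streaming endpoint here are placeholders I've made up for illustration, so check the documented API suite for the real ones.)

import requests  # third-party HTTP client

BASE = "http://127.0.0.1:7851"  # or the Colab tunnel address

# Placeholder endpoint/field names, not the confirmed AllTalk v2 API
caps = requests.get(f"{BASE}/api/engine-capabilities").json()

if caps.get("streaming_capable"):
    endpoint = f"{BASE}/api/tts-generate-streaming"  # name assumed
else:
    endpoint = f"{BASE}/api/tts-generate"  # standard (non-streaming) generation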

LostRuins commented 4 weeks ago

I thought it was a sync endpoint, since the audio data seems to be returned as a single blob after completion.

erew123 commented 4 weeks ago

Calling the streaming option, with XTTS at least, sets up an audio wav stream, generates a few chunks ahead, then sends those over to whatever is playing them, while it keeps generating more chunks in the background and pushing them over as part of the stream/audio file until it has finished all the TTS (though I do now have a "stop generation at the current chunk" option with streaming). So you still get a whole wav file by the end of it, but it is sent over as soon as X chunks have been generated, before the file is complete and the whole TTS has actually been generated.
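On the receiving side, consuming that stream looks roughly like this (a Python sketch; the endpoint and parameter names are assumptions for illustration, check the API documentation for the real ones):

import requests

url = "http://127.0.0.1:7851/api/tts-generate-streaming"  # name assumed
params = {"text": "Hello there.", "voice": "female_01.wav"}  # illustrative

with requests.get(url, params=params, stream=True) as r:
    r.raise_for_status()
    with open("streamed_output.wav", "wb") as f:
        for chunk in r.iter_content(chunk_size=4096):
            f.write(chunk)  # chunks land while the server is still generating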

erew123 commented 4 weeks ago

The standard generation method I have (not streaming) allows for a Narrator function, so text in asterisks and text in double quotes are recognised as different character voices. But it's not easy/possible to do that with a stream (at least, I haven't found a way to do it yet), because you have to load in different sample audio for the character and narrator voices.
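(The text-splitting part of the idea is simple enough, something like the sketch below; this is not AllTalk's actual narrator code, and the hard part for streaming is swapping voice samples per segment mid-stream, not the splitting itself.)

import re

def split_narrator(text):
    # *asterisked* spans -> narrator voice, "quoted" spans -> character voice;
    # each segment would then be synthesised with its own voice sample.
    segments = []
    for m in re.finditer(r'\*([^*]+)\*|"([^"]+)"', text):
        if m.group(1):
            segments.append(("narrator", m.group(1)))
        else:
            segments.append(("character", m.group(2)))
    return segments

print(split_narrator('*She smiled and said* "Hello there."'))
# -> [('narrator', 'She smiled and said'), ('character', 'Hello there.')]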