Simultaneously calls to API

mercuryyy commented 10 months ago

Is it possible to run simultaneously calls to API and have both call run concurrently at the same time?

erew123 commented 10 months ago

I have no buffering built in currently and as far as I am aware, it can only generate 1x thing at a time.... though in all honesty, I haven't tested. I've currently set no lock on the API to stop you trying it... meaning, if you send multiple requests simultaneously, there is nothing in the script to say "No, I'm already processing something, so I am not going to accept that request". I suspect it will queue up the request or cancel the current generation and start the new one, but I don't know for sure.

The honest truth is, I don't actually know for sure and its something I was going to look at, at some point.

mercuryyy commented 10 months ago

Thanks i'll run some tests see what happens

erew123 commented 10 months ago

Hi @mercuryyy. Have you found your answer? Would you like me to leave the ticket open?

erew123 commented 10 months ago

Hi @mercuryyy Im going to assume this closed for now, but feel free to re-open if needed. Thanks

GiulioZ94 commented 8 months ago

Simultaneous calls to the API currently mix chunks between requests, resulting in a mixed WAV file. Is there a way to handle simultaneous calls, possibly using a queue or similar method, to avoid this issue?

erew123 commented 8 months ago

Hi @GiulioZ94

I've run no further tests myself on this. Its theoretically possible to build a queue system into the API and Ive not performed any locking currently on the API, so if you dont want to handle queue management within any application your end, I would have to look at this.

What API endpoint are you calling?

GiulioZ94 commented 8 months ago

Hi @erew123

Thanks for helping, I'm using tts-generate-streaming api in get mode

erew123 commented 8 months ago

@GiulioZ94 So let me just confirm. Lets say you have 2x lots of audio to generate, A and B. You:

1) Are wanting AllTalk to generate the first set of streaming audio (A) before it it starts work on the next stream (B), but you would like to send the text of request B to AllTalk while A is in process of being streamed/generated.

2) Are not wanting simultaneous generation of A and B, being requested from 2 different source locations and being generated simultaneously and sent back to simultaneously to those 2 source locations at the same time.

I'm assuming you want 1, but I just want to be sure I've got your request correct.

If so, it will take a bit of thinking about as I suspect because of the single threaded nature of Python, it might require a slightly different start up script with uvicorn running multithreaded and also the python queue system coding in and then ??? amount of testing.

FYI, Im about to sign off for the day, so wont be responding again for a while.

Thanks

GiulioZ94 commented 8 months ago

@erew123, yes, it would be great if it's possible to handle requests simultaneously so 1. If not, at least ensure things don't break and handle requests one at a time.

erew123 commented 8 months ago

@GiulioZ94 Ill make a note of it. Its something I may or may not be able to get to figure out soon ish. I currently mid a major update of lots of things and have quite a decent chunk to work on+test on multiple OS's.

So bear with me on this. I've added it to the Feature requests https://github.com/erew123/alltalk_tts/discussions/74 so its in there as a "to do/investigate"

Thanks

GiulioZ94 commented 8 months ago

@erew123 Sure take your time. Thanks for your work.

Swastik-Mantry commented 5 months ago

Hey @erew123, to solve this issue, I have tried having a pool of multiple models of XTTSv2(about 4 of them) and tried using different models for each request synthesis(Using a queue-like implementation), but simultaneous requests lead to error. [Source of idea]

At times of no errors, the audio got mixed up for simultaneous requests (audio of request 1 got mixed with request 2 or 3, eventually all the speech generated was gibberish garbage) while sometimes I got the following torch errors

"Assertion srcIndex < srcSelectDimSize failed"
"probability tensor contains either inf, nan or element < 0" or
"RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect"
The error mentioned in the source of idea link ( I think there must be some race condition to cause them)

I had also tried using executor.multi-threading / multi-processing to serve the requests i.e synthesis on the XTTSv2 model concurrently and be free of those errors but it didn't work out.

I know you have just worked alot to release the beta version. So, please have a good rest, you really deserve it. Incase you work on this problem, please let me know if you come with a possible solution and your thought process for it. Thanks again.

erew123 commented 5 months ago

@Swastik-Mantry Thanks for info on this. Its certainly one of those problems that Im not sure if it is or isn't achievable in some way. Resource sharing, multi models and ensuring requests all go back to the correct source are potentially unsolvable issues and its highly possible Coqui's scripts wont be capable of supporting it, even if AllTalk can. Its something Ill take a look at, at some point, but I really appreciate having your experience on it! :) Its a nice head start and at least I can discount a couple of possible routes.

gboross commented 5 months ago

Hello everyone,

I’m reaching out to inquire if there have been any updates on the topic you discussed earlier. We are looking to set up an AWS system capable of handling multiple requests simultaneously, without queuing them, provided there are sufficient resources available.

Additionally, I came across these two Coqui variations, but I couldn’t find a solution to our issue there either. Could you possibly assist us with this matter?

https://github.com/idiap/coqui-ai-TTS
https://github.com/daswer123/xtts-api-server?tab=readme-ov-file

Thank you!

erew123 commented 5 months ago

Hi @gboross I've done a bit of investigation into this and there is a limitation of Transformers/CUDA where the Tensor cores do not fully segment requests within one Python instance. AKA, if 2x requests come in to 1x python instance, the data getting sent into the tensor cores from the 2x requests, get jumbled up into 1x block of data and come out as a mess.

I cannot recall all the exact things I looked into at that time, however, there potentially is a way to use a CUDA technology to segregate/track the tensor cores within Python (I believe) but, it will require an entire re-write of the whole Coqui inference scripts to handle that. Its not impossible, but its not just 10 lines of code and 1x script. You are looking at a lot of code to do that.

The alternative is to build a queue/multiplexing system where multiple XTTS engines get loaded by multiple different Python instances, and when one is busy, one of the free ones gets picked/used, which will maintain the segregation between tensor cores and the requests. Again, this is a decent amount of coding, but, it doesnt require re-jigging the Coqui scripts in this circumstance, just a multiplexing queue/tracking system.

gboross commented 5 months ago

Thanks for the detailed explanation! Sounds like the tensor cores are throwing a bit of a party with the data. 😂 Could you possibly work on the second option you mentioned? It sounds like a more feasible approach without needing a complete overhaul of the Coqui scripts.

Of course, we’d be willing to pay for your help if you're open to solving this issue.

Thanks a lot!

erew123 commented 5 months ago

Hi @gboross

It is something I can look at, though I want to be fair and clear that what I have suggested above is all a theory I have about how to make it work. There is probably 6-10 hours of building/testing something as a rough prototype to attempt to prove the theory and figure out how well it can perform. Obviously with multiple requests being sent into a GPU, I have lmited experience as to how well the NVidia will time-slice the GPU in practice. My research said it should 50/50 resources if 2x requests come in, or 33/33/33 resources if 3 simultaneous requests come in, and I assume this would be enough on a reasonable GPU to keep up with something like multiple streaming requests, though I imagine there is a breakpoint at somewhere, depending on the GPU in question e.g. a 4090 is going to outperform a 3060 and each of those hardware options will have a limitation somewhere down the line.

If I did get as far as testing a prototype, I would imagine from that point on, there is a another 40 hours minimum to build/test a working queue system. It would need to handle/set the maximum amount of Python instances you could start (so you could alter it based on different hardware), The queue engine would therefore need to handle anything from 2 to ??? amount of TTS engines being loaded in dynamically (again to handle different hardware scenarios). There would also be a need to handle the situation where all engines are currently in use and what to do in that situation, it may even be that there is a need to look at load balancing across multiple GPU's, or hold in the queue until one engine becomes available again etc, or maybe fall back to a lower quality TTS engine that can be processed on the CPU.

Then of course there will be a need to do something in the interface to manage/configure this system etc.

Finally there will be testing/debugging/possible re-work of bits, or even the possibility that for reasons it may just not work as expected.

All in, it could be bottom end 50 hours work and upper end potentially closer to 100, with no absolute guarantee of this will 100% work the way I have proposed. Its also something I would want to re-research and think over, just to make sure I am nailing the design before I would even touch some code.

I guess I would need to roll it around my head a little more and firm up my thoughts on it before committing to something.

erew123 commented 2 months ago

AllTalk MEM is as far as I have gotten with this so far. It is basically untested, but will start multiple engines and EDIT has a API queue system built in. Works with any TTS engine that AllTalk supports.

I've uploaded it to the beta and you would need to start the python environment and python ts_mem.py to start it.

You can decide how many engines you want to have available to run...... depending on your hardware capabilities.

⚠️ MEM is not intended for production use at this time and there is NO support being offered on MEM. ⚠️

erew123 commented 2 months ago

Queue system built, with management. Image below.

Basic load testing too:

erew123 commented 2 months ago

Please see the updated Multi Engine Manager (MEM). Its about as far as I'm thinking take it for now. You still need to run/configure AllTalk first and whatever engine you set in AllTalk as the default & its engine settings are what MEM will be loading when you start engines.

Here is the commit https://github.com/erew123/alltalk_tts/commit/372aa493d6c7e64dbbb2b6b09cbb8f436a4992c2

You can git pull the AllTalk V2 Beta, start the Python environment and run MEM with python tts_mem.py

I'm reasonably sure there are no additional requirements to install........ reasonably sure.

I've not tested many things:

streaming
engines other than Piper and XTTS
lots of other things (This is just a research demo)

To be super clear, I do not have the equipment nor time to test every scenario, reliability, stability, performance etc. This has been given a basic testing and in theory, it works. I am therefore offering no support on it. It is, what it is.

MEM will pretend to be a standalone AllTalk server, and will respond to these endpoint requests from clients:

Ready Endpoint - http://{ipaddress}:{port}/api/ready Voice Endpoint - http://{ipaddress}:{port}/api/voices RVC Voices Endpoint - http://{ipaddress}:{port}/api/rvcvoices Current Settings Endpoint - http://{ipaddress}:{port}/api/currentsettings TTS Generation Endpoint - http://{ipaddress}:{port}/api/tts-generate OpenAI Compatible Endpoint - http://{ipaddress}:{port}/v1/audio/speech

It is in effect relaying/queuing the requests between engines and the client. The above endpoints are of course documented in the main AllTalk V2 Gradio interface in the API Endpoints & Dev section.

As standard it will run on port 7851 for requests, just like AllTalk does, but you can change the ports if you wish. Each engine loaded will start on its own port number (also customisable).

You can create as many Engine instances as you want (only showing 8 here) and start/stop them as you wish. To change the amount of potential engines you change the `Maximum Engine Instances` in the settings.

You can individually test each engine instance

There are a variety of settings to handle how engines are loaded, the queue is managed, ports for various areas of MEM and there is documentation for these.

There is a Queue viewer built into the Gradio interface

A basic load tester is built in

General basic documentation

Command line output

vlrevolution commented 1 week ago

This is amazing, thank you so much for working on this!

erew123 / alltalk_tts