erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but it supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and wav file maintenance. It can also be used with 3rd-party software via JSON calls.
GNU Affero General Public License v3.0

Ability to stop generation via API. #193

Closed: RenNagasaki closed this issue 7 months ago

RenNagasaki commented 7 months ago

Is your feature request related to a problem? Please describe. I'm using this tool to generate text for a game on the fly. But when someone skips a dialog, I want to be able to stop the generation so the next dialog can start generating.

Describe the solution you'd like An API call to tell it to stop.

Describe alternatives you've considered At the moment I'll be using the "Ready" endpoint, but that halts all voice output until the skipped text is done.

erew123 commented 7 months ago

Hi @RenNagasaki

This has been asked for already, quite recently, though I don't think I added it to the feature request list.

Streaming requests I can probably cancel/pause/stop, though I've yet to test how that affects the audio it's generating. It's possible it may cause a bad end-of-file and send an error in the stream, which of course could be an issue.

For standard generation requests, I don't think it can be done, because the entire block of text is already in the CUDA tensors and well out of AllTalk's ability to interact with. That said, it may be possible to interrupt a narrated TTS generation, as portions of narrated text are sent over for generation one by one as the text is split into character or narrator parts.

Coming back to your other ticket, re multiple streaming requests, you may recall I asked if you wanted a queue system OR parallel generation. I'm not sure what scenario you are planning on, but let's say you have 3x different users accessing 1x installation of AllTalk: the problem with sending a cancel request is differentiating which user's request we are cancelling and then, of course, where in the code that request is, which could get very messy and complicated. If it's just 1x person using 1x copy, that should be easier.

If you could give me a bit more detail on your use scenarios, I'll have a scratch of my head and think about whether something can be done as a solution.

Thanks

RenNagasaki commented 7 months ago

@erew123 Yeah, that would be for the best.

My goal is to give the player the ability to generate streaming audio on the fly. Meaning if chat or dialog appears, it starts generating (only one at a time; dialog overrules chat). If the user skips a dialog, the next one should start generating and the previous one should stop.

The optimal way would be to have each user run their own local AllTalk instance, but since it needs an Nvidia GPU, which not everyone has, that seems unfeasible (correct me if I'm wrong). And CPU is far too slow for realtime generation.

I tried Piper TTS beforehand, which is darn fast (a Raspberry Pi can generate in realtime), but the voice training didn't work at all, and it's much more annoying to work with.

Yours is so lovingly plug & play. 😄

Kind regards.

erew123 commented 7 months ago

@RenNagasaki

Ok, 1x person, 1x GPU, w/ streaming generation... yeah, that should be possible, though as I say, that's in theory, with testing required to see how it responds. There is another add-in that might be an alternative way around it, should my idea not work.

I should be adding Piper, VITS models and, in theory, any other TTS engine out there; at least, I'm trying to build AT to have a very simple "install and work with this TTS engine" capability. (I won't just install all TTS engines as part of the setup routine.) Not sure I'll get to setting up finetuning for them all from the word go, though.

I'm not sure how to measure exactly how far I am into the code of the next big release, but I've tackled some of the worst issues already, so if I can keep up a decent pace, I may have something for people to try in the next few weeks (I don't want to over-promise). I'm sure you know how it is: some bits of code go super fast and then you're stuck on one strange problem for hours.

It's in the list of things though, so I'll give it a go and let you know when it's up.

Thanks

RenNagasaki commented 7 months ago

Thanks for responding, @erew123. Just to clarify: at the moment AllTalk only works in realtime (streaming while it's still generating) with an Nvidia GPU, and AMD doesn't work, correct?

erew123 commented 7 months ago

One person said it does! The whole discussion is here:

https://github.com/erew123/alltalk_tts/discussions/132

I've been waiting/hoping someone else with an AMD card would try it as I don't have one. Certainly it would work perfectly with ZLUDA installed, but that person claims it works fine without (on some AMD cards).

Obviously, other TTS engines that I install/set up may work fine, e.g. Piper could be a fallback for AMD cards, etc.

RenNagasaki commented 7 months ago

OHHHHHH, I'm on my way to try! Have a 6800XT in my server. 😄

erew123 commented 7 months ago

I absolutely have no idea if the ROCm libs are just emulating CUDA. If they don't, then AllTalk (or anything in Python) would just go "no, no CUDA here, using CPU". Basically, it's this line of code that decides if the whole script will work on CUDA or CPU:

device = "cuda" if torch.cuda.is_available() else "cpu"

On the other hand, ZLUDA does emulate CUDA on AMD cards, so the Python code should, on an AMD card with ZLUDA installed, say "Hey, there's CUDA, I'll use that card".

I've never been able to test it, and without an AMD card it's not been easy for me to code anything for AMD (or Mac Metal, which is a very mixed bag too).

erew123 commented 7 months ago

If you do have any success or find out anything, you're welcome to drop it back on the discussions post (might as well keep it all in one place) https://github.com/erew123/alltalk_tts/discussions/132

RenNagasaki commented 7 months ago

Sure, will do that!

RenNagasaki commented 7 months ago

@erew123 Could you maybe (I know I'm asking much) implement the stopping of streaming generation, as long as it's not too big of a change on your side? Or, as long as it's not too complicated, tell me how I can do that locally?

That's one of my biggest stoppers at the moment. 😢

Sorry if I sound rude, still struggling with friendly English. 😅

I'm no Python developer at all, otherwise I'd do that myself.

erew123 commented 7 months ago

It will be a decent enough chunk of code, as there will have to be a queue and buffer. Unfortunately, that won't be as simple as dropping in a few lines of code and making a simple API endpoint... sadly.

Obviously I'm working on v2 of AllTalk and all these sorts of new features/code. I'd rather spend my effort trying to implement it in the new v2, and it may or may not be simply transportable onto v1.9.

I'm almost at a stage where I've gotten one chunk of coding done, so I will be able to move on to the next thing and can try getting the change to streaming working. If I can get it working while you are working on your own code, would you be ok if I gave it to you in a very BETA version of AllTalk v2? (It will be fully compatible with 1.9, but it is in no way production ready.) If I can get it working, though, it would allow you to get on with your own coding/testing.

No guarantees though... It could end up being complicated for reasons I've not yet considered.

RenNagasaki commented 7 months ago

For sure, that's far more than I could've hoped for!

Thanks for all the work you put into this tool!

erew123 commented 7 months ago

Well, hopefully I'll have this current block of code done soon, and then I can take a look at streaming options over the next couple of days... hopefully. Will let you know.

erew123 commented 7 months ago

I have some potentially good news for you....

[screenshot: console output showing a streaming generation being stopped part-way through]

I have found a way to stop the streaming generation that doesn't cause any major fallout. The wav that it sends over is still a complete wav (as far as it's gotten into generating it), nothing screams or shouts about errors, and it's managed to be simpler than I thought.

That said, a word of note: if you look at my screenshot, the text I sent was about 120 chunks long and I stopped it generating at chunk 19/20, which stops the generation process. But if your application is playing chunk 10 at that moment, it will continue playing until it has reached chunk 20 (the end of the generated wav), OR you can press Stop on your player (or should I say programmatically send a STOP to your player).

So it should provide some of what you want... generating and managing a queue system is a separate thing, of course. I think I will create a queue setup that has 2x options... on a new text generation request being sent, it will:

A) Stop the current streaming generation (as above) and then start generating the new text that was sent over.
OR
B) Queue up the new text and then generate it once the current generation has finished.

Before I can send you some code (which I think I should be able to do in 1.9 of AllTalk, with it being pretty simple), I just need to have a little think as to how I implement the endpoint.

So give me a little more time on this to wrap my head around it, now that I have some working code ideas.

Thanks

erew123 commented 7 months ago

Ok, it's up there.

Because it was simple enough to do, I've dumped it into v1.9 and just updated the code so you can git pull. So hit the endpoint with a PUT:

curl -X PUT http://localhost:7851/api/stop-generation

That will stop the generation occurring, however far it's gotten into it. So within your own code, assuming you have something playing back at the moment while it's generating the stream, you would:

1) Send a curl -X PUT http://localhost:7851/api/stop-generation to that endpoint, which stops the stream generating any further.
2) Assuming you want the current audio to stop, programmatically send a STOP to your audio player, or maybe a pause, audioPlayer.pause(); (however your code handles that).
3) Send off the new TTS generation request as normal and it should start generating.
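
For illustration, here's a minimal Python sketch of that sequence using the requests library. The stop-generation endpoint is the one described above; audio_player and send_next_tts_request are stand-ins for whatever your own client already uses:

import requests

ALLTALK_URL = "http://localhost:7851"

def skip_current_dialog(audio_player, next_text):
    # 1) Tell AllTalk to stop generating the current stream.
    requests.put(f"{ALLTALK_URL}/api/stop-generation")
    # 2) Stop (or pause) whatever is currently playing on your side.
    audio_player.stop()
    # 3) Send the next TTS generation request as normal (your existing call).
    send_next_tts_request(next_text)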

Just so you understand what is happening in AT, when you send this:

curl -X PUT http://localhost:7851/api/stop-generation

It's switching stop_generation = False to stop_generation = True.

When AT is in its for loop where it sends chunks of text to the XTTS AI model, it checks in between each chunk for stop_generation = True. If it finds that, it wipes out the remaining text chunks from the rest of the buffer and cleanly breaks the operation at the last generated WAV chunk. Meaning it's then waiting for the next generation request.
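
In other words, the loop pattern is roughly this (a simplified sketch, not the literal tts_server.py code; generate_wav_chunk is a stand-in for the actual XTTS call):

stop_generation = False  # flipped to True by the /api/stop-generation endpoint

def generate_wav_chunk(chunk):
    ...  # stand-in for sending one text chunk to the XTTS model

def stream_tts(text_chunks):
    for chunk in text_chunks:
        if stop_generation:
            # wipe the remaining chunks and finish cleanly at the last generated wav chunk
            break
        yield generate_wav_chunk(chunk)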

Obviously I have not given this a ton of testing, but across 20-ish tests it seems to work ok, though I'm not going to document this yet for general availability.

Now, it's possible that as I get further along coding other things, I may change the endpoint name OR even the call to it. And, as mentioned, I intend to add other code that:

on a new text generation request being sent it will:

A) Stop the current streaming generation (as above) and then start generating the new text that was sent over.
OR
B) Queue up the new text and then generate it once the current generation has finished.

So it may be that in future you have to change the endpoint name, OR indeed you could change the flag in AllTalk to perform A (no need to send a stop command, just send off your next request). But I'll have to build the queue system for that and some extra logic, along with finishing the settings endpoint to allow you to manipulate all the settings within AllTalk remotely (all planned for v2).

That code change should at least give you what you need though :)

RenNagasaki commented 7 months ago

Ohhh, that sounds awesome! Thanks for being that fast! ❤️

RenNagasaki commented 7 months ago

Works like a charm!

RenNagasaki commented 7 months ago

@erew123 There is one small problem though. It could happen that I set the stop flag a tad too late (directly after it finished generating); the flag then stays set for the next generation, which stops that one instantly. So if possible, could you change it so that the flag can only be set while it is actually generating? Or gets reset at the start of another generation.

erew123 commented 7 months ago

@RenNagasaki I was just having a sit down/break and thinking through all of these logic issues, and that idea did come to mind. I'll need to introduce a lock at the start of generation which unlocks at the end of generation. We can then check that lock on a new generation request and, based on the behaviour of:

A) Stop the current streaming generation (as above) and then start generating the new text that was sent over.
OR
B) Queue up the new text and then generate it once the current generation has finished.

we can then decide how we handle that scenario: queue it, or undo the lock and kill the process, etc.
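
Purely as an illustration of that decision (none of this exists in AllTalk yet, and the names beyond the two flags are made up), the logic on a new request might look something like:

tts_generation_lock = False   # True while a stream is being generated
tts_stop_generation = False   # flipped by /api/stop-generation
pending_queue = []

def start_generation(text):
    ...  # stand-in for kicking off the actual TTS generation

def handle_new_request(text, mode="A"):
    global tts_stop_generation
    if tts_generation_lock:            # something is still generating
        if mode == "A":                # option A: cut the running stream short
            tts_stop_generation = True
        else:                          # option B: queue it until the current one finishes
            pending_queue.append(text)
            return
    start_generation(text)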

I will sort it sometime soon. I was just trying to give you something that would at least get you going.

If you temporarily want to add an endpoint just to unflip it (while you are coding, should it get set the wrong way)

@app.put("/api/start-generation")
async def start_generation_endpoint():
    global stop_generation
    stop_generation = False
    return {"message": "Generation started"}

You can put that in below the stop-generation one in tts_server.py and curl -X PUT http://localhost:7851/api/start-generation

as a for-now workaround.

RenNagasaki commented 7 months ago

Ohh, that works perfectly. Thank you!

erew123 commented 7 months ago

The code now has a lock in it:

    if tts_generation_lock and not tts_stop_generation:
        tts_stop_generation = True

Meaning that tts_generation_lock has to be True (aka, it's actually still generating something) and tts_stop_generation has to be False (meaning it's not been set to True already); only then will it set tts_stop_generation = True, resulting in the next chunk causing a stop to occur.

Otherwise, it will not make any change. Hence, you cannot set the stop flag too late anymore.
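
Put together, the stop endpoint now behaves roughly like this (a simplified sketch; the two flags match the snippet above, and the response messages are illustrative rather than the exact ones AllTalk returns):

@app.put("/api/stop-generation")
async def stop_generation_endpoint():
    global tts_stop_generation
    if tts_generation_lock and not tts_stop_generation:
        tts_stop_generation = True   # only honoured while something is actually generating
        return {"message": "Generation stopping"}
    return {"message": "No generation in progress"}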

So there's no need to use the temporary endpoint anymore. If you want, you could now always fire off a stop request before you send the TTS generation request, and I think that should work ok (I don't think you will hit a situation where a lock gets stuck).

Thanks