erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and WAV file maintenance. It can also be used with third-party software via JSON calls.
GNU Affero General Public License v3.0

Cut off Audio sometimes. #202

Closed RenNagasaki closed 2 months ago

RenNagasaki commented 2 months ago

I feared you'd get bored without an issue from me, so I found a new problem. 🤣 Sometimes the generated audio isn't the whole text; it stops a sentence or two early. After regenerating, it's then fixed. But since I'm using streaming, regenerating isn't exactly an option. 😅

I'm not quite sure where to start here. Just ask if you need any additional info.

To Reproduce: I don't know. Try to generate some audio? 🤣

erew123 commented 2 months ago

@RenNagasaki I'm really not bored. I appreciate your generous offer, but honestly, I'm pretty good thanks! 🤣

This may be an issue with Coqui's scripts, rather than AllTalk.

QUESTION: I'm assuming that when you notice skipped audio, it still shows up in the AllTalk terminal/console as having been generated/sent for generation?

It's a difficult one to address. That said, I am just working on the (stupidly complex) code to allow AllTalk to import/use any TTS engine. I'm currently working with Piper as the first one to import, as it's so different from the way XTTS works. Once I have those two working, setting up other engines should be comparatively much simpler.


So this will allow a situation where, with other engines that can stream audio, we can test whether the problem is in AllTalk or in the TTS engine's own script.
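One low-tech way to compare runs while testing is simply to count the bytes a streaming response delivers for the same text: a cut-off run returns noticeably less audio than a complete one. A minimal sketch in Python; the endpoint path and parameter names here are assumptions for illustration, not confirmed AllTalk API details, so check your install's API docs:

```python
import io
from urllib.parse import urlencode
from urllib.request import urlopen

def count_stream_bytes(stream, chunk_size=4096):
    # Drain a file-like stream and return the total bytes received.
    # A cut-off generation shows up as a noticeably smaller total
    # than a complete run of the same text.
    total = 0
    while chunk := stream.read(chunk_size):
        total += len(chunk)
    return total

def probe_streaming_endpoint(base_url, text):
    # Hypothetical streaming route and parameters, for illustration
    # only; substitute the real ones from the AllTalk API docs.
    params = urlencode({"text": text, "voice": "female_01.wav",
                        "language": "en", "output_file": "probe.wav"})
    with urlopen(f"{base_url}/api/tts-generate-streaming?{params}") as resp:
        return count_stream_bytes(resp)
```

Running the same text several times and logging the totals makes an intermittent early stop easy to spot without listening to every file.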

On top of that, I'm looking to build a queue that is separate from the TTS queue system provided by the TTS engine's developers. The idea here is that if someone sends over a huge 5,000-character text to generate as TTS, we could push it in for generation a bit at a time (say, two-sentence chunks), which would allow greater control for stopping the current generation in different ways, and also help with tracking what has been sent to a TTS engine.
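The chunked-queue idea above can be sketched roughly like this. This is a minimal illustration with a naive regex sentence splitter, not AllTalk's actual implementation, and `stop_requested` is a hypothetical cancellation check:

```python
import re
from queue import Queue

def split_into_chunks(text, sentences_per_chunk=2):
    # Naive sentence split on ., ! or ? followed by whitespace;
    # real-world text would need a smarter tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), sentences_per_chunk)]

def feed_tts_queue(text, tts_queue, stop_requested=lambda: False):
    # Push the text into the engine's queue a couple of sentences at
    # a time, so generation can be stopped between chunks and each
    # chunk's hand-off to the TTS engine can be tracked or logged.
    sent = 0
    for chunk in split_into_chunks(text):
        if stop_requested():
            break
        tts_queue.put(chunk)
        sent += 1
    return sent
```

Because each chunk is a separate hand-off, a stop request takes effect at the next chunk boundary, and the count of chunks sent versus chunks rendered would show exactly where a cut-off happened.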

Let me know on the question above, but I may look to bump this into the V2 AllTalk as something to investigate, because the things I mention above would allow for simpler testing and diagnosis... if you're OK with that?

Thanks

RenNagasaki commented 2 months ago

> QUESTION: I'm assuming that when you notice skipped audio, it still shows up in the AllTalk terminal/console as having been generated/sent for generation?

Yes, it's presented to me as a normal generation/stream in the console. It just stops early.

> On top of that, I'm looking to build a queue that is separate from the TTS queue system provided by the TTS engine's developers.

Ohhhh, I'm really interested in that. And sure. Take your time! 😁

erew123 commented 2 months ago

Let's bump this into a v2 thing to look at/test, then. :)

I want to say I'm 60% done with v2's code, but some things have gone quickly and were easier than I expected, and then sometimes you spend six hours trying to figure out one stupid little bug/issue. It may be I'm more like one side of 50% done. Then I guess there will have to be a big tidy through the code, checking it's all nice looking and easy to understand. And then, making sure it's documented, or at least that the on-screen interface tells you everything you need (so that I don't get lots of questions on the issues page).

I'm kind of wanting to say that I'll have a beta of v2 within a couple of weeks; it may be sooner, may be later. If I can get the TTS model loaders going, that will be a big step towards a working v2 beta. I've already reworked the API endpoints and communication to start doing lots of extra clever things, simplified the model selection/discovery, and built a remote Text-gen-webui extension to speak with a remote AllTalk (which allows me to test that the new API calls work, etc.). So yeah, getting there bit by bit. The new TTS engines code is complicated, but it's complicated for me, with the idea being that it would be simple for others to add a new TTS engine (in the hope that other people may just want to implement any new TTS engine they like into AllTalk), along with the work on the API making it very simple to integrate with anything people want, plus OpenAI speech API compatibility, etc.

So yeah, I'm getting there...

I'll close this ticket for now, and let's re-investigate when we've got a v2 to look at/test with.