erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI; however, it supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and WAV file maintenance. It can also be used with 3rd party software via JSON calls.
GNU Affero General Public License v3.0

Specify v1/completions source (user/bot/api etc.) #128

Closed afhutanu closed 6 months ago

afhutanu commented 6 months ago

Is your feature request related to a problem? Please describe.
Hi, I am working on integrating TTS with sd_api_pictures in such a way as to take the current reply and call the API to generate the contextual tags. I am doing this via http://127.0.0.1:5000/v1/completions. For example, my payload is:

{
    "prompt": "Say 'this is a test'",
    "max_tokens": 30,
    "echo": true
  }

Response:

{
    "id": "conv-1710440097223950848",
    "object": "text_completion",
    "created": 1710440097,
    "model": "fls",
    "choices": [
        {
            "index": 0,
            "finish_reason": "length",
            "text": "Say 'this is a test'<audio src=\"file/extensions/alltalk_tts/outputs/TTSOUT_1710440098.wav\" controls autoplay></audio> and pause for a response.\nWhen writing this code, I made sure to close all resources, including files. I want to make sure this is",
            "logprobs": {
                "top_logprobs": [
                    {}
                ]
            }
        }
    ],
    "usage": {
        "prompt_tokens": 8,
        "completion_tokens": 74,
        "total_tokens": 82
    }
}

As you can see, TTS is generating the audio files even though in this example I am calling the endpoint from Postman and not from the UI for speech.
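For reference, the Postman call above can be reproduced from Python using only the standard library. This is a minimal sketch assuming the default Text-gen-webui OpenAI-compatible API address from this thread (127.0.0.1:5000); the helper names here are illustrative, not part of any project's API.

```python
import json
import urllib.request

COMPLETIONS_URL = "http://127.0.0.1:5000/v1/completions"

def build_request(prompt, max_tokens=30, echo=True):
    """Build (but do not send) the POST request to the completions endpoint."""
    payload = json.dumps({
        "prompt": prompt,
        "max_tokens": max_tokens,
        "echo": echo,
    }).encode("utf-8")
    return urllib.request.Request(
        COMPLETIONS_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def completion_text(prompt):
    """Send the request; requires a running Text-gen-webui instance."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Note that nothing in this request identifies the caller as Postman, a script, or the web UI, which is exactly the problem described below.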

Describe the solution you'd like
If possible, could we add a parameter that indicates whether the completion endpoint was called from the UI or the API, before generating the audio file?

Describe alternatives you've considered
I've tried leveraging the loaded LLM directly from the ExLlamaV2 module to generate the completion rather than using the API, but I could not figure out how to make it work.

Additional context
This is just a side project, nothing urgent. I appreciate all the work on this. Thank you!

erew123 commented 6 months ago

Hi @afhutanu

I don't know of any way to separate or segregate the calls from within Text-gen-webui, and I suspect it may not be possible to do so at all.

Here's a tech explanation as to why, based on my work on SillyTavern and AllTalk:

As far as AllTalk (or any other Text-gen-webui extension) is concerned, the way extensions are loaded is that you create a file called "script.py" and put it in a folder under the "extensions" area, and that is what Text-gen-webui uses to go "aha, that's an extension I can load and interact with".

From there, the communication is pretty simple. Text-gen-webui sends a "string" to each and every extension, which is basically a variable containing the text the AI LLM has created. So as the extension (AllTalk, for example), all you know is that you've been sent this string, and it's just text. The string carries no other information about what generated it or where it came from. This string is sent whenever the LLM generates a response, no matter how the response was requested.
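The mechanism described above can be sketched as a minimal script.py. The `params` dict and `output_modifier` hook follow the Text-gen-webui extension convention; treat the exact signatures as an assumption based on this discussion rather than authoritative documentation.

```python
# Minimal sketch of an extension's script.py, illustrating why the origin of a
# request is invisible: output_modifier() receives only the generated text.

params = {
    "display_name": "Example TTS extension",
    "activate": True,
}

def output_modifier(string):
    # `string` is the LLM's reply and nothing else: no flag saying whether it
    # came from the web UI, the OpenAI-compatible API, or any other caller.
    if not params["activate"]:
        return string
    # A real extension would generate audio here and append an <audio> tag
    # (the file path below is purely illustrative).
    return string + ' <audio src="file/extensions/example/out.wav" controls></audio>'
```

Because the hook's only input is the text itself, any "where did this come from" signal would have to be added by Text-gen-webui, not by the extension.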

To put this in simple terms, with your example above, all that AllTalk (or any other extension) would see is "this is a test". That is literally all that gets handed over via the string, so there is no way for me to make AllTalk go "aha, that's come from the API externally" (or wherever).

I did look into this issue a few months ago when I made the SillyTavern integration, but had to leave a note that people need to uncheck the "Enable TTS" option when using SillyTavern with Text-gen-webui. Otherwise AllTalk will generate the TTS twice: Text-gen-webui triggers a generation when the LLM responds, and then SillyTavern separately calls AllTalk to generate TTS.

Obviously, if Text-gen-webui were changed in some way to send out information about where a request came from, I could do it. The problem there is that every extension that exists for Text-gen-webui would have to be updated to support this, which is probably unlikely to happen.

The only possibility I can think of as a half-way solution for you would be to uncheck "Enable TTS" within Text-gen-webui, meaning that AllTalk doesn't get anything sent over from Text-gen-webui. Then, when Text-gen-webui responds to your external app, you make a separate API call to AllTalk (https://github.com/erew123/alltalk_tts?tab=readme-ov-file#-api-suite-and-json-curl), where you can of course specify the filename or whatever. The downside is that AllTalk is then disabled as far as the Text-gen-webui interface itself goes, and I'm guessing you want both working at the same time.
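That half-way solution can be sketched like this. The endpoint, default port 7851, and form field names (`text_input`, `character_voice_gen`, `output_file_name`, etc.) are taken from the API suite readme linked above; verify them against your installed AllTalk version, as they are assumptions here.

```python
# Sketch of the suggested workaround: with "Enable TTS" unchecked in
# Text-gen-webui, call AllTalk's standalone API yourself after the LLM replies.
import urllib.parse
import urllib.request

ALLTALK_URL = "http://127.0.0.1:7851/api/tts-generate"  # AllTalk's default port

def build_tts_request(text, voice="female_01.wav", output_file_name="myoutput"):
    """Build (but do not send) the form-encoded TTS generation request."""
    fields = {
        "text_input": text,
        "text_filtering": "standard",
        "character_voice_gen": voice,     # voice file shipped with AllTalk
        "narrator_enabled": "false",
        "language": "en",
        "output_file_name": output_file_name,  # you can pick the filename
        "output_file_timestamp": "true",
        "autoplay": "false",
    }
    data = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(ALLTALK_URL, data=data, method="POST")

# With AllTalk running:
#   urllib.request.urlopen(build_tts_request("this is a test"))
```

Since your app already receives the completion text from /v1/completions, it can simply forward that text into this call, giving you full control over when audio is generated.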

Hopefully that at least gives you a clear answer on the issue!

Thanks

afhutanu commented 6 months ago

Hi @erew123!

Thank you so much for taking the time to get back to me, and for the very well-structured and well-explained response. It's a shame it does not work, but I'm glad I've confirmed this. I've got quite a beefy PC, so I guess I'll just keep two instances of Text-gen-webui, disable TTS on the second one, and route the API calls there.

Thanks so much!