erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but it supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and WAV file maintenance. It can also be used with 3rd party software via JSON calls.
GNU Affero General Public License v3.0

Provide an OpenAI TTS conforming api #133

Closed: jmtatsch closed this issue 7 months ago

jmtatsch commented 7 months ago

Many user interfaces would benefit from only having to implement the clearly defined OpenAI API https://platform.openai.com/docs/guides/text-to-speech instead of a custom one for each possible TTS provider.

I read up on some of this repo's issues, but it didn't become clear to me whether alltalk_tts already supports it or not. If it does, this should maybe be mentioned more prominently in the readme. If not, it would be a very useful addition to this project ;)

erew123 commented 7 months ago

Hi @jmtatsch

AllTalk doesn't have an implementation of the OpenAI API for TTS. There is, however, a good (in my mind) technical reason for this, which is that all TTS models have their own unique calls, features, requirements, etc., and the OpenAI API doesn't include ways to address them.

There are features I wanted to include, such as the narrator, the ability to play audio through the console where AllTalk is running, different methods to retrieve the generated TTS, etc.

With the XTTS model as it stands, and using the OpenAI API calls, I believe the only parameters you could meaningfully use would be input and voice. You certainly wouldn't want to be loading the model on each call, as that would add anything from 5-25 seconds per call. The response format is fixed to WAV, as that's what XTTS supports. Speed, well, yes, the XTTS model does support limited speed manipulation, so in theory that could be implemented.

https://platform.openai.com/docs/api-reference/audio/createSpeech

So I don't believe the OpenAI API would give enough control/flexibility for the XTTS model (or other potential models that may get included in future).

However, saying all of that, I do understand that a basic endpoint that acted as an OpenAI API would simplify integration into other software that can already make calls to OpenAI's API. But obviously you would have to chop down the features accessible over the API calls.
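To make that trade-off concrete, here is a minimal sketch of what such a pared-down endpoint might look like, using FastAPI purely for illustration; `generate_xtts_wav()` is a hypothetical stand-in for AllTalk's real synthesis call, not an existing function:

```python
# Minimal sketch (not part of AllTalk) of a pared-down OpenAI-style speech endpoint.
# generate_xtts_wav() is a hypothetical stand-in for the project's real XTTS call.
import io
import wave

from fastapi import FastAPI
from fastapi.responses import Response
from pydantic import BaseModel

app = FastAPI()

class SpeechRequest(BaseModel):
    model: str = "tts-1"           # accepted but ignored; the local model is already loaded
    input: str
    voice: str = "alloy"
    response_format: str = "wav"   # XTTS produces WAV, so other formats are ignored here
    speed: float = 1.0             # limited speed support in XTTS; ignored in this sketch

def generate_xtts_wav(text: str, voice: str) -> bytes:
    """Hypothetical helper: here it just returns one second of silence as WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(24000)
        wav.writeframes(b"\x00\x00" * 24000)
    return buf.getvalue()

@app.post("/v1/audio/speech")
def create_speech(req: SpeechRequest) -> Response:
    # Only `input` and `voice` are honoured; the other fields are accepted but unused.
    wav_bytes = generate_xtts_wav(req.input, req.voice)
    return Response(content=wav_bytes, media_type="audio/wav")
```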

That's my current take on it, however, if I've misunderstood what you're asking for, or you have further thoughts/suggestions around this, please let me know.

Thanks

jmtatsch commented 7 months ago

Thank you for your detailed answer.

I really only wanted to point out that additionally supporting the OpenAI API would unlock new levels of usefulness for me and many other people.

And as you yourself pointed out, there wouldn't be too much to implement aside from input and voice. All other fields could just stay duds.

erew123 commented 7 months ago

Sure, well I've made a note of it in the Features Requests list https://github.com/erew123/alltalk_tts/discussions/74

I guess we would also have to figure out something to match the voice names that OpenAI uses, since any software targeting OpenAI would be limited to those voices anyway (at least by name).
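One simple way to handle that name matching would be a lookup table from OpenAI's documented voice names to local sample voices; the sketch below assumes hypothetical local .wav file names, which are not real AllTalk assets:

```python
# Sketch: map OpenAI's voice names to local sample voices.
# The .wav file names on the right are placeholders, not real AllTalk assets.
OPENAI_VOICE_MAP = {
    "alloy":   "female_01.wav",
    "echo":    "male_01.wav",
    "fable":   "male_02.wav",
    "onyx":    "male_03.wav",
    "nova":    "female_02.wav",
    "shimmer": "female_03.wav",
}

def resolve_voice(requested: str, default: str = "female_01.wav") -> str:
    """Fall back to a default local voice if the requested name is unknown."""
    return OPENAI_VOICE_MAP.get(requested, default)
```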

If you @jmtatsch or anyone else has OpenAI set up for TTS, I'd love to see an example of the CURL response it gives, so I could match that: https://platform.openai.com/docs/api-reference/audio/createSpeech?lang=curl


  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "alloy"
  }' \
  --output speech.mp3```
erew123 commented 7 months ago

Oh, I'm going to close this so it's not in the issue list, but as I said, it's referenced on the Feature Requests list, and you or others are welcome to keep responding here.

Thanks

jmtatsch commented 7 months ago

For this request

-H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "alloy"
  }' \
  --output speech.mp3

you really just get a binary response, the MP3 file, which I cannot upload ;)

If you don't have any money on your account (like me), you get:

```
curl https://api.openai.com/v1/audio/speech \
  -H "Authorization: Bearer xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "alloy"
  }'
{
    "error": {
        "message": "You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.",
        "type": "insufficient_quota",
        "param": null,
        "code": "insufficient_quota"
    }
}
```
UXVirtual commented 6 months ago

I've been using open-webui as a frontend for Ollama for testing LLM models. It supports using OpenAI's API for TTS generation on responses, and it's possible to point it at an OpenAI API-compatible local TTS for this. It typically sends a request as follows, specifying the output format:

```
{
    "model": "tts-1",
    "input": "Hello there, nice to meet you.",
    "voice": "fable",
    "response_format": "mp3",
    "speed": 1.0
}
```

The response sends the data back chunked, which is handled automatically by browsers if you assign the fetch request's response to a MediaSource and use that as the src value of an Audio element (see here for an implementation).

I've also been experimenting with adapting my Unity LLM frontend to work with OpenAI's API and noticed that OpenAI can send back raw PCM as a streaming response, which Unity can handle if you store incoming chunks of the response in a buffer and then create AudioClips on the Unity side for playback in an AudioSource. Previously I used the same approach to implement streaming support for the current AllTalk API, where it returns WAV files, with some minor processing using an audio library. The PCM approach is preferred for Unity as it can just play the raw response.
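As a rough sketch of what the server side of such a chunked raw-PCM response could look like (again FastAPI purely for illustration; `stream_xtts_pcm()` is a hypothetical generator, not part of AllTalk):

```python
# Sketch: stream raw PCM back in chunks, in the style of OpenAI's streaming response.
# stream_xtts_pcm() is a hypothetical generator standing in for real incremental synthesis.
from typing import Iterator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class SpeechRequest(BaseModel):
    input: str
    voice: str = "alloy"
    response_format: str = "pcm"   # raw 16-bit PCM, no container

def stream_xtts_pcm(text: str, voice: str) -> Iterator[bytes]:
    """Hypothetical generator: yield PCM chunks as they are synthesized.
    Here it just yields a few blocks of silence so the sketch runs end to end."""
    for _ in range(10):
        yield b"\x00\x00" * 2400   # 0.1 s of silence at 24 kHz, 16-bit mono

@app.post("/v1/audio/speech")
def create_speech(req: SpeechRequest) -> StreamingResponse:
    # The client buffers chunks as they arrive and plays them back incrementally.
    return StreamingResponse(stream_xtts_pcm(req.input, req.voice), media_type="audio/pcm")
```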

jmtatsch commented 6 months ago

I am now using openedai-speech in conjunction with open-webui, which works well enough for me.