art-from-the-machine / Mantella

Mantella is a Skyrim and Fallout 4 mod which allows you to naturally speak to NPCs using Whisper (speech-to-text), LLMs (text generation), and xVASynth / XTTS (text-to-speech).
https://art-from-the-machine.github.io/Mantella/
GNU Affero General Public License v3.0
170 stars, 43 forks

Allow External TTS services #104

Open Pendrokar opened 8 months ago

Pendrokar commented 8 months ago

Not sure about API standards, so we might as well have a JSON file which contains the request data, with placeholders that get replaced at request time. Such as:
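A minimal, hypothetical sketch of that idea in Python, assuming a template whose string values carry `{name}` placeholders. The field names (`text`, `speaker_wav`, `language`), the placeholder syntax, and the `fill_placeholders` helper are all illustrative assumptions, not an existing Mantella format:

```python
import json
import re
from typing import Any, Dict

# Matches placeholders of the form {name}. This syntax is an assumption.
_PLACEHOLDER = re.compile(r"\{(\w+)\}")


def fill_placeholders(node: Any, values: Dict[str, str]) -> Any:
    """Replace {name} placeholders in every string value of a parsed JSON tree.

    Unknown placeholders are left untouched. Object keys are not modified.
    """
    if isinstance(node, str):
        return _PLACEHOLDER.sub(
            lambda match: values.get(match.group(1), match.group(0)), node
        )
    if isinstance(node, list):
        return [fill_placeholders(item, values) for item in node]
    if isinstance(node, dict):
        return {key: fill_placeholders(item, values) for key, item in node.items()}
    return node


# Hypothetical template file content; a real service defines its own fields.
template = json.loads(
    '{"text": "{text}", "speaker_wav": "{voice}", "language": "en"}'
)
payload = fill_placeholders(
    template, {"text": 'He said "hello"', "voice": "femalenord"}
)
body = json.dumps(payload)  # request body, correctly escaped
```

Substituting after parsing, rather than inside the raw JSON text, means quotes or backslashes in the spoken line cannot break the request body.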

Pendrokar commented 8 months ago

My Coqui XTTS branch: https://github.com/Pendrokar/Mantella/tree/xtts_client

OpenReplicant commented 8 months ago

yes! i came here just to check for this.

so, with your fork i can just swap out these values with local Coqui HTTP endpoints, and the voices folder has wav inputs for cloning?

; tts_synthesize_url
;   External TTS service (other options don't matter)
;   URL that returns the full audio file (POST)
;   This will be used if there is no skyrim_voice_folder for the model
tts_synthesize_url = https://ampland-epa-demands-humanities.trycloudflare.com/tts_to_audio

; tts_stream_url
;   External TTS service (other options don't matter)
;   URL that returns the chunked audio file (GET)
;   This will be used if there is no skyrim_voice_folder for the model
tts_stream_url = https://ampland-epa-demands-humanities.trycloudflare.com/tts_stream
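For reference, a small sketch of reading those two options with Python's standard `configparser`. The `[Speech]` section name, the `load_tts_urls` helper, and the `localhost:8020` addresses are assumptions for illustration; the fork's actual config layout may differ:

```python
import configparser
from typing import Dict

# Illustrative config text. The section name and local URLs are placeholders.
SAMPLE = """
[Speech]
; URL that returns the full audio file (POST)
tts_synthesize_url = http://localhost:8020/tts_to_audio
; URL that returns the chunked audio file (GET)
tts_stream_url = http://localhost:8020/tts_stream
"""


def load_tts_urls(config_text: str, section: str = "Speech") -> Dict[str, str]:
    """Return the two external TTS URLs, or empty strings when they are missing."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return {
        "synthesize": parser.get(section, "tts_synthesize_url", fallback=""),
        "stream": parser.get(section, "tts_stream_url", fallback=""),
    }


urls = load_tts_urls(SAMPLE)
```

Lines starting with `;` are treated as comments by `configparser`, so the option descriptions can stay next to the values.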

Pendrokar commented 8 months ago

@OpenReplicant Oh, sorry, it seems I forgot to mention which API the code in that branch expects on the XTTS server. It is this one: https://github.com/daswer123/xtts-api-server/tree/main

It is a very basic implementation that supports the multi-speaker model. The WAV inputs must already be on the server. It may work with a single-speaker model as well, if the server is able to switch to the requested model automatically.

Relevant code for any web request changes: https://github.com/Pendrokar/Mantella/blob/xtts_client/src/tts.py#L210-L216
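A rough sketch of what the two web requests could look like, using only the standard library. This is not the code from the branch: the helper names are hypothetical, and the `text`, `speaker_wav`, and `language` fields are assumed from the xtts-api-server README and should be checked against the server you run. The functions only build the requests; sending them would be done with `urllib.request.urlopen`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_synthesize_request(
    url: str, text: str, speaker_wav: str, language: str = "en"
) -> Request:
    """Build a POST request for the endpoint that returns the full audio file."""
    # Assumed JSON fields; speaker_wav names a voice sample already on the server.
    body = json.dumps(
        {"text": text, "speaker_wav": speaker_wav, "language": language}
    ).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_stream_request(
    url: str, text: str, speaker_wav: str, language: str = "en"
) -> Request:
    """Build a GET request for the endpoint that returns chunked audio."""
    query = urlencode({"text": text, "speaker_wav": speaker_wav, "language": language})
    return Request(f"{url}?{query}", method="GET")


# Placeholder local address; substitute the URLs from the config file.
post_req = build_synthesize_request(
    "http://localhost:8020/tts_to_audio", "Hello there", "femalenord"
)
get_req = build_stream_request(
    "http://localhost:8020/tts_stream", "Hello there", "femalenord"
)
```

Because the voice is passed by name, the matching WAV sample has to exist on the server before either request is sent.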