Open art-from-the-machine opened 7 months ago
Unlike what I achieved with xVASynth and DeepMoji, this proposition is a strict emotional switch. It may work, but the final voice would lose the general tone of its original voice, and therefore sound like a different character.
XTTS can struggle with variations in emotion in text prompts. However, if trained solely on wav files of a certain emotion, XTTS can carry that emotion across text prompts.
A workaround for XTTS's limited emotional range would be to create a separate latent file for each emotion of a given voice model, and load the required latent based on the emotion of the sentence. This emotion could be decided by the LLM via actions (similar to Offended and Follow).
For example, if a "neutral" latent is simply femalenord.json, a "happy" latent could be called via a search for femalenord_happy.json.
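The lookup described above could be sketched roughly as follows. This is only an illustration of the naming convention, not actual Mantella or xVASynth code; the function name, the `latents` directory, and the fallback-to-neutral behaviour are all assumptions.

```python
import os

def get_latent_path(voice_model: str, emotion: str, latent_dir: str = "latents") -> str:
    """Return the latent file for a voice/emotion pair.

    Hypothetical sketch: looks for e.g. femalenord_happy.json and falls
    back to the neutral latent (femalenord.json) if no emotion-specific
    file exists.
    """
    if emotion and emotion.lower() != "neutral":
        candidate = os.path.join(latent_dir, f"{voice_model}_{emotion.lower()}.json")
        if os.path.exists(candidate):
            return candidate
    # Fall back to the neutral latent, e.g. latents/femalenord.json
    return os.path.join(latent_dir, f"{voice_model}.json")
```

Falling back to the neutral latent would mean a voice without a trained "happy" variant still speaks, just without the emotional shift.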