art-from-the-machine / Mantella

Mantella is a Skyrim and Fallout 4 mod which allows you to naturally speak to NPCs using Whisper (speech-to-text), LLMs (text generation), and xVASynth / XTTS (text-to-speech).
https://art-from-the-machine.github.io/Mantella/
GNU General Public License v3.0

Allow emotion variations of XTTS latents #243

Open art-from-the-machine opened 4 months ago

art-from-the-machine commented 4 months ago

XTTS can struggle with variations in emotion in text prompts. However, if trained solely on wav files of a certain emotion, XTTS can carry that emotion across text prompts.

A workaround for XTTS's limited emotional range would be to create a separate latent file for each emotion of a given voice model, and load the matching latent based on the emotion of the sentence. The LLM can decide this emotion via actions (similar to Offended and Follow).

For example, if a "neutral" latent is simply femalenord.json, a "happy" latent could be called via a search for femalenord_happy.json.
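The lookup could be sketched roughly as follows. This is only an illustration of the naming convention above, not existing Mantella code: the function name `resolve_latent_path`, the `latents_dir` argument, and the fallback to the neutral latent when no emotion-specific file exists are all assumptions.

```python
from pathlib import Path
from typing import Optional


def resolve_latent_path(latents_dir: Path, voice: str, emotion: Optional[str] = None) -> Path:
    """Pick the latent file for a voice, preferring an emotion-specific one.

    Hypothetical helper: looks for "<voice>_<emotion>.json" and falls back
    to the neutral "<voice>.json" if that file is missing.
    """
    neutral = latents_dir / f"{voice}.json"
    if emotion and emotion.lower() != "neutral":
        candidate = latents_dir / f"{voice}_{emotion.lower()}.json"
        if candidate.is_file():
            return candidate
    # No emotion requested, or no latent recorded for it: use the neutral voice
    return neutral
```

With this, `resolve_latent_path(latents_dir, "femalenord", "happy")` would return `femalenord_happy.json` when that file exists, and `femalenord.json` otherwise, so voices without emotion latents keep working unchanged.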

Pendrokar commented 4 months ago

Unlike what I achieved in xVASynth with DeepMoji, this proposal is a strict emotional switch. It may work, but the final voice would lose the general tone of the original voice and would therefore sound like a different character.