Open art-from-the-machine opened 7 months ago
Unlike what I achieved with xVASynth and DeepMoji, this proposition is a strict emotional switch. It may work, but the final voice would lose the general tone of its original voice, and therefore sound like a different character.
XTTS can struggle with variations in emotion in text prompts. However, if trained solely on wav files of a certain emotion, XTTS can carry that emotion across text prompts.
A workaround for XTTS's limited emotional range would be to create a separate latent file for each emotion of a given voice model, and load the required latent based on the emotion of the sentence. This emotion could be decided by the LLM via actions (similar to Offended and Follow).
For example, if a "neutral" latent is simply femalenord.json, a "happy" latent could be called via a search for femalenord_happy.json.
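The lookup described above could be sketched roughly as follows. This is only an illustration of the naming convention, not actual Mantella or xVASynth code; the function name, the `latents` directory, and the fallback-to-neutral behaviour are all assumptions.

```python
import os

def get_latent_path(voice_model: str, emotion: str, latent_dir: str = "latents") -> str:
    """Return the latent file for a voice/emotion pair.

    Hypothetical sketch: looks for e.g. femalenord_happy.json and falls
    back to the neutral latent (femalenord.json) if no emotion-specific
    file exists.
    """
    if emotion and emotion.lower() != "neutral":
        candidate = os.path.join(latent_dir, f"{voice_model}_{emotion.lower()}.json")
        if os.path.exists(candidate):
            return candidate
    # Fall back to the neutral latent, e.g. latents/femalenord.json
    return os.path.join(latent_dir, f"{voice_model}.json")
```

Falling back to the neutral latent would mean a voice without a trained "happy" variant still speaks, just without the emotional shift.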