erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but it supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and WAV file maintenance. It can also be used with 3rd-party software via JSON calls.
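For example, a third-party application might trigger generation with a request like the following (a minimal sketch, assuming AllTalk's default local endpoint and its documented `/api/tts-generate` route; field names and the set of required parameters may differ between AllTalk versions):

```python
import requests

# Hypothetical third-party call to a locally running AllTalk server.
# Endpoint path and field names are assumptions based on the project's
# API docs; the full API may require additional fields.
payload = {
    "text_input": "Hello from a third-party app.",
    "character_voice_gen": "female_01.wav",  # assumed voice sample name
    "narrator_enabled": "false",
    "language": "en",
    "output_file_name": "demo",
}
r = requests.post("http://127.0.0.1:7851/api/tts-generate", data=payload)
print(r.json())  # response should reference the generated audio file
```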

Different voice for different emotion of character / narrator #314

Closed: shivshankar11 closed this issue 4 weeks ago

shivshankar11 commented 4 weeks ago

Is your feature request related to a problem? Please describe. For SillyTavern: a different voice for each emotion of the character/narrator. The Extras API supports emotion detection.

Describe the solution you'd like A different voice for each emotion.


erew123 commented 4 weeks ago

Hi @shivshankar11

I think you are asking to be able to specify emotions [happy, sad, joy, anger, etc.]. These features are TTS-engine specific, and the XTTS engine does not support this feature. It was on their roadmap as "Implement emotion and style adaptation." https://github.com/coqui-ai/TTS/issues/378

As for any other engines that are currently implemented, they do not support this type of feature. Such a feature will only be possible as/when I can implement a TTS engine that supports such a thing. (As mentioned on the front of GitHub, I do not make the TTS engines: https://github.com/erew123/alltalk_tts?tab=readme-ov-file#%EF%B8%8F-about-this-project--me)

Finally, any TTS engines I know of that support a feature like this require that the emotion style is sent over as part of the TTS generation request, e.g.

[angry] Don't tell me that. [happy] Let's go out to the park today

As such, the AI model/LLM used would have to be capable of sending such information in its text so that it can be forwarded on for TTS generation, and SillyTavern would probably have to code such a feature into their interface before I could pass the text through to an emotion-capable model.
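As a purely hypothetical sketch of the plumbing that would be needed (this is not a current AllTalk feature; the tag names and sample paths are invented for illustration), the tagged text would have to be split up and each segment routed to an emotion-matched voice sample:

```python
import re

# Hypothetical mapping of emotion tags to reference samples for voice cloning.
VOICE_SAMPLES = {
    "angry": "voices/angry.wav",
    "happy": "voices/happy.wav",
    "default": "voices/neutral.wav",
}

TAG_RE = re.compile(r"\[(\w+)\]\s*")

def split_by_emotion(text):
    """Split tagged text into (voice_wav, segment) pairs."""
    segments = []
    pos, current = 0, "default"
    for m in TAG_RE.finditer(text):
        chunk = text[pos:m.start()].strip()
        if chunk:
            segments.append((VOICE_SAMPLES.get(current, VOICE_SAMPLES["default"]), chunk))
        current = m.group(1).lower()
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        segments.append((VOICE_SAMPLES.get(current, VOICE_SAMPLES["default"]), tail))
    return segments

print(split_by_emotion("[angry] Don't tell me that. [happy] Let's go out to the park today"))
# [('voices/angry.wav', "Don't tell me that."), ('voices/happy.wav', "Let's go out to the park today")]
```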

I am looking at other models; please see the feature requests here: https://github.com/erew123/alltalk_tts/discussions/74. However, the work within ST or with LLMs will still need to be done by the people who deal with those things.

Thanks

shivshankar11 commented 4 weeks ago

I want to use happy/sad/energetic-sounding voice MP3/WAV files for cloning, which is not TTS-engine specific.

erew123 commented 4 weeks ago

Hi @shivshankar11 I provide extra voices at the link here: https://github.com/erew123/alltalk_tts?tab=readme-ov-file#-other-installation-notes

Beyond that, you can create your own WAV files from any audio you can find. This is detailed in the help section of the TTS engine.
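As a minimal sketch of preparing such a sample with the pydub library (the 22050 Hz mono WAV target is an assumption based on what XTTS voice samples commonly use; check the engine's help section for the exact format it expects):

```python
from pydub import AudioSegment  # pip install pydub (requires ffmpeg)

# Load any audio file and convert it to the assumed cloning-sample format.
clip = AudioSegment.from_file("some_recording.mp3")
clip = clip.set_frame_rate(22050).set_channels(1)

# Trim to a short, clean segment (e.g. the first 15 seconds) for cloning.
clip[:15_000].export("voices/myvoice.wav", format="wav")
```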


I only provide limited samples because of copyright issues and I cannot include audio samples that have any form of legal copyright claim on them.

Thanks

shivshankar11 commented 4 weeks ago

Can you look into this? https://github.com/SillyTavern/SillyTavern-Extras `--classification-model` | Load a custom sentiment classification model. Expects a HuggingFace model ID. Default (6 emotions). We could select the audio sample file based on the emotion status provided by the sentiment classification model.
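For what it's worth, a minimal sketch of that idea (the model ID and sample file names below are assumptions for illustration; substitute whatever `--classification-model` would load):

```python
from transformers import pipeline

# Assumed 6-emotion HuggingFace classifier; any text-classification model
# that returns emotion labels would work the same way.
classifier = pipeline("text-classification", model="nateraw/bert-base-uncased-emotion")

EMOTION_TO_WAV = {  # hypothetical sample files
    "joy": "voices/happy.wav",
    "sadness": "voices/sad.wav",
    "anger": "voices/angry.wav",
}

def pick_voice(text, default="voices/neutral.wav"):
    """Choose a cloning sample from the classifier's top emotion label."""
    label = classifier(text)[0]["label"]
    return EMOTION_TO_WAV.get(label, default)

print(pick_voice("I can't believe we won the game!"))  # likely voices/happy.wav
```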

erew123 commented 4 weeks ago

Hi @shivshankar11 I've had a look at that; it's a bit too far out of the core of what I am trying to do with AllTalk, and I already have a huge block of code specific to the core of AllTalk to work on. I just don't have time at the moment to work on code at that level that's more than basic integration into another application, e.g. ST.

I also note it's a dead project; however, I can see that someone has taken on maintaining and updating it https://github.com/Abdulhanan535/SillyTavern-ExtrasFix and has been making changes as recent as last week.


I'm assuming they will have a reasonable grasp of that code base, so it may be better to approach them about building integration with other TTS engines via their API calls.