collabora / WhisperSpeech

An Open Source text-to-speech system built by inverting Whisper.
https://collabora.github.io/WhisperSpeech/
MIT License

Emotion markers #45

Open zclch opened 7 months ago

zclch commented 7 months ago

It would be amazing if emotion markers could be supported (or, if they already are, if there were documentation on how to use them), for example by providing indicators like <angry>, <excited>, etc., or by using emojis for the same purpose.
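For illustration, here is a purely hypothetical sketch of how such markers might be used, assuming a Pipeline-style API like the one shown in the project README; the markers themselves are not implemented anywhere:

```python
# Purely hypothetical: emotion markers are NOT implemented in WhisperSpeech.
# This only illustrates the shape of the requested feature.
from whisperspeech.pipeline import Pipeline

pipe = Pipeline()
# proposed inline markers selecting an emotional delivery
pipe.generate_to_file("demo.wav", "<excited>We finally shipped it!</excited> "
                                  "<angry>But the build broke again.</angry>")
```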

jpc commented 7 months ago

This is a nice idea. The trick seems to be getting good ground truth emotional speech samples and labels that don't sound fake.

This is not currently supported in any way, but there are people in the LAION Discord working on emotional speech, so I'll invite you to join the #audio-generation channel there if you are interested.

zclch commented 7 months ago

@jpc, that sounds great. I would love to follow along with their progress. How can we get in contact?

jpc commented 7 months ago

It’s best if you ask on the LAION Discord (link in the README).

daniel-wf-alves commented 5 months ago

If I collect a large dataset of clean speech, annotated with text and "emotion vectors" (which I assume to be accurate), how can I train WhisperSpeech on it? This would amount to training a new multimodal TTS model that goes from text + emotion vectors -> speech. Can you give me high-level guidance on where in the training pipeline I could add support for this extra modality, and how to do it?
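For concreteness, here is a minimal sketch of one way such conditioning could be wired in: project the per-utterance emotion vector into the model dimension and prepend it as an extra conditioning token, much like speaker embeddings are often injected. Every name and dimension here (EmotionConditionedT2S, emo_dim, vocab sizes) is hypothetical and not part of the WhisperSpeech codebase; it only illustrates the shape of the change.

```python
# Hypothetical sketch only: none of these classes exist in WhisperSpeech.
# It shows how a per-utterance "emotion vector" could condition a
# text-to-semantic style transformer by being prepended as an extra token.
import torch
import torch.nn as nn

class EmotionConditionedT2S(nn.Module):
    def __init__(self, text_vocab=512, sem_vocab=1024, d_model=768, emo_dim=16):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        # project the emotion vector into the model's hidden dimension
        self.emo_proj = nn.Linear(emo_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, sem_vocab)

    def forward(self, text_tokens, emotion_vec):
        x = self.text_emb(text_tokens)               # (B, T, d_model)
        # prepend the projected emotion vector as a conditioning token,
        # analogous to how speaker embeddings are commonly injected
        e = self.emo_proj(emotion_vec).unsqueeze(1)  # (B, 1, d_model)
        h = self.encoder(torch.cat([e, x], dim=1))   # (B, T+1, d_model)
        return self.head(h[:, 1:])                   # (B, T, sem_vocab) logits

# toy forward pass: 2 utterances, 20 text tokens each, 16-dim emotion vectors
model = EmotionConditionedT2S()
logits = model(torch.randint(0, 512, (2, 20)), torch.randn(2, 16))
```

Under this approach, training would add an emotion-vector field to each dataset example and thread it through collation and the forward pass; the same idea could presumably also be applied to the semantic-to-acoustic stage if emotion needs to affect prosody at that level.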