SillyTavern / SillyTavern-Extras

Extensions API for SillyTavern.
GNU Affero General Public License v3.0

Bark TTS with voice clone? #15

Open Ph0rk0z opened 1 year ago

Ph0rk0z commented 1 year ago

Between expressions, memory and everything else, I think this would really kick immersion up.

https://github.com/serp-ai/bark-with-voice-clone

Priestru commented 1 year ago

I just opened an issue to ask for Bark too!

Ph0rk0z commented 1 year ago

The cloning sort of sucks right now but the default voices aren't so bad.

ghost commented 1 year ago

This would be fun, although it would also need a narrator voice for all the actions the character does, or it could just speak the lines the character actually says out loud. Tortoise-tts is another option.
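One way to picture the narrator-versus-spoken-lines split is a small parser that tags each part of a message. This is only a sketch: it assumes the common roleplay convention that speech is wrapped in double quotes and actions in asterisks, and the names `split_message` and `spoken_only` are made up for illustration, not part of SillyTavern or Extras.

```python
import re

# Assumed convention (not confirmed by this thread): "speech" in double
# quotes, *actions* in asterisks. Text outside both is ignored here.
SEGMENT = re.compile(r'"([^"]+)"|\*([^*]+)\*')


def split_message(message):
    """Return a list of (kind, text) tuples in message order.

    kind is "speech" for quoted text and "narration" for *action* text.
    """
    segments = []
    for match in SEGMENT.finditer(message):
        speech, action = match.groups()
        if speech is not None:
            segments.append(("speech", speech.strip()))
        else:
            segments.append(("narration", action.strip()))
    return segments


def spoken_only(message):
    """Keep only the lines the character actually says out loud."""
    return " ".join(text for kind, text in split_message(message) if kind == "speech")
```

With something like this, the "speech" segments could go to the character voice and the "narration" segments to a separate narrator voice, or the narration could simply be dropped via `spoken_only`.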

Ph0rk0z commented 1 year ago

I just saw that we got Silero (recently?)

Cohee1207 commented 1 year ago

Silero is not yet integrated with this API. I'm waiting for a pip package release

Ph0rk0z commented 1 year ago

I used Silero today. It won't play the available voices for some reason. It does work, though, through that API server, which I had to edit. It works OK and could use some quality-of-life improvements, but it's aight. For instance, a way to see whether your voice file is done or whether generation has started, etc.

I used the Silero pip package. Having 2 GPUs, I'm eager for Tortoise or maybe even Coqui. Once Bark sorts itself out, I think it will be awesome to actually act out your roleplay and not just read it like a robot.

FerLuisxd commented 1 year ago

Would love to see Bark TTS here!

Ph0rk0z commented 1 year ago

There is also a new Bark fork I am trying: https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer

Cloning might work in that one.

gitmylo commented 1 year ago

didn't expect to see my repo mentioned here lol

(If anyone needs help implementing the voice cloning quantizer, I'll gladly help. Note that there currently isn't a multilingual model, but once enough languages are made by the community, I can train a multilingual model by combining the datasets.)

Ph0rk0z commented 1 year ago

They're the best clones I've gotten with so little data. If only Bark could keep the voice up from sentence to sentence.

gitmylo commented 1 year ago

Yeah, if you want better results at a somewhat higher inference time, combine it with RVC, which is included in my audio-webui: just generate with Bark, then click "Send to RVC". Make sure you have a trained RVC model. I'm currently working on porting the training code to the webui for an easier interface, with a button to estimate epochs and stuff like that.

Ph0rk0z commented 1 year ago

I'm waiting on you to put in RVC training, and making some Tortoise models in the meantime. It doesn't do too badly and trains in 5 hrs.

Generation alone on all of these is a bit slow for real-time text. Hopefully one day that gets solved, because it takes a while even on a 3090.

gitmylo commented 1 year ago

5 hours?

With Bark + RVC, training the RVC model takes like half an hour total. You won't have to train for long, and the results will probably be better than Tortoise. (example below)

https://github.com/SillyTavern/SillyTavern-extras/assets/36931363/362d6066-6e0a-4c52-a9e3-3ca12773931e

Cohee1207 commented 1 year ago

To get the discussion back on track: I currently don't see a way I could integrate that into the Extras pipeline. Is the idea to provide pre-trained voices and let Extras do the inference, or to embed the training into Extras itself? Any thoughts on what the desired behavior is here?

gitmylo commented 1 year ago

For cloning with bark alone, no training is required, just a short audio clip. This clip is then sent to bark as a history prompt, and the rest will be generated from there.

There could be an option to create a voice from an audio clip, which is then used for generations with Bark. Creating a speaker is faster than generating with Bark itself, so it should be fine being built-in.
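To make the history-prompt idea concrete, here is a rough sketch that splits a reply into sentences and hands every sentence the same speaker prompt. The `synthesize` callable and `fake_synthesize` stand-in are hypothetical; with Bark, `synthesize` could wrap `generate_audio(sentence, history_prompt=speaker_prompt)`, but that wiring is an assumption and is not shown or tested here. Whether reusing the prompt like this is enough to keep the voice stable is not something this thread establishes.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def speak(text, synthesize, speaker_prompt):
    """Generate one clip per sentence, always passing the same speaker prompt.

    `synthesize` is a hypothetical callable: (sentence, speaker_prompt) -> audio.
    """
    return [synthesize(sentence, speaker_prompt) for sentence in split_sentences(text)]


def fake_synthesize(sentence, speaker_prompt):
    """Stand-in so the sketch runs without any TTS model installed."""
    return f"[{speaker_prompt}] {sentence}"


clips = speak("Hi there. How are you? Fine!", fake_synthesize, "speaker.npz")
```

Generating per sentence also means the first clip can start playing while later ones are still being produced, which may help a little with the real-time lag mentioned earlier.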

Ph0rk0z commented 1 year ago

Wow, that is good. I'm just making do with what I have. I don't have RVC models for everything I want and didn't look for some other way to train them. For the ones that I did have, the first sentence is always great, but by the next sentence it becomes some dude or another voice. Very hard to keep on track.

To put it in this repo, we could simply use patched Bark with our speaker.npz files that were made through something like the audio-ui. They're only a few kilobytes and, as you see, they work well.
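A sketch of how pre-made speaker files could be picked up, assuming they are simply dropped into one folder as `<name>.npz`. The folder layout and the function names `find_speakers` and `resolve_speaker` are made up for illustration; this is not how Extras actually does it.

```python
from pathlib import Path


def find_speakers(voices_dir):
    """Map speaker name -> path for every .npz file directly inside voices_dir."""
    return {path.stem: path for path in sorted(Path(voices_dir).glob("*.npz"))}


def resolve_speaker(voices_dir, name):
    """Return the .npz path for a speaker name, or raise KeyError if missing."""
    speakers = find_speakers(voices_dir)
    if name not in speakers:
        raise KeyError(f"No speaker file named {name}.npz in {voices_dir}")
    return speakers[name]
```

The resolved path could then be passed along as the history prompt, so adding a voice would just mean copying a small file into the folder.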

pyrater commented 1 year ago

Bark is incoming with the addition of Coqui. It is working on my dev install: https://github.com/SillyTavern/SillyTavern-extras/tree/neo and https://github.com/SillyTavern/SillyTavern/pull/775 . Be warned, Bark is extremely heavy on GPU memory!! I recommend using VITS, see: https://www.youtube.com/watch?v=6QAGk_rHipE&t=137s

Ph0rk0z commented 1 year ago

Thanks! I have Bark working in textgen, but I prefer all the extra features ST has for RP. I'm a lucky one and have a whole 24 GB GPU to dedicate to SD + TTS. I will also try to train a VITS model though, because why not. Bark consistency on longer gens pretty much isn't.