Open Ph0rk0z opened 1 year ago
I just opened issues to ask for Bark too!
The cloning sort of sucks right now but the default voices aren't so bad.
This would be fun, although it would also need a narrator voice for all the actions the character does, or it could just speak the lines the character actually says out loud. Tortoise-tts is another option.
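A rough sketch of the "narrator vs. spoken lines" idea, in case it helps. It assumes actions are wrapped in *asterisks* and everything else is spoken aloud. That convention, and the names `split_for_tts` and `spoken_only`, are made up for illustration and are not anything from Extras.

```python
import re
from typing import List, Tuple

# Assumption: actions/narration are written as *like this*; the rest is dialogue.
ACTION_PATTERN = re.compile(r"\*([^*]+)\*")


def split_for_tts(message: str) -> List[Tuple[str, str]]:
    """Split a message into ordered (role, text) pairs.

    role is "narrator" for *action* spans and "character" for everything else.
    """
    segments: List[Tuple[str, str]] = []
    position = 0
    for match in ACTION_PATTERN.finditer(message):
        # Text between the previous action and this one is spoken dialogue.
        spoken = message[position:match.start()].strip()
        if spoken:
            segments.append(("character", spoken))
        action = match.group(1).strip()
        if action:
            segments.append(("narrator", action))
        position = match.end()
    # Anything left after the last action is also dialogue.
    trailing = message[position:].strip()
    if trailing:
        segments.append(("character", trailing))
    return segments


def spoken_only(message: str) -> str:
    """Drop the actions and keep only what the character says out loud."""
    return " ".join(text for role, text in split_for_tts(message) if role == "character")
```

Each `(role, text)` pair could then go to a different voice, or `spoken_only` could be used to skip narration entirely.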
Right now I saw we got silero (recently?)
Silero is not yet integrated with this API. I'm waiting for a pip package release
I used silero today. It won't play the available voices for some reason. It does work through that API server, though, which I had to edit. Works ok, but it could use some quality of life improvements, for instance a way to see if your voice file is done or if generation has started.
I used the silero pip. Having 2 gpus, eager for tortoise or maybe even coqui. Once bark sorts itself out.. I think it will be awesome actually acting out your roleplay and not just reading like a robot.
Would love to see Bark TTS here!
There is also a new bark I am trying https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer
Cloning might work in that one.
didn't expect to see my repo mentioned here lol
(If anyone needs help implementing the voice cloning quantizer, I'll gladly help. Note that there currently isn't a multilingual model, but once enough languages are made by the community, I can train a multilingual model by combining the datasets.)
It's the best clones I've gotten with so little data. If only bark could keep it up from sentence to sentence.
Yeah, if you want better results at somewhat higher inference time, combine it with RVC, which is included in my audio-webui: just generate with bark, then click "Send to RVC". Make sure you have a trained RVC model. I'm currently working on porting the training code to the webui for an easier interface, with a button to estimate epochs and stuff like that.
I'm waiting on you to put in RVC training, and making some tortoise models in the meantime. It doesn't do too bad and trains in 5 hrs.
Just generation on all of these is a bit slow for real time text. Hopefully one day that gets solved because it takes a bit even on a 3090.
5 hours?
With bark + RVC, the training of RVC takes like half an hour total, you won't have to train for long, and the results will probably be better than tortoise. (example below)
To get the discussion back on track: I currently don't see a way I could integrate that into the Extras pipeline. Is the idea to provide pre-trained voices and let Extras do the inference, or to embed the training into Extras itself? Any thoughts on what the desired behavior is here?
For cloning with bark alone, no training is required, just a short audio clip. This clip is then sent to bark as a history prompt, and the rest will be generated from there.
There could be an option to create a voice from an audio clip, which is then used for generations with bark. Creating a speaker is faster than generating with bark itself, so it should be fine as a built-in feature.
Wow, that is good. I'm just doing with what I have. I don't have RVC models for everything I want and didn't look for some other way to train them. For the ones that I did have, the first sentence is always great but then the next sentence it becomes some dude or another voice. Very hard to keep on track.
To put it in this repo, we could simply use patched bark with our speaker.npz files that were made through something like the audio-ui. They're only a few kilobytes and, as you see, they work well.
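A minimal sketch of how that could look. The text is split into sentences and the same speaker prompt is passed to every call, which may help with the voice drifting from one sentence to the next. The `generate` callable is a placeholder for whatever backend is used, and `split_sentences` and `speak_with_fixed_speaker` are hypothetical names, not part of bark or Extras.

```python
import re
from typing import Any, Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def speak_with_fixed_speaker(
    text: str,
    speaker: Any,
    generate: Callable[[str, Any], Any],
) -> List[Any]:
    """Generate one audio chunk per sentence, always with the same speaker prompt."""
    return [generate(sentence, speaker) for sentence in split_sentences(text)]


# Assumed wiring for bark (not executed here). Whether history_prompt accepts
# a custom .npz path depends on the bark build; the thread mentions a patched one.
#
#   from bark import generate_audio
#   chunks = speak_with_fixed_speaker(
#       "First line. Second line!",
#       "speaker.npz",
#       lambda sentence, spk: generate_audio(sentence, history_prompt=spk),
#   )
```

The returned chunks would still need to be concatenated before playback. Splitting on punctuation is crude, so abbreviations and ellipses will produce extra breaks.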
Bark is incoming with the addition of coqui. It is working on my dev install: https://github.com/SillyTavern/SillyTavern-extras/tree/neo and https://github.com/SillyTavern/SillyTavern/pull/775 . Be warned: Bark is extremely GPU memory heavy! I recommend using VITS, see: https://www.youtube.com/watch?v=6QAGk_rHipE&t=137s
Thanks! I have bark working in textgen but I prefer all the extra features ST has for RP. I'm a lucky one and have a whole 24 GB GPU to dedicate to SD + TTS. I will also try to train a VITS model though, because why not. Bark's consistency on longer gens is pretty much nonexistent.
Between expressions, memory and everything else, I think this would really kick immersion up.
https://github.com/serp-ai/bark-with-voice-clone