Open fivestones opened 1 year ago
Bark is pretty much brand new, and so far I haven't found any way to increase the consistency. It shows huge promise, but voice cloning seems nearly random to me, and as you've noted, it's not always coherent from one output to the next. That's just the nature of generative AI, but you can try to tame it somewhat and then refine the result afterward.
Have you played with the temperature? I'm not really clear on the difference between the waveform and text temperature controls, but I'd recommend cooling both off a bit if you're looking for consistency.
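Something like the following is what I mean by cooling the temperatures off. It's only a rough sketch: the helper name `generate_chunks` and the 0.5 values are my own placeholders, not anything Bark ships or recommends. It takes the generation function as an argument, so you can pass in Bark's `generate_audio`, which accepts `history_prompt`, `text_temp`, and `waveform_temp` (both temperatures default to 0.7). Reusing the same voice prompt and the same temperatures for every chunk at least removes those as sources of variation.

```python
from typing import Callable, List, Sequence


def generate_chunks(
    chunks: Sequence[str],
    generate_fn: Callable[..., object],
    voice: str,
    text_temp: float = 0.5,      # assumed value, lower than Bark's 0.7 default
    waveform_temp: float = 0.5,  # assumed value, lower than Bark's 0.7 default
) -> List[object]:
    """Generate one clip per text chunk with identical voice and temperatures."""
    clips = []
    for chunk in chunks:
        # Same voice prompt and temperatures on every call, so the only
        # thing that changes between clips is the text itself.
        clips.append(
            generate_fn(
                chunk,
                history_prompt=voice,
                text_temp=text_temp,
                waveform_temp=waveform_temp,
            )
        )
    return clips


# Usage with Bark (needs the bark package and downloaded models):
# from bark import generate_audio, preload_models
# preload_models()
# clips = generate_chunks(
#     ["First sentence.", "Second sentence."],
#     generate_audio,
#     voice="v2/en_speaker_6",
# )
```

Lower temperatures should make the output less varied, but I can't promise they fix the drift between clips; treat the numbers as a starting point to experiment from.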
The other thing you can do is run the full (joined) audio output through a speech encode/decode cycle in another tool, which will help it mesh a bit better. I'd recommend looking into so-vits-svc or RVC as a place to start. You can also apply voice filtering and other processing there, which would be useful for refining audiobook audio.
Bark's expressiveness is really good, but without fine-tuning I think a very consistent voice will be difficult to achieve. Try to find a voice that's as consistent as possible, use Bark for the natural-sounding output, and then look at speech-to-speech encoding to get it all sounding right. That's the path I'm currently on, anyway.
Hey! I like what you're doing with this! I was following your comments on Hacker News a couple of weeks ago, and you said that when you had a chance you'd make a YouTube video or do something else to explain how you got some things to work in Bark (cloning voices, etc.). I'd especially like to know if there's a good way to do long-form audio, like making an audiobook. I read somewhere that you said you aren't making audiobooks, but maybe you know some settings that would be good for this? When I've tried to use Bark Infinity to do TTS on any longer piece of text, the final product sounds very choppy: there are lots of extra or cut-off bits where the individual audio files were joined together. And the individual audio files sound different enough from each other that it sometimes feels like a different person is reading each sentence or so. Do you know how to make this better? I'd love to learn more from you if you have time to share this somewhere. Thanks!