Open Ca-ressemble-a-du-fake opened 3 weeks ago
The Toucan TTS model is not very good at voice cloning; the focus is rather on speaking many languages. So it's expected that zero-shot performance on unseen voices is much worse than, e.g., the system from Coqui.
The sampling rate should be 16 kHz or higher, and the length should ideally be around 6 to 10 seconds. The sampling rate is adjusted automatically, so you don't need to worry about that; only the duration has an impact.
You've given me the idea to automate some preprocessing on the reference that is given to the system. I can cut out the silences and cut/assemble or repeat the reference to get the ideal length. Since changing the utterance embedding is typically not something that is done very often, it's fine if the runtime for the switch gets a tiny bit slower, as long as the performance gets better. I will try to get it done in the next few days, and then it really won't matter anymore what you put in.
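A minimal sketch of what such preprocessing could look like, using only NumPy (this is a hypothetical helper, not part of the Toucan codebase; the threshold, frame size, and 8-second target are assumptions within the 6-10 s recommendation above):

```python
import numpy as np

def prepare_reference(wave: np.ndarray, sr: int, target_seconds: float = 8.0,
                      silence_db: float = -40.0, frame: int = 1024) -> np.ndarray:
    """Trim leading/trailing silence, then repeat or cut to the target length."""
    # Frame-wise RMS relative to the clip's peak, in dB
    peak = float(np.max(np.abs(wave))) + 1e-9
    n_frames = len(wave) // frame
    loud = [
        20 * np.log10(np.sqrt(np.mean(wave[i * frame:(i + 1) * frame] ** 2)) / peak + 1e-9)
        > silence_db
        for i in range(n_frames)
    ]
    if any(loud):
        # Keep everything from the first to the last non-silent frame
        first = loud.index(True)
        last = n_frames - 1 - loud[::-1].index(True)
        wave = wave[first * frame:(last + 1) * frame]
    # Repeat the clip if too short, then cut to exactly the target duration
    target_len = int(target_seconds * sr)
    reps = int(np.ceil(target_len / max(len(wave), 1)))
    return np.tile(wave, reps)[:target_len]
```

The output always has the same length, so whatever the user drops in, the embedding extractor sees a clip in the recommended range.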
Thanks Florian for your advice! I will test your update when it's ready. To get better results, should I fine-tune the model on a dataset of the target speaker's voice, or will the model still struggle to reproduce the target speaker?
By the way, are you planning on improving voice cloning / mimicking? What's missing to get better voice-cloning results?
Hi,
First of all congrats for this new version! This is very easy to use and also fast (fast enough on laptop CPU).
As far as the speaker_reference / voice to mimic is concerned, what do you advise regarding its duration and format? Are there best practices you recommend?
For example, CoquiTTS used to recommend a 6-second, 22 kHz WAV voice extract.
So far I couldn't get cloning results as good as with CoquiTTS: the output voice wavers a little and seems to lack harmonics. I used the simple GUI to do this.
Kind regards from France