DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!
Apache License 2.0
1.44k stars 161 forks

Advices on voice to mimic #198

Open Ca-ressemble-a-du-fake opened 3 weeks ago

Ca-ressemble-a-du-fake commented 3 weeks ago

Hi,

First of all, congrats on this new version! It is very easy to use and also fast (fast enough on a laptop CPU).

As far as the speaker_reference / voice to mimic is concerned, what do you advise regarding its duration and format? Are there best practices you recommend?

For example, CoquiTTS used to recommend a 6-second, 22 kHz WAV excerpt.

So far I couldn't get cloning results as good as with CoquiTTS: the output voice is a bit shaky and seems to lack harmonics. I used the simple GUI to do so.

Kind regards from France

Flux9665 commented 3 weeks ago

The Toucan TTS model is not very good at voice cloning; the focus is rather on speaking many languages. So it's expected that the zero-shot performance on unseen voices is much worse than that of, e.g., the system from Coqui.

The sampling rate should be 16 kHz or higher, and the length should ideally be around 6 to 10 seconds. The sampling rate is adjusted automatically, so you don't need to care about that; only the duration can have an impact.
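For anyone who wants to check a reference clip against these numbers before loading it, here is a minimal sketch using only the Python standard library. It is not part of IMS-Toucan: the function names `reference_problems` and `check_reference` are hypothetical, and the sketch assumes an uncompressed PCM WAV file. The thresholds are simply the 16 kHz and 6 to 10 second figures mentioned above.

```python
import wave


def reference_problems(sample_rate, duration, min_rate=16000,
                       min_seconds=6.0, max_seconds=10.0):
    """Return a list of human-readable problems (empty if the clip looks fine)."""
    problems = []
    if sample_rate < min_rate:
        problems.append(f"sampling rate {sample_rate} Hz is below {min_rate} Hz")
    if not (min_seconds <= duration <= max_seconds):
        problems.append(
            f"duration {duration:.2f} s is outside {min_seconds}-{max_seconds} s"
        )
    return problems


def check_reference(path):
    """Read a PCM WAV file and return (sample_rate, duration_seconds, problems)."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        # Number of frames divided by frames per second gives the duration.
        duration = wav.getnframes() / sample_rate
    return sample_rate, duration, reference_problems(sample_rate, duration)
```

Calling `check_reference("my_reference.wav")` (the file name is a placeholder) returns the sampling rate, the duration in seconds, and a list that is empty when both values fall inside the suggested ranges.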

You've given me the idea to automate some preprocessing of the reference that is given to the system. I can cut out the silences and then cut, combine, or repeat the reference to get the ideal length. Since changing the utterance embedding is typically not something that is done very often, it's fine if the switch gets a tiny bit slower, as long as the performance gets better. I will try to get it done in the next few days, and then it really won't matter anymore what you put in.
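Until such preprocessing exists in the toolkit, something similar can be done by hand. The following is only a rough sketch of the idea described above, not the implementation that will end up in IMS-Toucan. It assumes the audio is already loaded as a mono NumPy array; the function names, the 20 ms frame size, the -40 dB threshold, and the 8 second target are all illustrative assumptions.

```python
import numpy as np


def trim_silence(audio, sample_rate, frame_ms=20, threshold_db=-40.0):
    """Drop frames whose RMS energy is more than |threshold_db| below the loudest frame.

    Samples after the last full frame are discarded.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(audio) // frame_len
    if n_frames == 0:
        return audio
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1))
    peak = rms.max()
    if peak == 0:
        return audio  # the clip is entirely silent, nothing sensible to keep
    level_db = 20 * np.log10(np.maximum(rms, 1e-12) / peak)
    return frames[level_db > threshold_db].reshape(-1)


def fit_length(audio, sample_rate, target_seconds=8.0):
    """Repeat and/or cut the clip so that it lasts exactly target_seconds."""
    if len(audio) == 0:
        raise ValueError("cannot extend an empty clip")
    target = int(round(target_seconds * sample_rate))
    repeats = -(-target // len(audio))  # ceiling division
    return np.tile(audio, repeats)[:target]
```

A clip would be passed through `trim_silence` first and `fit_length` second. Hard cuts and plain repetition can introduce audible discontinuities at the joins, so a real implementation might want short crossfades; that is left out here to keep the sketch short.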

Ca-ressemble-a-du-fake commented 2 weeks ago

Thanks Florian for your advice! I will test your update when it's ready. To get better results, should I fine-tune the model on a dataset of the target speaker's voice, or will the model still struggle to reproduce the target speaker?

By the way, are you planning to improve voice cloning / mimicking? What is missing to get better results at voice cloning?