How long in seconds should the speaker reference audio be?

Ca-ressemble-a-du-fake commented 1 year ago

Hi,

When using the read_texts function, how long should the speaker_reference be, and what should it contain to give the best results?

By "how long" I mean its duration in seconds, and by "what" I mean the content and style of the recording. For example, if I want the TTS output to be shouting, will it work if I pass a shouting sample as speaker_reference? Same question for whispering.

Looking forward to reading your answer :smile:

Flux9665 commented 1 year ago

The speaker reference should be somewhere between 6 and 12 seconds. More is ok, but a random window will be taken, so you don't gain anything from using a longer reference audio.
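As a rough illustration of that length guideline, here is a minimal sketch of checking a reference and cropping it to a random window before use. This is not Toucan's actual code: `pick_reference_window` and all of its parameters are hypothetical names, and rejecting clips shorter than 6 seconds is an assumption made for the example.

```python
import random


def pick_reference_window(num_samples, sample_rate,
                          min_seconds=6.0, max_seconds=12.0, rng=None):
    """Return (start, end) sample indices of a window of at most max_seconds.

    Hypothetical helper: raises ValueError if the clip is shorter than
    min_seconds, since very short references are discouraged above.
    """
    rng = rng or random.Random()
    duration = num_samples / sample_rate
    if duration < min_seconds:
        raise ValueError(
            f"reference is {duration:.1f}s, expected at least {min_seconds}s"
        )
    # Clips longer than max_seconds are cropped to a random window.
    window = min(num_samples, int(max_seconds * sample_rate))
    start = rng.randint(0, num_samples - window)
    return start, start + window
```

For a 30-second clip at 16 kHz this returns a 12-second window at a random offset; an 8-second clip is returned whole.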

The speech should be close to what the model has been trained on, so if it hasn't seen shouting or whispering in its training, it will probably not produce the desired results for such speech references. Also, there should be no (long) pauses in the reference. The microphone quality and background noise will also be reflected significantly: the higher the quality of the reference, the better the results. What is said in the reference doesn't really matter; it can be anything. I have heard from people doing voice conversion that a phonetically balanced utterance (e.g. the Harvard Sentences for English) can improve the results.
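To check a candidate reference for long pauses, something like the following sketch could be used. It is an assumption-laden illustration, not part of Toucan: `longest_pause_seconds` is a hypothetical helper, samples are assumed to be floats in [-1, 1], and the 20 ms frame size and 0.01 RMS silence threshold are arbitrary values that would need tuning per recording.

```python
import math


def longest_pause_seconds(samples, sample_rate, frame_ms=20, silence_rms=0.01):
    """Return the duration in seconds of the longest run of quiet frames.

    A frame counts as quiet when its RMS energy is below silence_rms
    (an assumed threshold, not a value taken from Toucan).
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    longest = current = 0
    # Walk over non-overlapping frames; a trailing partial frame is ignored.
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms < silence_rms:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest * frame_len / sample_rate
```

A clip where this value is large relative to its total length is probably a poor reference and worth trimming or re-recording.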

Toucan is not good at voice cloning for unseen speakers. It can adapt to a voice that sounds somewhat similar, but without much change in the speaking style. To a human listener, the difference will be very obvious, because the speaking style is very important to us.

Ca-ressemble-a-du-fake commented 1 year ago

Thank you, I will try again then, since I've only tried with a ~2-second reference.

DigitalPhonetics / IMS-Toucan

How long in seconds should the speaker reference audio be? #106