DigitalPhonetics / IMS-Toucan

Multilingual and Controllable Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart.
Apache License 2.0

Quality of exact style cloning #147

Closed raymond00000 closed 2 weeks ago

raymond00000 commented 1 year ago

Hi guys,

I am very impressed with the paper's idea and results. Also, thanks a lot for sharing the code.

I am trying the example by using Obama's voice to speak the demo page's sample: "Wow, what a beautiful day!" However, the synthesized speech does not sound as if it were uttered by Obama.
I would like to ask whether I did something wrong, or how to produce a better cloned audio.

I attached the reference audio and the cloned audio in the zip: audios.zip

This is the code I tried:

import torch

from UtteranceCloner import UtteranceCloner  # UtteranceCloner.py at the toolkit root

uc = UtteranceCloner(model_id="Meta", device="cuda" if torch.cuda.is_available() else "cpu")
uc.tts.set_utterance_embedding("audios/20090307_Weekly_Address.0000.0001.wav")
uc.clone_utterance(path_to_reference_audio="audios/human.wav",
                   reference_transcription="Wow, what a beautiful day!",
                   filename_of_result="audios/obama_test_cloned_wow.wav",
                   clone_speaker_identity=False,
                   lang="en")

Thanks!

Flux9665 commented 11 months ago

It looks like you're doing it correctly; there are just two issues. First, to make it sound like Obama, the reference audio would already need to have a speaking style similar to his, because the speaking style is taken entirely from the prosody reference audio, and the style has an even bigger impact than the voice on whether the speaker is recognizable. Second, this toolkit is currently very bad at cloning voices unseen during training. It is possible, but it doesn't work well. I have been working on improving this for the last few months, but it is very challenging, so it will take more time.
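A minimal sketch of that first point, assuming the same UtteranceCloner API as in the snippet above (the reference transcription, output filename, and the import path are placeholders/assumptions): if a recording in the target speaker's own style is available, it can be used directly as the prosody reference, with clone_speaker_identity=True so that style and voice both come from the same audio.

import torch

from UtteranceCloner import UtteranceCloner  # assumes UtteranceCloner.py at the toolkit root

uc = UtteranceCloner(model_id="Meta", device="cuda" if torch.cuda.is_available() else "cpu")
# Use an Obama recording as the prosody reference so the speaking style already matches,
# and keep that speaker's identity instead of substituting a different voice embedding.
uc.clone_utterance(path_to_reference_audio="audios/20090307_Weekly_Address.0000.0001.wav",
                   reference_transcription="Good morning.",  # placeholder; must match the reference audio
                   filename_of_result="audios/obama_style_and_voice.wav",
                   clone_speaker_identity=True,
                   lang="en")

With an unseen target voice, results will still be limited by the second point above, but at least the prosody reference then carries the intended speaking style.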