DigitalPhonetics / IMS-Toucan

Multilingual and Controllable Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart.
Apache License 2.0

Quality of exact style cloning #147

Closed raymond00000 closed 2 weeks ago

raymond00000 commented 1 year ago

Hi guys,

I am very impressed with the paper's idea and results. Also, thanks a lot for sharing the code.

I am trying the example by using Obama's voice to speak the demo page's sample: "Wow, what a beautiful day!" However, the synthesized speech does not sound as if it were uttered by Obama.
I would like to ask whether I did something wrong, or how to produce a better cloned audio.

I attached the reference audio and the cloned audio in the zip: audios.zip

This is the code I tried:

import torch

from UtteranceCloner import UtteranceCloner  # UtteranceCloner.py at the toolkit root

uc = UtteranceCloner(model_id="Meta", device="cuda" if torch.cuda.is_available() else "cpu")
uc.tts.set_utterance_embedding("audios/20090307_Weekly_Address.0000.0001.wav")
uc.clone_utterance(path_to_reference_audio="audios/human.wav",
                   reference_transcription="Wow, what a beautiful day!",
                   filename_of_result="audios/obama_test_cloned_wow.wav",
                   clone_speaker_identity=False,
                   lang="en")

Thanks!

Flux9665 commented 11 months ago

It looks like you're doing it correctly; there are just two issues. First, to make it sound like Obama, the reference audio would already need to have a speaking style similar to his, because the speaking style is taken entirely from the prosody reference audio, and the style has an even bigger impact than the voice on whether the speaker is recognizable. Second, this toolkit is currently very bad at cloning voices unseen during training. It is possible, but it doesn't work well. I have been working on improving this for the last few months, but it is very challenging, so it will take more time.
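A minimal sketch of that first point, assuming the same UtteranceCloner API as in the snippet above (the reference transcription, output filename, and the import path are placeholders/assumptions): if a recording in the target speaker's own style is available, it can be used directly as the prosody reference, with clone_speaker_identity=True so that style and voice both come from the same audio.

import torch

from UtteranceCloner import UtteranceCloner  # assumes UtteranceCloner.py at the toolkit root

uc = UtteranceCloner(model_id="Meta", device="cuda" if torch.cuda.is_available() else "cpu")
# Use an Obama recording as the prosody reference so the speaking style already matches,
# and keep that speaker's identity instead of substituting a different voice embedding.
uc.clone_utterance(path_to_reference_audio="audios/20090307_Weekly_Address.0000.0001.wav",
                   reference_transcription="Good morning.",  # placeholder; must match the reference audio
                   filename_of_result="audios/obama_style_and_voice.wav",
                   clone_speaker_identity=True,
                   lang="en")

With an unseen target voice, results will still be limited by the second point above, but at least the prosody reference then carries the intended speaking style.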