DigitalPhonetics / IMS-Toucan

Multilingual and Controllable Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart.
Apache License 2.0

Feature request - New German Data Set Thorsten – 2022.10 Available for Re-Training #54

Closed eqikkwkp25-cyber closed 1 year ago

eqikkwkp25-cyber commented 1 year ago

Are there any plans to re-train and provide a new model based on the new German dataset Thorsten 2022.10? It can be found here: https://www.thorsten-voice.de/en/datasets-2/

Thanks.

Flux9665 commented 1 year ago

Thanks for the reminder!

The current model in the release section is trained on a bunch of different datasets, including version 1 of the Thorsten dataset. I'll add the new dataset to the collection of joint datasets, so it will be part of the next release. There will be no individual model trained on this dataset alone, but the data will be included in the next iteration of the massively multilingual meta model. I'll also include some subsets of the emotional data (not the whispering, because whispering works entirely differently from regular speech: the vowels have no phonation).

thorstenMueller commented 1 year ago

Thanks 😊. If I can support in some way, please let me know.

Flux9665 commented 1 year ago

Today's release includes models trained with the new Thorsten dataset. It's part of the multilingual model (demo: https://huggingface.co/spaces/Flux9665/ThisSpeakerDoesNotExist). We will probably also use the data as purely German pretraining for our German poetry synthesis projects at the University of Stuttgart :)

thorstenMueller commented 1 year ago

If you need some poetry recordings from me, just let me know @Flux9665 😊.

eqikkwkp25-cyber commented 1 year ago

I am overwhelmed by 1000 artificial German voices :-) Are those associated with certain speakers like Thorsten, and is there a list of names for the voice seeds? Or are the voices completely artificial in the sense that they are mixed together somehow?

Flux9665 commented 1 year ago

The voices are completely artificial. A separate generative model (a Wasserstein GAN) is trained to produce voices that don't exist. We did this for speaker-privacy reasons. It can sometimes happen that a generated voice is similar to a voice seen during training, but most of the time the voices cannot be linked to any human with speaker verification or speaker identification models.
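For intuition, here is a minimal sketch of the idea, not the toolkit's actual code (the class name, layer sizes, and dimensions below are made up): a GAN generator maps a random latent vector (the "voice seed") to a speaker embedding, and the TTS is then conditioned on that embedding instead of on an embedding extracted from a real speaker.

```python
import torch

class SpeakerEmbeddingGenerator(torch.nn.Module):
    """Hypothetical generator: maps a latent noise vector to a speaker embedding."""

    def __init__(self, latent_dim=32, embedding_dim=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(latent_dim, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, embedding_dim),
        )

    def forward(self, z):
        return self.net(z)

generator = SpeakerEmbeddingGenerator()
z = torch.randn(1, 32)                        # a random "voice seed"
artificial_speaker_embedding = generator(z)   # condition the TTS on this embedding
```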

eqikkwkp25-cyber commented 1 year ago

I got the idea.

Unfortunately, restarting the application (IMS-Toucan, ImprovedControllableMultilingual branch, run_gradio_demo.py) gives me completely new random voice seeds, so I cannot pick the ones I like for reuse (the quality of the generated voices varies greatly), and there are not that many multilingual, multi-speaker TTS models freely available. Can I make ImprovedControllableMultilingual "controllable" in this sense via some configuration parameters?

run_interactive_demo.py does not use the Wasserstein GAN but a PATH_TO_REFERENCE_SPEAKER audio wav for similar purposes, right?

Flux9665 commented 1 year ago

Yes, when you supply a reference audio, the TTS will try to speak similarly to the reference audio. This works well for the voice, but not so well for the speaking style. The interactive demo and the controllable demo are just demos; they are not meant to be fully fledged applications, so many of their design choices are not great. As of now there is no way to ensure you get the same voice seeds again; the voices are always randomly generated. You can change this by saving the list of seed voices to a file and loading it, rather than generating new ones, after this line:

https://github.com/DigitalPhonetics/IMS-Toucan/blob/5f1dce3ba60a7a8a5550a6327154f46376920c92/InferenceInterfaces/Controllability/GAN.py#L22

z_list is the list of voices that are generated in the current run.
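A minimal sketch of that change might look like the following; it assumes z_list is a list of torch tensors, and the file path and helper name are hypothetical, not part of the repo:

```python
import os
import torch

SEED_FILE = "stored_voice_seeds.pt"  # hypothetical path, chosen for this sketch

def get_voice_seeds(generate_z_list, n_voices=1000):
    """Load previously saved voice seeds if available; otherwise generate new
    ones (with whatever currently fills z_list in GAN.py) and save them so the
    next run of the demo offers the same voices again."""
    if os.path.exists(SEED_FILE):
        return torch.load(SEED_FILE)
    z_list = generate_z_list(n_voices)  # placeholder for the existing generation code
    torch.save(z_list, SEED_FILE)
    return z_list
```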