DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!
Apache License 2.0

multispeaker multilanguage finetuning #57

Closed · ssolito closed this issue 1 year ago

ssolito commented 1 year ago

Hello,

I would like to know how the training pipelines should be set up for a multi-speaker model, and then for a multi-speaker, multi-language model. When finetuning a monolingual, single-speaker model I set lang_embs=100, but what should I set in the multi-speaker case? And what about a multi-language, multi-speaker model? Should I also set the utt_embeds parameter?

Thanks!!! Sarah

bharaniyv commented 1 year ago

You should not change any parameters for finetuning, even if your data is single-speaker. Add your own pipeline to run_training_pipeline.py, using TrainingInterfaces/TrainingPipelines/FastSpeech2_finetuning_example.py as a reference, add your dataset processor to the Utility/path_to_transcript_dicts.py file, and start the training. Depending on your data size, you can train anywhere from 20K to 100K steps and expect decent results.
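
For reference, the functions in Utility/path_to_transcript_dicts.py simply map audio file paths to their transcripts. Below is a minimal sketch of such a dataset processor; the corpus name, directory layout, and metadata.csv format are assumptions made up for illustration, not part of the toolkit.

```python
# Hypothetical entry for Utility/path_to_transcript_dicts.py.
# Assumes a made-up corpus under /data/my_corpus with a metadata.csv whose
# lines look like "wavs/0001.wav|transcript text" -- adapt to your own data.
import os


def build_path_to_transcript_dict_my_corpus(root="/data/my_corpus"):
    path_to_transcript = dict()
    with open(os.path.join(root, "metadata.csv"), encoding="utf8") as f:
        for line in f:
            wav_rel_path, transcript = line.strip().split("|", maxsplit=1)
            path_to_transcript[os.path.join(root, wav_rel_path)] = transcript
    return path_to_transcript
```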

Note: If you are training a new language, remember to add the new language details in the Preprocessing/TextFrontend.py file by creating a new ID for your language. For more information you can refer to the ReadMe file, where it is explained in detail.
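
To give a rough idea of what "creating a new ID for your language" means (a simplified sketch, not the actual contents of Preprocessing/TextFrontend.py): each supported language gets a stable integer ID that indexes the language-embedding table, so a new language takes the next unused ID and existing IDs are never renumbered.

```python
# Simplified illustration of a language-code-to-ID lookup; the codes and
# numbers here are placeholders, not the toolkit's real assignments.
import torch

LANGUAGE_IDS = {
    "en": 0,
    "de": 1,
    "es": 2,
    "eu": 3,  # a newly added language simply gets the next unused ID
}


def get_language_id(language_code):
    # returned as a LongTensor so it can index an nn.Embedding directly
    return torch.LongTensor([LANGUAGE_IDS[language_code]])
```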

ssolito commented 1 year ago

Hello,

thanks for your reply! I should have mentioned that I am currently working with the previous version of the toolkit (multi_language_multi_speaker). In that version, the example training pipeline says to use lang_embs=100 in order to finetune the model, as stated in the comment: "# because we want to finetune it, we treat it as multilingual, even though we are only interested in German here"

Could you therefore let me know whether any changes need to be made for the multi-speaker and multi-language cases I mentioned at the beginning?

bharaniyv commented 1 year ago

No. As I mentioned before, even if you are finetuning the multilingual, multi-speaker model on a single language, lang_embs=100 should stay the same. The model will train on that new language and sound a bit better in it, but its multilingual and multi-speaker abilities will stay the same.

tomschelsen commented 1 year ago

Hi, same question, but related to the fact that the finetuning function requires setting a language: https://github.com/DigitalPhonetics/IMS-Toucan/blob/94d6a5798ee31a9db13254c822df6dfc32339731/TrainingInterfaces/TrainingPipelines/FastSpeech2_finetuning_example.py#L57

I think that, as it stands, the from-scratch meta training is intended to be multilingual, but the finetuning is not.

bharaniyv commented 1 year ago

The design of this library is that even when you are finetuning a multilingual model for a single language, you can simply do so by adding the single-language datasets and the language code in the TextFrontend file (if it is a new language) and starting the training. The new model will work well in that language. You cannot change any parameters of a pre-trained model; if you do, you have to restart the training from scratch, which is not recommended. So keep the parameters the same and just finetune on new languages, using the example file as a reference.

ssolito commented 1 year ago

Hello again,

thanks for the help provided so far. But what if I want to finetune a multilingual model? That is, my data consists of two corpora: one Spanish and one Basque. Obviously I understand that the model will only be trained on one of these languages at a time, one run for Spanish and another for Basque. Even in that case, should I just set lang_embs=100?

Flux9665 commented 1 year ago

Hi! The number of language embeddings has to be the same as in the pretrained checkpoint that you are loading; otherwise you'll get an error saying that the number of parameters in the model and the number of parameters in the checkpoint don't match. So if you want to train your model based on a pretrained model, you have to keep this the same, regardless of how many languages you want to use. We will try to increase the number of supported languages drastically pretty soon, so the upcoming models will have even more language embeddings by default.
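
To illustrate why the count must match (a generic PyTorch sketch, not IMS-Toucan's actual classes): if a checkpoint was saved with a language-embedding table of 100 rows and the model is rebuilt with a different number, load_state_dict fails with a size-mismatch error.

```python
# Generic PyTorch demonstration of the size-mismatch error described above;
# TinyTTS is a placeholder model, not an IMS-Toucan class.
import torch


class TinyTTS(torch.nn.Module):
    def __init__(self, lang_embs=100, emb_dim=16):
        super().__init__()
        self.language_embedding = torch.nn.Embedding(lang_embs, emb_dim)


torch.save(TinyTTS(lang_embs=100).state_dict(), "pretrained_sketch.pt")

model = TinyTTS(lang_embs=12)  # does NOT match the checkpoint
try:
    model.load_state_dict(torch.load("pretrained_sketch.pt"))
except RuntimeError as err:
    print(err)  # "size mismatch for language_embedding.weight: ..."
```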

Also, you can finetune a multilingual model that can speak both Spanish and Basque, but I haven't prepared an example script for that yet. For training a single model to speak multiple languages, you would need to use the meta_train_loop rather than the fastspeech_train_loop script. I plan to unify those two into a common script where you can pass a list of datasets per language in the future; it's just not done yet.

The language parameter you pass in the train function is only related to the visualization that is generated to keep track of progress. Other than that, you don't need to care about this language argument of the train_loop. If the datasets for the two languages are roughly equally sized, you can just merge them in a concat dataset, like in the example script, and train on them jointly with the regular finetuning script. Then you will also get a model that can speak both Spanish and Basque. The other script is only needed for unbalanced dataset sizes.
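
A minimal sketch of the "merge them in a concat dataset" idea, using dummy TensorDatasets in place of the real FastSpeech2 datasets you would build in your pipeline file:

```python
# Hedged sketch: joint finetuning data for two roughly equally sized
# languages. The TensorDatasets are stand-ins for the real datasets.
import torch
from torch.utils.data import ConcatDataset, TensorDataset

spanish_set = TensorDataset(torch.randn(1000, 8))  # placeholder Spanish corpus
basque_set = TensorDataset(torch.randn(1100, 8))   # placeholder Basque corpus

joint_set = ConcatDataset([spanish_set, basque_set])
print(len(joint_set))  # 2100; a shuffled DataLoader will mix both languages
```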

krisbianprabowo commented 1 year ago

> For training a single model to speak multiple languages, you would need to use the meta_train_loop rather than the fastspeech_train_loop script.

Hey, I'm wondering: in order to fine-tune the multilingual + multispeaker model, should we actually use the meta pipeline too, or can we simply use our own modified finetuning_example.py?

=========

> If the datasets for the two languages are roughly equally sized, you can just merge them in a concat dataset

One more question, just to make the case you are discussing above clear: do we only need to set the lang parameter here https://github.com/DigitalPhonetics/IMS-Toucan/blob/94d6a5798ee31a9db13254c822df6dfc32339731/TrainingInterfaces/TrainingPipelines/FastSpeech2_finetuning_example.py#L57 to a single language ID, even if we intend to fine-tune on multilingual datasets?

Flux9665 commented 1 year ago

Which train loop you choose only depends on the data that you want to train on now; which model you use as a starting point does not matter. So if you want to train on more than one language, you should use the meta pipeline; if you train on a single language, you can use the finetuning_example script. The next version of the toolkit will unify this so you don't have to choose anymore: it will detect the number of languages and set up the correct train loop in the background.

The language that you choose at this point is only used for the progress plots that are generated. So if it's "de", it will use the German language embedding and the German test sentence. In the past, multiple testing languages were supported; maybe I'll bring that feature back. I removed it for the sake of simplicity and only have plots for one main language, because I found that the model learns pretty evenly across all languages, so I ended up always looking at only one language anyway.

krisbianprabowo commented 1 year ago

Thank you for the explanation, really appreciate it!

Flux9665 commented 1 year ago

Today's release introduces the train_loop_arbiter, which chooses the correct train loop for both mono-lingual and multi-lingual cases. I hope the usage becomes clear from the finetuning_example.py script together with the other training pipelines.
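
The dispatch logic is roughly the following (a sketch of the idea only, not the actual implementation): one dataset means the mono-lingual loop, a list of several datasets (one per language) means the multi-lingual loop.

```python
# Sketch of the arbiter idea with stand-in train loops; none of this is the
# real IMS-Toucan code, it only illustrates the dispatch described above.
def mono_lingual_train_loop(dataset, **kwargs):
    print(f"finetuning loop on {len(dataset)} samples")


def multi_lingual_train_loop(datasets, **kwargs):
    print(f"meta-style loop over {len(datasets)} language datasets")


def train_loop_arbiter_sketch(datasets, **kwargs):
    # `datasets` is always a list, one entry per language
    if len(datasets) == 1:
        mono_lingual_train_loop(datasets[0], **kwargs)
    else:
        multi_lingual_train_loop(datasets, **kwargs)


train_loop_arbiter_sketch([list(range(500))])                    # mono-lingual case
train_loop_arbiter_sketch([list(range(500)), list(range(700))])  # multi-lingual case
```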