Tomiinek / Multilingual_Text_to_Speech

An implementation of Tacotron 2 that supports multilingual experiments with parameter-sharing, code-switching, and voice cloning.
MIT License

Documentation Request: Include instructions on how to fine tune pre-existing weights #14

Closed michael-conrad closed 4 years ago

michael-conrad commented 4 years ago

Please include instructions on how to resume training starting with your 70k iteration weights.

Would it be possible to add additional languages as part of a fine tuning process?

michael-conrad commented 4 years ago

Is this the right approach?

https://pytorch.org/tutorials/recipes/recipes/warmstarting_model_using_parameters_from_a_different_model.html

michael-conrad commented 4 years ago

I noticed that you have the code set to overwrite a checkpoint's params if given an explicit params file.

So I'm trying

python train-ga.py --checkpoint generated_switching --hyper_parameters generated_switching_cherokee6 --accumulation_size 5

After making sure that the alphabets and languages from the checkpointed version are appended to the versions in the new params file.

Tomiinek commented 4 years ago

Ah, I am sorry for the late response, I forgot ...

> Please include instructions on how to resume training starting with your 70k iteration weights. Is this the right approach? https://pytorch.org/tutorials/recipes/recipes/warmstarting_model_using_parameters_from_a_different_model.html

These are just weights and not checkpoints (so it is missing optimizer-related things and so on), but you can use them for initialization. Look at these lines. The last four lines are not relevant in this case, so you can remove them.
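Loading a weights-only file into a freshly constructed model could be sketched as below. This is a hedged illustration: the stand-in model, the file name, and the `strict=False` handling are assumptions for demonstration, not the repo's exact code.

```python
# Sketch: initializing a new model from a released weights file (no optimizer
# state is available, so training restarts from step 0 with fresh optimizer).
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the real Tacotron 2 model

# In practice you would do something like:
#   state_dict = torch.load("generated_switching.pyt", map_location="cpu")
# Here we fabricate a matching state dict so the sketch is self-contained.
state_dict = {"weight": torch.zeros(2, 4), "bias": torch.zeros(2)}

# strict=False tolerates keys that differ between the file and the model
# (e.g. embeddings resized for new languages); mismatches are returned
# rather than raising an error.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
```

After loading, `missing` and `unexpected` list any keys that did not line up, which is useful to inspect before deciding how to initialize the leftover parameters.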

> Would it be possible to add additional languages as part of a fine tuning process?

I originally wanted to include the "fine-tuning" feature, but the code became very complicated and I actually did not need it for my experiments. I removed all the code related to fine-tuning in this commit 6c603ef9b049dd85c57cbf186e2ede7839348f07. Check out the train.py file.

The typical use case is probably fine-tuning the multilingual model to a single new language or speaker. Things are complicated because you have to make sure that the alphabet, speakers, etc. match, and decide what to do if they don't (which initialization approach to take, etc.). In the case of the generated model, you also (IMHO) want to freeze all the encoder parameters and fine-tune just the language and speaker embeddings, and maybe also the decoder; but in the case of the other models supported by the code, you want to freeze or train different parts ...
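Freezing the encoder while leaving the language and speaker embeddings trainable could look roughly like this. The toy module and attribute names below are illustrative stand-ins, not the repo's actual classes:

```python
import torch
import torch.nn as nn

# Toy model mirroring the structure described above (names are illustrative).
class ToyTTS(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.language_embedding = nn.Embedding(3, 8)
        self.speaker_embedding = nn.Embedding(5, 8)

model = ToyTTS()

# Freeze everything, then re-enable gradients only for the embeddings.
for p in model.parameters():
    p.requires_grad = False
for name, p in model.named_parameters():
    if "embedding" in name:
        p.requires_grad = True

# Hand only the trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```

The same pattern extends to freezing or unfreezing the decoder, depending on which model variant is being fine-tuned.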

michael-conrad commented 4 years ago

> These are just weights and not checkpoints (so it is missing optimizer-related things and so on), but you can use them for initialization. Look at these lines. The last four lines are not relevant in this case, so you can remove them.

So I can add a CLI option ("--with__weights" or similar) to load the weights, but otherwise do everything as for a new model?

If yes, would there be any advantage in starting with the previous parameters and then adding the additional language, so that everything stays in the same embedding order?

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

padmanabhankrishnamurthy commented 2 years ago

Hi,

Just wanted to know if there has been any movement on this, and if there's a clearer path to fine-tuning the model with new languages / speakers now?

For example, if I wanted to add support for English without having to re-train, what parameters would I have to freeze / train to enable this?

Thanks!

Tomiinek commented 2 years ago

Hello, I am sorry guys, no movement. The training script is also not very fine-tuning friendly :pensive:

padmanabhankrishnamurthy commented 2 years ago

Thanks for the reply!

I've been trying to adapt the current code for fine-tuning on the LJSpeech dataset, i.e, adding support for English and for the LJSpeech speaker.

My approach currently involves freezing all parameters of the character encoder using param.requires_grad = False, and training only the language encoder and the speaker encoder. Since there is only one speaker in the LJSpeech dataset, I have even set multi_speaker to False to turn off the adversarial speaker classifier. My model has been training for around 2 days (150 epochs on only the LJSpeech dataset), and while speech is starting to be generated in the LJSpeech speaker's voice, the model appears to have lost all information about other speakers. Consequently, feeding in any speaker id produces speech only in the LJSpeech speaker's voice.

Does this approach seem right to you?

Tomiinek commented 2 years ago

Oh, interesting!

Just to clarify ... are you using GeneratedConvolutionalEncoder as the encoder? If so, how did you add English? Did you make the inner embedding bigger and trainable while fixing the rest of the encoder parameters? Also, how do you load the pre-trained model and treat the speaker embeddings? Because if you set multi_speaker=False, the checkpoint has some extra parameters (and maybe the decoder expects larger inputs?).
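Growing an embedding table by one row for a new language while preserving the pre-trained rows could be sketched like this (the sizes and variable names are illustrative assumptions, not from the repo):

```python
import torch
import torch.nn as nn

old = nn.Embedding(3, 8)   # e.g. the pre-trained language embedding
new = nn.Embedding(4, 8)   # one extra row for the added language

# Copy the pre-trained rows into the larger table; the new row keeps its
# fresh random initialization and is the part that really needs training.
with torch.no_grad():
    new.weight[:3] = old.weight
```

The enlarged table can then replace the original module before loading the rest of the weights with `strict=False`.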

Fixing the decoder seems OK, but you cannot expect the resulting voice to exactly match Linda's. Maybe you can try fine-tuning it too, but with a lower learning rate.
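Fine-tuning the decoder at a lower learning rate than the embeddings can be done with per-parameter-group options in the optimizer. A minimal sketch, with stand-in modules and learning rates chosen for illustration only:

```python
import torch
import torch.nn as nn

# Stand-ins for the real modules.
embeddings = nn.Embedding(5, 8)
decoder = nn.Linear(8, 8)

# One optimizer, two learning rates: the pre-trained decoder moves slowly,
# the freshly initialized embeddings adapt faster.
optimizer = torch.optim.Adam([
    {"params": embeddings.parameters(), "lr": 1e-3},
    {"params": decoder.parameters(), "lr": 1e-4},
])
```

This avoids juggling two optimizers while still letting the decoder drift only gently away from its pre-trained weights.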

padmanabhankrishnamurthy commented 2 years ago

Hi,

So unfortunately, our fine-tuning experiments didn't work out. But we're trying another line of experiments in which we're attempting to get a single English speaker to speak in another language (say, for example, German). In this case, since the use case employs only one English speaker, is it sufficient to train the model using English recordings of only the target speaker, and German recordings of multiple other speakers? I.e., am I right in concluding that recordings of multiple English speakers are unnecessary, since we wish to synthesise German speech in only one particular English voice?

Thanks!