Is this the right approach?
I noticed that you have the code set up to overwrite a checkpoint's params if given an explicit param.
So I'm trying
python train-ga.py --checkpoint generated_switching --hyper_parameters generated_switching_cherokee6 --accumulation_size 5
This is after making sure that the alphabets and languages from the checkpointed version are appended to the versions in the new params file.
Ah, I am sorry for the late response, I forgot ...
Please include instructions on how to resume training starting with your 70k iteration weights. Is this the right approach? https://pytorch.org/tutorials/recipes/recipes/warmstarting_model_using_parameters_from_a_different_model.html
These are just weights and not checkpoints (so it is missing optimizer-related things and so on), but you can use them for initialization. Look at these lines. The last four lines are not relevant in this case, so you can remove them.
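For reference, a minimal warm-start sketch in plain PyTorch, in the spirit of the linked tutorial; the stand-in model and the file name below are placeholders, not the repository's actual classes or paths:

```python
import torch
import torch.nn as nn

# Toy stand-in model just to keep the sketch self-contained; in practice this
# would be the full model that train.py builds from the hyper-parameters.
model = nn.Sequential(nn.Embedding(100, 64), nn.Linear(64, 80))

# The published file holds weights only (a state_dict), not a training
# checkpoint, so there is nothing optimizer-related to restore.
state_dict = torch.load("generated_switching_weights.pyt", map_location="cpu")  # hypothetical file name
model.load_state_dict(state_dict)

# Training then starts with a fresh optimizer and step counter, as for a new model.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```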
Would it be possible to add additional languages as part of a fine tuning process?
I originally wanted to include the "fine-tuning" feature, but the code became very complicated and I actually did not need it for my experiments. I removed all the code related to fine-tuning in this commit 6c603ef9b049dd85c57cbf186e2ede7839348f07. Check out the train.py file.
The typical use case is probably that you fine-tune the multilingual model to a single new language or speaker. Things are complicated because you have to make sure that the alphabet, speakers, etc. match, and decide what to do if they don't (which approach to initialization to use, etc.). In the case of the generated model, you also (IMHO) want to freeze all the encoder parameters and fine-tune just the language and speaker embeddings and maybe also the decoder, but in the case of other models supported by the code, you want to freeze or train different parts ...
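As a rough illustration of that freezing scheme, here is a sketch over a toy model; the attribute names (encoder, language_embedding, speaker_embedding, decoder) are assumptions for the example, not the actual attributes of the code's model class:

```python
import torch
import torch.nn as nn

# Toy stand-in with the components discussed above.
class ToyTTS(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(64, 64)
        self.language_embedding = nn.Embedding(10, 8)
        self.speaker_embedding = nn.Embedding(50, 8)
        self.decoder = nn.GRU(80, 80)

model = ToyTTS()

# Freeze everything first, then un-freeze only the parts to be fine-tuned.
for p in model.parameters():
    p.requires_grad = False
for module in (model.language_embedding, model.speaker_embedding, model.decoder):
    for p in module.parameters():
        p.requires_grad = True

# Embeddings at the normal rate, decoder (optionally) at a gentler one.
optimizer = torch.optim.Adam([
    {"params": model.language_embedding.parameters()},
    {"params": model.speaker_embedding.parameters()},
    {"params": model.decoder.parameters(), "lr": 1e-4},
], lr=1e-3)
```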
These are just weights and not checkpoints (so it is missing optimizer-related things and so on), but you can use them for initialization. Look at these lines. The last four lines are not relevant in this case, so you can remove them.
So, I can add a CLI option like "--with_weights" or similar, load the weights, but otherwise do everything as for a new model?
If yes, would there be any advantage in starting with the previous parameters and then adding the additional language, so that everything stays in the same order in the embeddings?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi,
Just wanted to know if there has been any movement on this, and if there's a clearer path to fine-tuning the model with new languages / speakers now?
For example, if I wanted to add support for English without having to re-train, what parameters would I have to freeze / train to enable this?
Thanks!
Hello, I am sorry guys, no movement. The training script is also not very fine-tuning friendly :pensive:
Thanks for the reply!
I've been trying to adapt the current code for fine-tuning on the LJSpeech dataset, i.e., adding support for English and for the LJSpeech speaker.
My approach currently involves freezing all parameters of the character encoder using param.requires_grad=False, and training only the language encoder and the speaker encoder. Since there is only one speaker in the LJSpeech dataset, I have even set multi_speaker to False to turn off the adversarial speaker classifier. My model has been training for around 2 days (150 epochs on only the LJSpeech dataset), and while speech is starting to be generated in the LJSpeech speaker's voice, the model appears to have lost all information about the other speakers. Consequently, feeding in any speaker id produces speech only in the LJSpeech speaker's voice.
Does this approach seem right to you?
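As an aside, a quick way to sanity-check which parts actually stay trainable after this kind of freezing; the toy module at the bottom is only there to make the snippet self-contained:

```python
import torch.nn as nn

def report_trainable(model: nn.Module) -> None:
    """Print, per top-level child module, how many parameters are trainable."""
    for name, module in model.named_children():
        n_train = sum(p.numel() for p in module.parameters() if p.requires_grad)
        n_total = sum(p.numel() for p in module.parameters())
        print(f"{name:20s} {n_train:>10,} / {n_total:,} trainable")

# Toy example with a frozen first layer:
toy = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
for p in toy[0].parameters():
    p.requires_grad = False
report_trainable(toy)
```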
Ou, interesting!
Just to clarify ... Are you using GeneratedConvolutionalEncoder as the encoder? If so, how did you add English? Did you make the inner embedding bigger and trainable while fixing the rest of the encoder parameters?
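For illustration, a hedged sketch of what enlarging such an embedding table could look like: copy the pre-trained rows into a bigger table and let only the appended row (the new language) train from its random initialization. The sizes and names here are purely illustrative:

```python
import torch
import torch.nn as nn

old_emb = nn.Embedding(num_embeddings=10, embedding_dim=32)   # e.g. 10 pre-trained languages
new_emb = nn.Embedding(num_embeddings=11, embedding_dim=32)   # one extra slot for English

with torch.no_grad():
    new_emb.weight[:old_emb.num_embeddings] = old_emb.weight  # keep the pre-trained rows
# The appended row keeps its random init and is learned during fine-tuning;
# the enlarged table then replaces the old one inside the encoder.
```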
Also, how do you load the pre-trained model, and how do you treat the speaker embeddings? Because if you set multi_speaker=False, the checkpoint has some extra parameters (and maybe the decoder expects larger inputs?).
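One possible way to handle that kind of mismatch (sketched here under the assumption that the pre-trained file is a plain state_dict; this is not the repository's actual loading code) is to initialize only the tensors whose names and shapes still match and leave everything else at its fresh initialization:

```python
import torch
import torch.nn as nn

def load_matching_weights(model: nn.Module, weights_path: str) -> None:
    """Initialize `model` from a pre-trained state_dict, skipping entries the new
    model does not have or whose shapes differ (e.g. parts removed by
    multi_speaker=False)."""
    pretrained = torch.load(weights_path, map_location="cpu")
    own = model.state_dict()
    compatible = {k: v for k, v in pretrained.items()
                  if k in own and v.shape == own[k].shape}
    own.update(compatible)
    model.load_state_dict(own)
    skipped = [k for k in pretrained if k not in compatible]
    print(f"loaded {len(compatible)} tensors, skipped {len(skipped)}")

# usage (hypothetical file name): load_matching_weights(model, "generated_switching.pyt")
```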
Fixing the decoder seems OK, but you cannot expect the resulting voice to exactly match Linda. Maybe you can try fine-tuning it too, but with a lower learning rate.
Hi,
So unfortunately, our fine-tuning experiments didn't work out. But we're trying another line of experiments in which we're attempting to get a single English speaker to speak in another language (say, for example, German). In this case, since the use case involves only one English speaker, is it sufficient to train the model using English recordings of only the target speaker and German recordings of multiple other speakers? I.e., am I right in concluding that recordings of multiple English speakers are unnecessary, since we wish to synthesise German speech in only one particular English voice?
Thanks!