DatGuy1 opened this issue 4 years ago
@DatGuy1 The model will unlearn the information from the old speakers over time. It shouldn't be returning "to a voice that sounds like it was in the original model" as you say above, but regardless, you will want to train on the speakers you're adding at the same time.
I'm not quite sure I understand. If, hypothetically, I have a model that can imitate 3 speakers and I then want it to imitate 3 different speakers, do I retrain it in its entirety from scratch? Or can I train it on the 3 new speakers in a way that appends the knowledge about the new 3 to the knowledge about the old 3?
Just an idea: Would it be possible to train two different models on two different speakers and then merge speaker_embedding.weight?
@DatGuy1 Yes, that is completely possible. You'd need the rest of the network to be as similar as possible between the two networks, and it might produce a reasonable output.
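Something along these lines is roughly what I'd try for the merge. The key names here are guesses at the checkpoint layout, so adjust them to whatever your checkpoints actually contain:

```python
# Rough sketch of merging two models' speaker embedding tables into one.
# Assumes each checkpoint stores its weights under a 'state_dict' key and the
# table lives at 'speaker_embedding.weight' -- adjust to the real key names.
import torch

ckpt_a = torch.load("model_speakers_a.pt", map_location="cpu")
ckpt_b = torch.load("model_speakers_b.pt", map_location="cpu")
sd_a, sd_b = ckpt_a["state_dict"], ckpt_b["state_dict"]

# Stack the per-speaker rows from both models into one larger embedding table.
merged_rows = torch.cat(
    [sd_a["speaker_embedding.weight"], sd_b["speaker_embedding.weight"]], dim=0
)

# Keep the rest of the network from model A and just widen its speaker table.
sd_a["speaker_embedding.weight"] = merged_rows
ckpt_a["state_dict"] = sd_a
torch.save(ckpt_a, "model_merged.pt")
```

Note that the rest of the network comes from model A only, so model B's speakers will only come out well to the extent that the two networks stayed similar, and you'd need to bump the speaker count in hparams to match the new table size.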
Might be good to have a play with https://github.com/CorentinJ/Real-Time-Voice-Cloning if that's your interest. RTVC (Real-Time-Voice-Cloning) shows the limitations of only changing the embedding when faced with completely new speakers (it kind of works, but emotion and unique vocal qualities suffer).
I think I'm looking for something more nuanced. There are also speakers I'd like to try out that I'd guess have quite a lot of data, which would make RTVC a pretty poor fit.
Anyway, which weights are speaker-specific besides speaker_embedding.weight? I'm guessing the TorchMoji one? What's that one called?
@DatGuy1 I use the pretrained model for TorchMoji, so it's not related to the individual speakers.
There is another speaker embedding inside/before the encoder: https://github.com/CookiePPP/codedump/blob/master/tacotron2-PPP-1.3.0/model.py#L428
`encoder_speaker_embed_dim` in hparams controls it and can be set to 0 to disable/remove that layer. It is not a very important layer.
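For intuition, it's basically a small per-speaker vector concatenated onto the text embedding before the encoder stack. A simplified sketch of that pattern (not the exact code from the repo):

```python
# Simplified illustration of an encoder-side speaker embedding (the general
# pattern behind encoder_speaker_embed_dim, not the repo's exact code).
import torch
import torch.nn as nn

class EncoderSpeakerEmbed(nn.Module):
    def __init__(self, n_speakers, encoder_speaker_embed_dim=64):
        super().__init__()
        self.dim = encoder_speaker_embed_dim
        if self.dim > 0:
            # One small vector per speaker, learned alongside the rest of the model.
            self.embed = nn.Embedding(n_speakers, self.dim)

    def forward(self, text_embeds, speaker_ids):
        # text_embeds: [B, T, symbols_dim], speaker_ids: [B]
        if self.dim == 0:
            return text_embeds  # dim=0 disables the layer entirely
        spk = self.embed(speaker_ids)                               # [B, dim]
        spk = spk.unsqueeze(1).expand(-1, text_embeds.size(1), -1)  # [B, T, dim]
        return torch.cat([text_embeds, spk], dim=-1)                # fed to the encoder convs
```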
PS: I'm not getting good alignment with Flow-TTS, so I'm going to go a little off the deep end and write my own more complex architecture on top of it. Disappointing, since it seems to be quite fast, but whatever, at least I get to have some fun with the remains :man_shrugging:
Edit: I should add that I'm still learning machine learning, so I might've made a mistake, though I'm not sure where or how.
What effect does encoder_speaker_embed_dim have?
Well, I've tried training two different models and then merging the encoder_speaker_embedding and speaker_embedding, but it doesn't seem to work. I suppose I'll have to do something like batching 10 voices per model; I can't think of a better solution.
Also, the 18800 checkpoint model is 281MB but the models I've trained are 832MB. Weird.
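My guess is the extra size is the optimizer state being saved along with my weights (Adam keeps two extra tensors per parameter, which would roughly triple the file size, and 281MB x 3 is about right). Something like this should confirm it; the key names are just my guess at the checkpoint layout:

```python
# Quick check of what a checkpoint actually contains (key names are a guess).
import torch

ckpt = torch.load("checkpoint_latest.pt", map_location="cpu")
print(ckpt.keys())  # e.g. dict_keys(['state_dict', 'optimizer', 'iteration', ...])

# If there's an 'optimizer' entry, dropping it leaves just the model weights,
# which is all that's needed for inference.
ckpt.pop("optimizer", None)
torch.save(ckpt, "checkpoint_weights_only.pt")
```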
How do you generate the .npy alignments from the audio files?