auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License
990 stars 205 forks source link

How to train in ONE-HOT pattern? #68

Open Lukelluke opened 3 years ago

Lukelluke commented 3 years ago

Hi, @auspicious3000 ,

I have searched all the files and issues, but couldn't find any description of how to train in the one-hot pattern, which you suggest we use when we don't need the one-shot (zero-shot) capability.

Has anyone here successfully trained AutoVC in one-hot mode, i.e., without the embeddings from the pretrained speaker encoder?

Hope to get a useful reply from you all!

All the best, Luke Huang

ruclion commented 3 years ago

Hi, I haven't trained the one-hot version, but I have some ideas to share~ The only difference between the one-hot and speaker-encoder versions is whether the speaker's embedding can be trained by the AutoVC training process itself. Training in the one-hot pattern might look like this:

  1. Get the total number of training speakers, maybe 40.
  2. Set up a lookup embedding table, like in multi-speaker Tacotron 2.
  3. Each time you fetch sentences to train on, the input is the mels for the content encoder; instead of a precomputed speaker embedding, feed the speaker id, look it up in the embedding table to get a trainable embedding vector, and concatenate this vector with the content vector.
  4. During backpropagation, each speaker's embedding vector changes a little.
  5. Throughout training, the same speaker keeps the same embedding vector, like a word embedding.
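The steps above can be sketched in PyTorch. This is not the repo's actual code; the sizes (`num_speakers`, `dim_emb`, `dim_content`) are illustrative assumptions, and the point is only that `nn.Embedding` gives one trainable vector per speaker id:

```python
import torch
import torch.nn as nn

num_speakers = 40   # step 1: total number of training speakers (assumed)
dim_emb = 256       # speaker-embedding size (assumed)
dim_content = 64    # content-code size (assumed)
batch, t = 2, 128   # batch size and number of mel frames (assumed)

# step 2: one trainable vector per speaker, like multi-speaker Tacotron 2
spk_table = nn.Embedding(num_speakers, dim_emb)

# step 3: look up each speaker's vector by id and concatenate it with the
# content codes along the feature dimension
content = torch.randn(batch, t, dim_content)      # stand-in for content-encoder output
spk_id = torch.tensor([3, 17])                    # integer speaker ids

spk_emb = spk_table(spk_id)                       # (batch, dim_emb)
spk_emb = spk_emb.unsqueeze(1).expand(-1, t, -1)  # broadcast over time frames
decoder_in = torch.cat([content, spk_emb], dim=-1)

# steps 4-5: spk_table.weight is an ordinary trainable parameter, so the
# reconstruction loss updates it a little each step, and the same id always
# indexes the same row, exactly like a word embedding.
```

The table's rows are then optimized jointly with the rest of AutoVC by the usual autoencoder loss; no separate speaker encoder is needed.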

In fact, the "one-hot pattern" the author has in mind may just be the simplest way to train the model in the multi-speaker setting. It can even be better than the speaker-encoder version, because its embeddings are updated by gradients, while the pretrained speaker encoder's embeddings are fixed.