auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
MIT License

How to change speaker encoder to one-hot encoder #102

Open Jwaminju opened 2 years ago

Jwaminju commented 2 years ago

Hi, I'm interested in this project, and I'm looking forward to running it on my Korean audio files. However, I'm an undergraduate student with little experience in audio processing programming.

I've read many of the issues in this repo, but I'm still confused, so I opened this one. The zero-shot model demo produced results, but I want to run AutoVC-One-Hot for comparison. I assume I have to change a file to use a one-hot encoder. I tried replacing the speaker encoder with a one-hot encoding using tf.one_hot, but the printed shape of the variable emb (which was [1, 128, 80, 256]) did not match the shape of the result of C(melsp) (which was [1, 256]). I used the same data as the demo wav files.


Could you help me with how to code the one-hot encodings? Thank you.

yenebeb commented 2 years ago

Hi @Jwaminju,

Not sure if you still need it, but this might be helpful for anyone looking to do the same.

The emb variable is indeed the right one to change. The embeddings currently used are produced by a speaker encoder trained with the GE2E loss. To change this to a one-hot embedding, you can simply create a zero-filled array and give each speaker its own id. This can be done in the file.

I haven't tested it (and wrote this quite quickly), but something like this should work. Replace:

for speaker in sorted(subdirList):
    print('Processing speaker: %s' % speaker)
    utterances = []
    _, _, fileList = next(os.walk(os.path.join(dirName,speaker)))
    # make speaker embedding
    assert len(fileList) >= num_uttrs
    idx_uttrs = np.random.choice(len(fileList), size=num_uttrs, replace=False)
    embs = []


with:

# use enumerate to get the speaker index
for i, speaker in enumerate(sorted(subdirList)):
    print('Processing speaker: %s' % speaker)
    utterances = []

    # -----
    # one hot embedding
    # create a zero array of shape (256,); this shape is right since squeeze effectively changes (1, 256) into (256,)
    emb = np.zeros(256, dtype=np.float32)
    # set speaker id
    emb[i] = 1

The whole second for loop can be removed here, since we don't need the mel spectrogram to create the embeddings anymore.

If you have more than 256 speakers, or you want to change the embedding size to match the number of speakers you have, you'll have to pass the --dim_emb parameter when running main.
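For anyone who wants to see the sizing point spelled out, here is a minimal, untested-against-the-repo sketch of building one one-hot vector per speaker with a configurable length. The function name one_hot_embeddings and the speaker names are hypothetical; only the idea (a zero vector with a single 1 at the speaker's index, and a length that must be at least the number of speakers) comes from the comment above.

```python
import numpy as np


def one_hot_embeddings(speakers, dim_emb=256):
    """Map each speaker name to a one-hot float32 vector of length dim_emb.

    Hypothetical helper for illustration; it is not part of the AutoVC repo.
    """
    ordered = sorted(set(speakers))
    if len(ordered) > dim_emb:
        # emb[i] = 1 would raise IndexError for i >= dim_emb, so fail early
        raise ValueError(
            "%d speakers do not fit into dim_emb=%d" % (len(ordered), dim_emb)
        )
    table = {}
    for i, name in enumerate(ordered):
        emb = np.zeros(dim_emb, dtype=np.float32)
        emb[i] = 1.0  # the position of the 1 is the speaker id
        table[name] = emb
    return table


# Hypothetical speaker names; the embedding size matches the speaker count.
speakers = ["p225", "p226", "p227"]
table = one_hot_embeddings(speakers, dim_emb=len(speakers))
print(table["p226"])  # [0. 1. 0.]
```

Whatever length is chosen here has to agree with the embedding size the model is trained with, which is what the --dim_emb remark above is about.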

WGQ123-code commented 2 years ago

Hi @yenebeb, it's a pleasure to read your comments. I need to use the speaker embedding, too. For a particular speaker, we already know the position of the '1'. So if I use a one-hot embedding, is training with it still necessary? I don't know if my understanding is correct. If not, please give me some guidance. Thanks.

WildFire212 commented 2 years ago

@yenebeb Thanks a lot for the comment! I went through most of the issues in the repo, and this is the only one that gives some explanation of the one-hot encoder. I am still a bit confused. By removing the second loop, would we remove the mel-spectrograms entirely? It would be really helpful if you could explain this or point to a resource about it.

yenebeb commented 2 years ago

@WGQ123-code Short answer: yes, it's important to train with the one-hot embedding.

Somewhat longer answer: you replace the whole embedding with one-hot embeddings. The 'only' difference between training with one-hot embeddings and with the embeddings generated by the GE2E encoder is their 'accuracy'. GE2E tries to create the same embedding for the same speaker, which means the embedding carries some information about the speaker. By training on the GE2E embeddings, the model learns to recognise this information and can therefore also work on unseen data (zero-shot learning). With one-hot embeddings you remove that information and force the model to train on the mel. The model still has to know which voices (mels) come from the same speaker, however, and this is why you need the one-hot embedding at training time.
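To make the "you remove said information" point concrete, here is a small sketch (my own illustration, not code from the repo) that compares one-hot speaker codes with each other. The speaker count of 4 is an arbitrary assumption.

```python
import numpy as np

num_speakers = 4  # hypothetical number of training speakers

# Row i is the one-hot code for speaker i.
one_hot = np.eye(num_speakers, dtype=np.float32)

# Cosine similarity between every pair of speaker codes.
norms = np.linalg.norm(one_hot, axis=1, keepdims=True)
unit = one_hot / norms
cosine = unit @ unit.T

# identity matrix: 1 on the diagonal, 0 everywhere else
print(cosine)
```

Every pair of different speakers has similarity 0, so the code only says "same speaker or not" and nothing about how two voices relate. A speaker who was not in the training set has no row at all, which is why the one-hot variant cannot be used zero-shot.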

@WildFire212 Yes, you do remove the mel-spectrograms, but if you look carefully you'll notice that they're only used to create the GE2E embedding. Since you want one-hot embeddings, they are not needed at this step, and skipping them will speed up the process quite a bit. The mels are actually created and saved when you run make_spect.
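As a rough illustration of why the saved mels stay untouched, here is a self-contained sketch that builds per-speaker entries from file names alone. The function name build_one_hot_metadata, the entry layout [speaker, embedding, file paths...], the one-folder-per-speaker directory layout, and the spk_a/spk_b names are all assumptions made for this example; check them against the actual script before relying on them.

```python
import os
import tempfile

import numpy as np


def build_one_hot_metadata(root_dir, dim_emb=256):
    """Return one [speaker, one_hot_emb, file paths...] entry per speaker folder.

    Hypothetical helper; the entry layout is an assumption for illustration.
    """
    _, subdirs, _ = next(os.walk(root_dir))
    subdirs = sorted(subdirs)
    if len(subdirs) > dim_emb:
        raise ValueError("more speakers than embedding dimensions")
    metadata = []
    for i, speaker in enumerate(subdirs):
        emb = np.zeros(dim_emb, dtype=np.float32)
        emb[i] = 1.0
        # Only the file names are listed; the spectrogram arrays are never loaded.
        _, _, files = next(os.walk(os.path.join(root_dir, speaker)))
        paths = [os.path.join(speaker, name) for name in sorted(files)]
        metadata.append([speaker, emb] + paths)
    return metadata


# Tiny throwaway directory tree standing in for the saved spectrograms.
with tempfile.TemporaryDirectory() as root:
    for speaker in ("spk_a", "spk_b"):
        os.makedirs(os.path.join(root, speaker))
        dummy_mel = np.zeros((10, 80), dtype=np.float32)
        np.save(os.path.join(root, speaker, "utt1.npy"), dummy_mel)
    metadata = build_one_hot_metadata(root, dim_emb=2)

for entry in metadata:
    print(entry[0], entry[1], entry[2:])
```

The spectrogram files are still referenced by path for training; they are just no longer read to compute an embedding.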

WildFire212 commented 2 years ago

@yenebeb Thank you for the clarification!

WGQ123-code commented 2 years ago

@yenebeb Thank you very much for your guidance! Wish you a happy life!