lucashueda closed this issue 1 year ago
Now, at inference time only, we can pass the conditioning speaker (the speaker with recorded expressive speech) to condition the style encoder and the duration and pitch predictors. After that, the desired speaker embedding is used:

```
# speaker_emb: desired speaker; cond_speaker_emb: conditioning (expressive) speaker
encoder_output = encoder_output + cond_speaker_emb
style_emb = style_encoder(encoder_output)
encoder_output += style_emb
pitch_emb = pitch_predictor(encoder_output)
encoder_output += pitch_emb
dur_emb = duration_predictor(encoder_output)
encoder_output += dur_emb
# Here we change the speaker embedding, removing the conditioning one and adding the desired one
encoder_output = encoder_output - cond_speaker_emb + speaker_emb
melspectrogram = decoder(encoder_output)
```
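To make the arithmetic of the swap concrete, here is a minimal runnable NumPy sketch. The module functions (`style_encoder`, `pitch_predictor`, `duration_predictor`, `decoder`) are hypothetical stubs standing in for the trained network components, and the embedding dimension is arbitrary; only the add-then-subtract pattern around the prosody modules is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hypothetical embedding dimension

# Stub modules standing in for the trained model components.
def style_encoder(x):      return 0.1 * np.tanh(x)
def pitch_predictor(x):    return 0.1 * np.tanh(x + 1.0)
def duration_predictor(x): return 0.1 * np.tanh(x - 1.0)
def decoder(x):            return x  # identity stub

encoder_output = rng.normal(size=D)
speaker_emb = rng.normal(size=D)       # desired (target) speaker
cond_speaker_emb = rng.normal(size=D)  # conditioning (expressive) speaker

# Condition the prosody modules on the expressive speaker...
h = encoder_output + cond_speaker_emb
h = h + style_encoder(h)
h = h + pitch_predictor(h)
h = h + duration_predictor(h)

# ...then swap in the desired speaker embedding before decoding.
h = h - cond_speaker_emb + speaker_emb
melspectrogram = decoder(h)
print(melspectrogram.shape)  # prints (8,)
```

Note that the subtraction only cancels the conditioning embedding that was added directly; its influence on the style, pitch, and duration outputs is kept, which is exactly what lets the expressive speaker's prosody carry over to the desired speaker's voice.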