lucashueda closed this issue 1 year ago
Now, at inference time only, we can pass the conditioning speaker (the speaker with recorded expressive speech) to condition the style encoder and the duration and pitch predictors. After that, the desired speaker embedding is used:

```
# speaker_emb: desired speaker; cond_speaker_emb: conditioning (expressive) speaker
encoder_output = encoder_output + cond_speaker_emb
style_emb = style_encoder(encoder_output)
encoder_output += style_emb
pitch_emb = pitch_predictor(encoder_output)
encoder_output += pitch_emb
dur_emb = duration_predictor(encoder_output)
encoder_output += dur_emb
# Here we change the speaker embedding, removing the conditioning one and adding the desired one
encoder_output = encoder_output - cond_speaker_emb + speaker_emb
melspectrogram = decoder(encoder_output)
```
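To make the arithmetic of the swap concrete, here is a minimal runnable NumPy sketch. The module functions (`style_encoder`, `pitch_predictor`, `duration_predictor`, `decoder`) are hypothetical stubs standing in for the trained network components, and the embedding dimension is arbitrary; only the add-then-subtract pattern around the prosody modules is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hypothetical embedding dimension

# Stub modules standing in for the trained model components.
def style_encoder(x):      return 0.1 * np.tanh(x)
def pitch_predictor(x):    return 0.1 * np.tanh(x + 1.0)
def duration_predictor(x): return 0.1 * np.tanh(x - 1.0)
def decoder(x):            return x  # identity stub

encoder_output = rng.normal(size=D)
speaker_emb = rng.normal(size=D)       # desired (target) speaker
cond_speaker_emb = rng.normal(size=D)  # conditioning (expressive) speaker

# Condition the prosody modules on the expressive speaker...
h = encoder_output + cond_speaker_emb
h = h + style_encoder(h)
h = h + pitch_predictor(h)
h = h + duration_predictor(h)

# ...then swap in the desired speaker embedding before decoding.
h = h - cond_speaker_emb + speaker_emb
melspectrogram = decoder(h)
print(melspectrogram.shape)  # prints (8,)
```

Note that the subtraction only cancels the conditioning embedding that was added directly; its influence on the style, pitch, and duration outputs is kept, which is exactly what lets the expressive speaker's prosody carry over to the desired speaker's voice.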