AI-Unicamp / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

Adding speaker condition to prosodic features #10

Closed lucashueda closed 1 year ago

lucashueda commented 1 year ago

At inference time only, we can now pass a conditioning speaker (the speaker with recorded expressive speech) to condition the style encoder, the duration predictor, and the pitch predictor. After that, the conditioning speaker embedding is replaced by the desired speaker embedding, as follows:

speaker_emb, cond_speaker_emb

encoder_output = encoder_output + cond_speaker_emb

style_emb = style_encoder(encoder_output)
encoder_output += style_emb

pitch_emb = pitch_predictor(encoder_output)
encoder_output += pitch_emb

dur_emb = duration_predictor(encoder_output)
encoder_output += dur_emb

Here we swap the speaker embedding, removing the conditioning one and adding the desired one:

encoder_output = encoder_output - cond_speaker_emb + speaker_emb

melspectrogram = decoder(encoder_output)
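
The steps above can be sketched in plain Python to make the embedding swap concrete. This is only an illustration under assumptions, not the repository's implementation: `synthesize`, `add_to_frames`, and `add_frames` are hypothetical names, tensors are stood in for by lists of floats, and the style encoder, pitch predictor, duration predictor, and decoder are passed in as arbitrary callables.

```python
from typing import Callable, List

Vector = List[float]
Frames = List[Vector]  # one feature vector per encoder time step
Module = Callable[[Frames], Frames]


def add_to_frames(frames: Frames, emb: Vector, sign: float = 1.0) -> Frames:
    """Add (or, with sign=-1.0, subtract) one embedding to every frame."""
    return [[x + sign * e for x, e in zip(frame, emb)] for frame in frames]


def add_frames(a: Frames, b: Frames) -> Frames:
    """Element-wise sum of two frame sequences with the same shape."""
    return [[x + y for x, y in zip(fa, fb)] for fa, fb in zip(a, b)]


def synthesize(
    encoder_output: Frames,
    speaker_emb: Vector,       # desired (target) speaker
    cond_speaker_emb: Vector,  # expressive speaker used only for conditioning
    style_encoder: Module,
    pitch_predictor: Module,
    duration_predictor: Module,
    decoder: Module,
) -> Frames:
    # Condition the prosodic modules on the expressive speaker.
    h = add_to_frames(encoder_output, cond_speaker_emb)
    h = add_frames(h, style_encoder(h))
    h = add_frames(h, pitch_predictor(h))
    h = add_frames(h, duration_predictor(h))

    # Swap speakers: remove the conditioning embedding, add the desired one.
    h = add_to_frames(h, cond_speaker_emb, sign=-1.0)
    h = add_to_frames(h, speaker_emb)

    return decoder(h)
```

With stand-in modules that contribute nothing, the result reduces to `encoder_output + speaker_emb`, which is a quick way to check that the conditioning embedding does not leak into the decoder input directly. Its influence remains only through the style, pitch, and duration embeddings that were computed from it.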