TensorSpeech / TensorFlowTTS

TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German; easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

Why sum input and speaker embeddings? #628

Closed: iamanigeeit closed this issue 2 years ago

iamanigeeit commented 3 years ago

Hi @dathudeptrai , thanks for the work in creating and maintaining this repo!

In the code below from `tensorflow_tts/models/tacotron2.py`, class `TFTacotronEmbeddings`:

        # sum all embedding
        embeddings += extended_speaker_features

This doesn't seem correct. I think we are supposed to concat, not sum. From the paper "Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron", Section 3.1:

For each example, the d_S-dimensional speaker embedding corresponding to the true speaker of the example is broadcast-concatenated with the L_T × d_T-dimensional transcript encoder representation to form a (d_T + d_S)-dimensional sequence of encoder embeddings that the decoder will attend to.
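
For concreteness, here is a minimal sketch of the two options in TensorFlow (the tensor shapes and the `Dense` projection are hypothetical; in `TFTacotronEmbeddings` the sum is actually applied at the character-embedding stage rather than to the encoder output):

    import tensorflow as tf

    batch, max_len, d_t, d_s = 2, 50, 512, 128

    # Stand-ins for the encoder output and a speaker-embedding lookup.
    text_encodings = tf.random.normal([batch, max_len, d_t])  # [B, L_T, d_T]
    speaker_embeddings = tf.random.normal([batch, d_s])       # [B, d_S]

    # Broadcast-concat, as in the Prosody Transfer paper: tile the speaker
    # embedding across time, then concat along the feature axis.
    tiled = tf.tile(speaker_embeddings[:, tf.newaxis, :], [1, max_len, 1])
    concat_out = tf.concat([text_encodings, tiled], axis=-1)  # [B, L_T, d_T + d_S]

    # Sum, as in the current code: requires d_S == d_T, so a 128-dim speaker
    # embedding would first need a (hypothetical) projection up to 512.
    project = tf.keras.layers.Dense(d_t, use_bias=False)
    sum_out = text_encodings + project(speaker_embeddings)[:, tf.newaxis, :]  # [B, L_T, d_T]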

iamanigeeit commented 3 years ago

I'm working on implementing prosody / emotion (which will be a separate module). Would it be better to split the speaker embedding into a separate module as well?

dathudeptrai commented 3 years ago

@iamanigeeit Sum or concat is OK; it's not important, but sum can reduce the number of parameters :D.

iamanigeeit commented 3 years ago

Hmm... but does it really reduce parameters? To sum with the text embedding, the speaker embedding currently has to be 512-dim, whereas the speaker embedding in the Prosody paper is only 128-dim.

I also think concat keeps the text and speaker representations independent.
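
A back-of-the-envelope count, with hypothetical sizes (100 speakers, one downstream layer of width 512), suggests the answer depends on the number of speakers versus the downstream widths:

    # Hypothetical sizes: 100 speakers, 512-dim text embedding,
    # 128-dim speaker embedding, one 512-unit downstream layer.
    n_speakers, d_t, d_s, h = 100, 512, 128, 512

    # Sum: the speaker table itself must be d_T-dim to match the text embedding.
    sum_params = n_speakers * d_t               # 51,200

    # Concat: a smaller d_S-dim table, but every layer consuming the
    # concatenated features grows its input by d_S (h * d_S extra weights each).
    concat_params = n_speakers * d_s + h * d_s  # 12,800 + 65,536 = 78,336

So with few speakers, sum can indeed be cheaper, because concat pays the extra input width again in every consuming layer; with many speakers the smaller 128-dim table would win.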

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.