iamanigeeit closed this issue 2 years ago
I'm working on implementing prosody / emotion (which will be a separate module). Would it be better to split the speaker embedding into a separate module as well?
@iamanigeeit Either sum or concat is OK. It's not important, but sum can reduce the number of parameters :D.
Hmm... but does it really reduce parameters? Currently, the speaker embedding has to be 512-dim so that it can be summed with the text embedding, but the speaker embedding in the Prosody paper is only 128-dim.
I also think concat keeps the text and speaker representations independent.
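To make the sum vs. concat trade-off concrete, here is a rough NumPy sketch (not the repo's actual code). The 512 and 128 dimensions come from the comments above; the number of speakers and the width of the next layer are made-up values just for illustration, and biases are ignored.

```python
import numpy as np

TEXT_DIM = 512      # text embedding size mentioned above
SPEAKER_DIM = 128   # speaker embedding size mentioned above for the Prosody paper
NEXT_UNITS = 512    # hypothetical width of the first layer consuming the combined features


def combine_sum(text, speaker):
    """Add one speaker vector to every time step; speaker must be TEXT_DIM wide."""
    return text + speaker[None, :]


def combine_concat(text, speaker):
    """Broadcast the speaker vector over time and append it to the features."""
    tiled = np.tile(speaker[None, :], (text.shape[0], 1))
    return np.concatenate([text, tiled], axis=-1)


def param_counts(n_speakers, text_dim, speaker_dim, next_units):
    """Speaker table plus input weights of the next layer, for (sum, concat)."""
    sum_params = n_speakers * text_dim + text_dim * next_units
    concat_params = n_speakers * speaker_dim + (text_dim + speaker_dim) * next_units
    return sum_params, concat_params


text = np.zeros((7, TEXT_DIM))  # 7 time steps of text features
summed = combine_sum(text, np.ones(TEXT_DIM))
concatenated = combine_concat(text, np.ones(SPEAKER_DIM))
print(summed.shape, concatenated.shape)

for n_speakers in (10, 1000):  # hypothetical speaker counts
    print(n_speakers, param_counts(n_speakers, TEXT_DIM, SPEAKER_DIM, NEXT_UNITS))
```

Under these assumptions, sum has fewer parameters when there are few speakers, because concat widens the input of the next layer. With many speakers, the smaller 128-dim speaker table wins. So whether sum "reduces parameters" depends on the number of speakers and on what consumes the combined features.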
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Hi @dathudeptrai, thanks for the work in creating and maintaining this repo!
In the below code from tensorflow_tts/models/tacotron2.py, class TFTacotronEmbeddings:
This doesn't seem correct. I think we are supposed to concat, not sum. From the paper "Towards End to End Prosody Transfer for Expressive Speech Synthesis with Tacotron", Section 3.1: