Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation

GMM attention as alternative to Location Sensitive #265

Open alexdemartos opened 5 years ago

alexdemartos commented 5 years ago

Hi,

following the paper "Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron", the authors use GMM attention because it generalizes better to longer utterances. I have experienced these attention issues with long sentences myself in many models trained with the original Location Sensitive Attention.

I have tried to implement GMM attention by following the code found in: https://github.com/keithito/tacotron/issues/136

However, I got the following error:

    Exiting due to exception: Incompatible shapes: [10,118] vs. [10,16]
    ...
    Caused by op 'Tacotron_model/inference/decoder/while/CustomDecoderStep/GmmAttention/add', defined at:
    ...
    kappa = tf.expand_dims(previous_kappa + tf.exp(kappa_hat), axis=2)
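For reference, the implementation in that issue follows Graves-style GMM attention: the query is projected to three sets of mixture parameters, and kappa is accumulated across decoder steps, so the previous kappa has to be carried between steps. A rough sketch of one step (num_mixtures and the projection layer are placeholders here, not the exact code from that issue):

```python
import tensorflow as tf

def gmm_attention_step(query, previous_kappa, memory, num_mixtures=16):
    """One step of Graves-style GMM attention (sketch only).

    query:          [batch, query_depth]      decoder LSTM output
    previous_kappa: [batch, num_mixtures]     accumulated kappa from the previous step
    memory:         [batch, max_time, units]  encoder outputs
    """
    max_time = tf.shape(memory)[1]

    # Project the query to the mixture parameters.
    params = tf.layers.dense(query, 3 * num_mixtures)            # [batch, 3 * num_mixtures]
    alpha_hat, beta_hat, kappa_hat = tf.split(params, 3, axis=1)

    alpha = tf.expand_dims(tf.exp(alpha_hat), axis=2)            # [batch, num_mixtures, 1]
    beta = tf.expand_dims(tf.exp(beta_hat), axis=2)
    kappa = previous_kappa + tf.exp(kappa_hat)                   # [batch, num_mixtures]

    # Evaluate the mixture at every encoder position u = 0 .. max_time-1.
    u = tf.reshape(tf.cast(tf.range(max_time), tf.float32), [1, 1, -1])
    phi = tf.reduce_sum(
        alpha * tf.exp(-beta * (tf.expand_dims(kappa, axis=2) - u) ** 2),
        axis=1)                                                  # [batch, max_time]

    # Context vector: alignment-weighted sum of the encoder outputs.
    context = tf.squeeze(tf.matmul(tf.expand_dims(phi, 1), memory), axis=1)
    return context, phi, kappa
```

The mismatch in the error above ([10,118] vs. [10,16]) happens at previous_kappa + tf.exp(kappa_hat), which suggests previous_kappa is being fed the [batch, max_time] alignment zeros instead of a [batch, num_mixtures] tensor.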

It would be great if someone more experienced could tell me how to modify the code from @bfs18 to work with this repo.

Thanks in advance.

dathudeptrai commented 5 years ago

Hi, I think you need to modify TacotronDecoderCell in this repo:

- add a new state variable to TacotronDecoderCellState, initialized with state = self._attention_mechanism.initial_state(batch_size, dtype);
- change the attention call to context_vector, alignments, next_state = _compute_attention(self._attention_mechanism, LSTM_output, state.state) and carry state = next_state into the next time step.

In my experience with GMM attention, because it is purely location-based (the alignment is computed only from the decoder hidden state, not from the encoder hidden states), the alignment graph is very strong, especially on long sentences, but the synthesized audio is not as good as with Location Sensitive Attention. If you want better generated audio, you need to find some way to feed the encoder hidden states into the computation of alpha_hat, beta_hat and kappa_hat; I did this and it worked.

[alignment-temp: attached alignment plot]
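In code, the changes I mean look roughly like this (only a simplified sketch; the field name attention_state is just an example, and the real TacotronDecoderCellState and _compute_attention in Architecture_wrappers.py have more fields and arguments):

```python
import collections
import tensorflow as tf

# Simplified stand-in for TacotronDecoderCellState with the extra slot added.
DecoderState = collections.namedtuple(
    "DecoderState", ("attention", "alignments", "attention_state"))

def initial_decoder_state(attention_mechanism, batch_size, attention_size, max_time):
    return DecoderState(
        attention=tf.zeros([batch_size, attention_size]),
        alignments=tf.zeros([batch_size, max_time]),
        # New slot: the attention mechanism's own state (e.g. the accumulated kappa).
        attention_state=attention_mechanism.initial_state(batch_size, tf.float32))

def decoder_step(attention_mechanism, lstm_output, state):
    # The mechanism consumes its previous state and returns the updated one,
    # which must be carried into the next time step.
    context_vector, alignments, next_attention_state = attention_mechanism(
        lstm_output, state.attention_state)
    return context_vector, state._replace(attention=context_vector,
                                          alignments=alignments,
                                          attention_state=next_attention_state)
```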

alexdemartos commented 5 years ago

Hi @dathudeptrai ,

first, thank you very much for your response. Indeed, your alignment graph looks really strong. The default attention + dropout gives very inconsistent results for a production system (some sentences are excellent, some are good enough, some are just noise).

Any chance you have this code publicly available? In any case, thanks for your advice.

johnnyxuan commented 5 years ago

Hi @alexdemartos, have you solved the problem you mentioned with GMM attention in this repo? Could you give me some suggestions on how to solve it?

alexdemartos commented 5 years ago

Hi @johnnyxuan ,

indeed. As @dathudeptrai pointed out, you just need to change the zero_state method of TacotronDecoderCell (Architecture_wrappers.py). If you are using the GmmAttention class, the alignments key becomes:

alignments=self._attention_mechanism.initial_state(batch_size, tf.float32)
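Here initial_state just has to return the zero-initialized kappa of shape [batch_size, num_mixtures] (instead of [batch_size, max_time] zeros), which is what caused the shape mismatch in my first post. Something like this, depending on your GmmAttention implementation:

```python
import tensorflow as tf

class GmmAttention(object):  # sketch only; a real class has the full attention logic
    def __init__(self, memory, num_mixtures=16):
        self._memory = memory
        self._num_mixtures = num_mixtures

    def initial_state(self, batch_size, dtype):
        # Previous kappa for each mixture component; all zeros at the first step.
        return tf.zeros([batch_size, self._num_mixtures], dtype=dtype)
```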

Unfortunately I am experiencing some issues using the implementation mentioned in my first post. These are the alignments I am getting:

[attached alignment plots: step-30000-align, step-10000-align]

As you can see, there is a constant band of attention on the first encoder embedding. I will have to debug the code to find out what is causing this.

johnnyxuan commented 5 years ago

@alexdemartos thanks very much! I changed the code and it works.

HaiFengZeng commented 5 years ago

@johnnyxuan @alexdemartos any idea how to get the information from the encoder hidden states into the calculation of alpha_hat, beta_hat and kappa_hat, as @dathudeptrai mentioned above? I find it hard to make the query ([batch_size, query_depth]) work with the encoder outputs ([batch_size, max_time, key_depth]).
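One idea I have been considering (just a guess, not necessarily what @dathudeptrai did) is to pool the encoder outputs with the previous alignments into a fixed-size summary and concatenate it with the query before projecting to the mixture parameters, roughly:

```python
import tensorflow as tf

def gmm_params_with_encoder_info(query, memory, previous_alignments, num_mixtures=16):
    """Sketch: mix encoder information into the GMM parameter projection.

    query:               [batch, query_depth]      decoder LSTM output
    memory:              [batch, max_time, units]  encoder outputs
    previous_alignments: [batch, max_time]         phi from the previous step
    """
    # Pool the encoder outputs with the previous alignments into a fixed-size
    # summary that can be concatenated with the query.
    encoder_summary = tf.squeeze(
        tf.matmul(tf.expand_dims(previous_alignments, 1), memory), axis=1)  # [batch, units]

    features = tf.concat([query, encoder_summary], axis=1)
    params = tf.layers.dense(features, 3 * num_mixtures)
    alpha_hat, beta_hat, kappa_hat = tf.split(params, 3, axis=1)
    return alpha_hat, beta_hat, kappa_hat
```

Not sure whether this is the right way, so any pointers would be appreciated.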