artificertxj1 opened 4 years ago
@Liujingxiu23 If you use the code in my post, you will see that the layer learns to attend (not in the right way, but it shows a monotonic line) after 70k-80k iterations. The attention line is clear in the first few frames and quickly disappears for the following frames. To achieve location-sensitive attention, you need previous alignment information to accumulate through time steps. The attention used in Flowtron is closer to the idea of attention flow described in the bi-directional attention flow (BiDAF) paper. It's actually memoryless and uses only local information to calculate the alignment score (compare the inputs of the attention layer used in this model with the inputs of a standard Tacotron decoder attention cell, and you will see what I mean). Also, after spending many hours on the code, I'm more confused by the statement about using flows in TTS. It looks like sampling from a normal distribution and pushing the sampled vector through a flow transformation are not part of standard training. If you'd like to test the idea of a flow-based TTS model, I suggest you read the Glow-TTS paper. I personally think their paper shows a better demonstration of flow transformations and gives a great idea for making monotonic alignment in a flow-based model.
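For illustration, here is a minimal sketch of Tacotron 2 style location-sensitive attention in PyTorch, showing how the cumulative alignment from previous decoder steps feeds into the energy computation (the "memory" the content-only score lacks). Names and dimensions are my own assumptions, not Flowtron's actual code, and Tacotron 2 additionally stacks the previous step's weights as a second conv channel, which I omit here for brevity:

```python
import torch
import torch.nn as nn

class LocationSensitiveAttention(nn.Module):
    """Sketch of location-sensitive attention (Tacotron 2 flavor).

    Unlike a memoryless, content-only score, the energies also condition
    on the cumulative attention weights from previous decoder steps,
    which is what encourages the alignment to move forward monotonically.
    """
    def __init__(self, attn_dim, query_dim, memory_dim,
                 location_filters=32, location_kernel=31):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_layer = nn.Linear(memory_dim, attn_dim, bias=False)
        self.location_conv = nn.Conv1d(1, location_filters, location_kernel,
                                       padding=(location_kernel - 1) // 2,
                                       bias=False)
        self.location_dense = nn.Linear(location_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory, cum_alignment):
        # query: (B, query_dim); memory: (B, T, memory_dim)
        # cum_alignment: (B, T), running sum of past attention weights
        loc = self.location_conv(cum_alignment.unsqueeze(1))   # (B, F, T)
        loc = self.location_dense(loc.transpose(1, 2))         # (B, T, attn_dim)
        energies = self.v(torch.tanh(
            self.query_layer(query).unsqueeze(1) +
            self.memory_layer(memory) + loc)).squeeze(-1)      # (B, T)
        alignment = torch.softmax(energies, dim=-1)
        context = torch.bmm(alignment.unsqueeze(1), memory).squeeze(1)
        return context, alignment  # caller adds alignment into cum_alignment
```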
@artificertxj1 Thank you for the reply. I will train for more steps and train with guided attention loss to see what happens. I will also check Glow-TTS to see how it realizes alignment.
Flowtron uses Tacotron 1 attention. You can add location-sensitive attention (Tacotron 2) to a pre-trained Flowtron model and it will improve attention.
With respect to normalizing flows, think of it as learning a mapping from the data distribution to a known distribution, for example a Gaussian distribution. The mapping is a chain of affine transformations that can be either autoregressive or bi-partite.
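As a concrete, hedged illustration of one such affine step (sign conventions and how log_s and b are conditioned differ between implementations such as WaveGlow and Flowtron):

```python
import torch

def affine_flow_step(x, log_s, b):
    """One affine step of a normalizing flow (sketch).

    Maps data x toward a standard Gaussian: z = (x - b) * exp(-log_s).
    The log-determinant term enters the negative log-likelihood, so
    training maximizes the likelihood of x; per element the NLL is
    roughly 0.5 * z**2 - log_det (unit-variance prior, constants dropped).
    """
    z = (x - b) * torch.exp(-log_s)
    log_det = -log_s.sum()
    return z, log_det

def affine_flow_step_inverse(z, log_s, b):
    # Inference direction: sample z ~ N(0, sigma^2 I), then invert the chain.
    return z * torch.exp(log_s) + b
```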
Glow-TTS applies a variant of the algorithm proposed in Align-TTS (https://arxiv.org/pdf/2003.01950.pdf) to normalizing flows.
Okay, I think I finally understand the point of the flow transformation (the meaning of log_s and b) after trying to rewrite the flow in a sequential manner and noticing the difference between the Tacotron 2 decoder output and the Flowtron output. I will try location-sensitive attention and see if it works better.
Yes, location-sensitive attention does work better, and it should be used to fine-tune the model at the end; otherwise training will take unnecessarily long.
@artificertxj1 did you manage to get the Graves attention working with Flowtron?
Nope, never made it work.
I'm trying to implement Graves (GMM) attention based on the Mozilla TTS repo. Here is a link with a brief discussion of the implementation by the repo maintainer (https://erogol.com/two-methods-for-better-attention-in-tacotron/). The code below is my implementation adapted to Flowtron. When I train it with a single flow, it just doesn't work well. Only the first frame's alignment gets close to the maximum value (which is 0.5 in Graves attention instead of 1.0), and the attention scores for the other frames are really low. Convergence is also slow. Can someone help me figure out which part of the code needs a fix?
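For reference, here is a minimal, self-contained sketch of the sigmoid-difference GMM attention variant described in the linked post; this is my own illustration under assumed layer sizes and a softplus parameterization, not the poster's omitted snippet:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMAttention(nn.Module):
    """Sketch of Graves-style GMM attention (sigmoid-window flavor).

    Each decoder step predicts, for each of K mixture components, a
    weight, a positive step (delta) added to the running mean, and a
    scale. Monotonicity comes from the means only moving forward; the
    sigmoid-difference window peaks below 1.0 by construction, which
    matches the ~0.5 maximum mentioned above.
    """
    def __init__(self, query_dim, K=5):
        super().__init__()
        self.K = K
        self.mlp = nn.Sequential(
            nn.Linear(query_dim, query_dim),
            nn.Tanh(),
            nn.Linear(query_dim, 3 * K),
        )
        self.mu = None  # running means; call init_states per utterance

    def init_states(self, batch_size, device):
        self.mu = torch.zeros(batch_size, self.K, 1, device=device)

    def forward(self, query, memory):
        # query: (B, query_dim); memory: (B, T, memory_dim)
        B, T, _ = memory.shape
        w_hat, delta_hat, sigma_hat = self.mlp(query).chunk(3, dim=-1)
        w = torch.softmax(w_hat, dim=-1).unsqueeze(-1)      # (B, K, 1)
        delta = F.softplus(delta_hat).unsqueeze(-1)         # forward step >= 0
        sigma = F.softplus(sigma_hat).unsqueeze(-1) + 1e-5  # scale > 0
        self.mu = self.mu + delta                           # monotonic means
        j = torch.arange(T, device=memory.device).float().view(1, 1, T)
        phi = torch.sigmoid((j - self.mu + 0.5) / sigma) - \
              torch.sigmoid((j - self.mu - 0.5) / sigma)    # (B, K, T)
        alignment = (w * phi).sum(dim=1)                    # (B, T)
        context = torch.bmm(alignment.unsqueeze(1), memory).squeeze(1)
        return context, alignment
```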