NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text to speech synthesis with control over speech variation and style transfer
https://nv-adlr.github.io/Flowtron
Apache License 2.0

Can someone help me out with implementing graves attention layer? #69

Open artificertxj1 opened 4 years ago

artificertxj1 commented 4 years ago

I'm trying to implement Graves (GMM) attention based on the Mozilla TTS repo. Here is a link to a brief discussion of the implementation by the repo maintainer (https://erogol.com/two-methods-for-better-attention-in-tacotron/). The code below is my implementation, adapted to fit Flowtron. When I train it with a single flow, it just doesn't work well. Only the first frame's alignment gets close to the maximum value (which is 0.5 in Graves attention instead of 1.0), and the attention scores for the other frames are really low. Convergence is also slow. Can someone help me figure out which part of the code needs a fix?

import torch
import torch.nn as nn

from flowtron import LinearNorm  # linear layer wrapper defined in this repo's flowtron.py


class GravesAttention(torch.nn.Module):
    """Graves (GMM) attention adapted from the Mozilla TTS implementation."""

    def __init__(self, n_mel_channels=80, n_speaker_dim=128,
                 n_text_channels=512, n_att_channels=256, K=4):
        super(GravesAttention, self).__init__()
        # K is the number of Gaussian mixture components
        self.K = K
        self._mask_value = 1e-8
        self.eps = 1e-5
        self.J = None
        self.N_a = nn.Sequential(
            nn.Linear(n_mel_channels, n_mel_channels, bias=True),
            nn.ReLU(),
            nn.Linear(n_mel_channels, 3 * K, bias=True)
        )
        self.key = LinearNorm(n_text_channels + n_speaker_dim,
                              n_att_channels, bias=False, w_init_gain='tanh')
        self.value = LinearNorm(n_text_channels + n_speaker_dim,
                                n_att_channels, bias=False,
                                w_init_gain='tanh')
        self.init_layers()

    def init_layers(self):
        torch.nn.init.constant_(self.N_a[2].bias[(2 * self.K):(3 * self.K)], 1.)  # bias for the means
        torch.nn.init.constant_(self.N_a[2].bias[self.K:(2 * self.K)], 10)  # bias for the stds

    def init_states(self, inputs):
        # J holds the text-position grid (shifted by 0.5) at which the mixture CDF is evaluated
        if self.J is None or inputs.shape[0] + 1 > self.J.shape[-1]:
            self.J = torch.arange(0, inputs.shape[0] + 2.0).to(inputs.device) + 0.5

    def forward(self, queries, keys, values, mask=None, attn=None):
        self.init_states(keys)  # initialize self.J
        if attn is None:
            keys = self.key(keys).transpose(0, 1)  # B x in_lens x n_att_channels

            values = self.value(values) if hasattr(self, 'value') else values
            values = values.transpose(0, 1)        # B x in_lens x n_att_channels

            gbk_t = self.N_a(queries).transpose(0, 1)  # B x T x 3K
            gbk_t = gbk_t.view(gbk_t.size(0), gbk_t.size(1), -1, self.K)

            # mixture weights, widths and step sizes, each B x T x K
            g_t = gbk_t[:, :, 0, :]
            b_t = gbk_t[:, :, 1, :]
            k_t = gbk_t[:, :, 2, :]

            g_t = torch.nn.functional.dropout(g_t, p=0.5, training=self.training)

            sig_t = torch.nn.functional.softplus(b_t) + self.eps
            k_t = torch.nn.functional.softplus(k_t)
            mu_t = torch.cumsum(k_t, dim=1)  # mu_t = mu_(t-1) + k_t, mu_0 = 0
            g_t = torch.softmax(g_t, dim=-1) + self.eps
            j = self.J[:values.size(1) + 1]

            # mixture of sigmoid-based CDFs evaluated at the text positions j
            phi_t = g_t.unsqueeze(-1) * (1 / (1 + torch.sigmoid(
                (mu_t.unsqueeze(-1) - j) / sig_t.unsqueeze(-1))))

            alpha_t = torch.sum(phi_t, 2)  # sum over mixture components
            # discrete difference of the CDF gives the attention weights
            alpha_t = alpha_t[:, :, 1:] - alpha_t[:, :, :-1]

            alpha_t[alpha_t == 0] = 1e-8

            if mask is not None:
                alpha_t.data.masked_fill_(mask.transpose(1, 2), self._mask_value)
        else:
            # reuse precomputed attention weights, mirroring Flowtron's Attention.forward
            alpha_t = attn
            values = self.value(values)
            values = values.transpose(0, 1)

        # debug print left in from training runs
        print("with_dropout flows2 max, min in alpha_t {} {}".format(
            torch.max(alpha_t), torch.min(alpha_t)))
        output = torch.bmm(alpha_t, values)
        output = output.transpose(1, 2)

        return output, alpha_t
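
For quick debugging, a minimal smoke test along these lines (dummy time-major tensors with illustrative batch and length values; LinearNorm assumed importable from flowtron.py) can confirm the layer at least produces B x T_out x T_in alignments:

# hypothetical smoke test with random inputs, not part of the repo
import torch

B, T_out, T_in = 2, 50, 20
attn_layer = GravesAttention()  # defaults: n_mel_channels=80, n_text_channels=512, n_speaker_dim=128

queries = torch.randn(T_out, B, 80)      # decoder-side mel features, time-major
keys = torch.randn(T_in, B, 512 + 128)   # text encoder outputs concatenated with speaker embedding
values = keys.clone()

output, alpha = attn_layer(queries, keys, values)
print(output.shape)  # torch.Size([2, 256, 50])  -> B x n_att_channels x T_out
print(alpha.shape)   # torch.Size([2, 50, 20])   -> B x T_out x T_in
# with this sigmoid-CDF variant each row of alpha sums to at most ~0.5,
# matching the 0.5 maximum mentioned in the issue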
artificertxj1 commented 3 years ago

@Liujingxiu23 If you use the code in my post, you will see that the layer learns to attend (not in the right way, but it shows a monotonic line) after 70k-80k iterations. The attention line is clear in the first few frames and quickly disappears for the following frames. To achieve location-sensitive attention, you need previous alignment information accumulated through the time steps. The attention used in Flowtron is closer to the idea of attention flow described in the bi-directional attention flow (BiDAF) paper. It's actually memoryless and uses only local information to calculate the alignment score (compare the inputs of the attention layer used in this model with the inputs used in a standard Tacotron decoder attention cell and you will see what I mean).

Also, after spending many hours on the code, I'm more confused by the claim of using flows in TTS. It looks like sampling from a normal distribution and pushing the randomly sampled vector through a flow transformation are not part of standard training. If you want to test the idea of a flow-based TTS model, I suggest you read the Glow-TTS paper. I personally think their paper gives a better demonstration of using flow transformations and a great approach to producing a monotonic alignment in a flow-based model.
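
For reference, here is a minimal sketch of the "previous alignment information accumulated through time steps" idea, in the style of Tacotron 2 location-sensitive attention (the layer sizes and the attn_weights_cat input are illustrative assumptions, not Flowtron's API):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    """Sketch of Tacotron 2 style location-sensitive attention."""

    def __init__(self, query_dim=1024, key_dim=512, attn_dim=128,
                 n_filters=32, kernel_size=31):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.key_layer = nn.Linear(key_dim, attn_dim, bias=False)
        # the location features are computed from previous + cumulative attention weights
        self.location_conv = nn.Conv1d(2, n_filters, kernel_size,
                                       padding=(kernel_size - 1) // 2, bias=False)
        self.location_layer = nn.Linear(n_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, keys, values, attn_weights_cat, mask=None):
        # query: B x query_dim (one decoder step)
        # keys, values: B x T_in x key_dim
        # attn_weights_cat: B x 2 x T_in (previous and cumulative attention weights)
        loc = self.location_conv(attn_weights_cat).transpose(1, 2)  # B x T_in x n_filters
        energies = self.v(torch.tanh(
            self.query_layer(query).unsqueeze(1) +
            self.key_layer(keys) +
            self.location_layer(loc))).squeeze(-1)                  # B x T_in
        if mask is not None:
            energies = energies.masked_fill(mask, -float('inf'))
        attn_weights = F.softmax(energies, dim=1)
        context = torch.bmm(attn_weights.unsqueeze(1), values).squeeze(1)
        return context, attn_weights

The attn_weights_cat input (previous plus cumulative attention weights) is exactly the memory that the memoryless, per-frame attention described above does not have.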

Liujingxiu23 commented 3 years ago

@artificertxj1 Thank you for the reply. I will train for more steps and train with a guided attention loss to see what happens. I will also check Glow-TTS to see how it realizes alignment.

rafaelvalle commented 3 years ago

Flowtron uses Tacotron 1 attention. You can add location sensitive attention (Tacotron 2) to a pre-trained Flowtron model and it will improve attention.

With respect to normalizing flows, think of it as learning a mapping from the data distribution to a known distribution, for example a Gaussian distribution. The mapping is a chain of affine transformations that can be either autoregressive or bi-partite.

Glow-TTS applies a variant of the algorithm proposed in Align-TTS (https://arxiv.org/pdf/2003.01950.pdf) to normalizing flows.
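
As a concrete illustration of the "chain of affine transformations" described above, one affine flow step can be sketched as below (the sign convention for log_s and b and the absence of the conditioning network are simplifying assumptions, not the repo's exact code; in Flowtron, log_s and b come from an autoregressive decoder conditioned on text, speaker and previous mel frames):

import torch

def affine_flow_forward(x, log_s, b):
    # data -> latent direction used during training:
    # z = (x - b) * exp(-log_s); the step contributes -log_s.sum() to log|det J|
    z = (x - b) * torch.exp(-log_s)
    log_det = -log_s.sum()
    return z, log_det

def affine_flow_inverse(z, log_s, b):
    # latent -> data direction used at synthesis time:
    # sample z from a Gaussian and invert the chain step by step
    return z * torch.exp(log_s) + b

Training maximizes the Gaussian log-likelihood of z plus the accumulated log_det terms; synthesis samples z from the Gaussian prior and runs the chain of inverse steps, which is the sampling procedure discussed earlier in the thread.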

artificertxj1 commented 3 years ago

Okay, I think I finally understand the point of the flow transformation (the meaning of log_s and b) now that I have tried rewriting the flow in a sequential manner and noticed the difference between the Tacotron 2 decoder output and the Flowtron output. I will try location-sensitive attention and see if it works better.

rafaelvalle commented 3 years ago

Yes, location-sensitive attention does work better. It should be used to fine-tune the model at the end; otherwise training will take unnecessarily long.

nicemanis commented 3 years ago

@artificertxj1 did you manage to get the Graves attention working with Flowtron?

artificertxj1 commented 3 years ago

> @artificertxj1 did you manage to get the Graves attention working with Flowtron?

Nope, never made it work.