DeepRNN / image_captioning

Tensorflow implementation of "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention"

Attention formula mismatch with the implementation #28

Open HrsPythonix opened 6 years ago

HrsPythonix commented 6 years ago

I found that in the paper, the MLP attention formula is usually described as below:

[image: the paper's additive (MLP) attention formula]

where `v_i` is the i-th feature vector of the CNN feature map and `h_t` is the output of the LSTM.
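
For reference, a reconstruction of that formula (Bahdanau-style additive attention; the weight names `W_va`, `W_ha`, `w_a` are chosen here only to match the discussion below):

```latex
e_{t,i} = w_a^{\top}\tanh\left(W_{va}\, v_i + W_{ha}\, h_t\right),
\qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{L}\exp(e_{t,k})}
```

That is, the tanh is applied to the sum of the two projections, and the softmax over the L feature locations gives the attention weights.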

But in the code, the implementation goes like this:

    def attend(self, contexts, output):
        """ Attention Mechanism. """
        config = self.config
        reshaped_contexts = tf.reshape(contexts, [-1, self.dim_ctx])
        reshaped_contexts = self.nn.dropout(reshaped_contexts)
        output = self.nn.dropout(output)
        if config.num_attend_layers == 1:
            # use 1 fc layer to attend
            logits1 = self.nn.dense(reshaped_contexts,
                                    units = 1,
                                    activation = None,
                                    use_bias = False,
                                    name = 'fc_a')
            logits1 = tf.reshape(logits1, [-1, self.num_ctx])
            logits2 = self.nn.dense(output,
                                    units = self.num_ctx,
                                    activation = None,
                                    use_bias = False,
                                    name = 'fc_b')
            logits = logits1 + logits2
        else:
            # use 2 fc layers to attend
            temp1 = self.nn.dense(reshaped_contexts,
                                  units = config.dim_attend_layer,
                                  activation = tf.tanh,
                                  name = 'fc_1a')
            temp2 = self.nn.dense(output,
                                  units = config.dim_attend_layer,
                                  activation = tf.tanh,
                                  name = 'fc_1b')
            temp2 = tf.tile(tf.expand_dims(temp2, 1), [1, self.num_ctx, 1])
            temp2 = tf.reshape(temp2, [-1, config.dim_attend_layer])
            temp = temp1 + temp2
            temp = self.nn.dropout(temp)
            logits = self.nn.dense(temp,
                                   units = 1,
                                   activation = None,
                                   use_bias = False,
                                   name = 'fc_2')
            logits = tf.reshape(logits, [-1, self.num_ctx])
        alpha = tf.nn.softmax(logits)
        return alpha

Here I only consider the 2-fc branch. I think the formula implemented in the code is `w_a^T (tanh(W_va v_i) + tanh(W_ha h_t))`, which is slightly different from the paper: the paper applies tanh after summing the two projections, while the code applies tanh to each projection separately and then sums. But `tanh(A) + tanh(B) != tanh(A + B)` in general (e.g. `tanh(1) + tanh(1) ≈ 1.52`, whereas `tanh(2) ≈ 0.96`).
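
For comparison, here is a minimal standalone sketch of the paper-style variant (tanh applied after the sum), written with plain TensorFlow ops rather than the repo's `self.nn` helpers; the weight tensors `w_va`, `w_ha`, `w_a` and the shapes are illustrative assumptions, not names from this repo:

```python
import tensorflow as tf

def attend_paper_style(contexts, output, w_va, w_ha, w_a):
    """Additive attention as written in the paper:
        alpha_i = softmax_i( w_a^T tanh(W_va v_i + W_ha h_t) )

    contexts: [batch, num_ctx, dim_ctx]   -- feature vectors v_i
    output:   [batch, dim_hidden]         -- LSTM output h_t
    w_va:     [dim_ctx, dim_attend]
    w_ha:     [dim_hidden, dim_attend]
    w_a:      [dim_attend]
    """
    # Project the image features and the LSTM state into the same attention space.
    proj_ctx = tf.einsum('bnc,ca->bna', contexts, w_va)    # [batch, num_ctx, dim_attend]
    proj_out = tf.expand_dims(tf.matmul(output, w_ha), 1)  # [batch, 1, dim_attend]
    # tanh is applied AFTER the sum of the two projections, unlike the repo's 2-fc branch.
    scores = tf.einsum('bna,a->bn', tf.tanh(proj_ctx + proj_out), w_a)  # [batch, num_ctx]
    return tf.nn.softmax(scores)                           # attention weights alpha
```

The repo's 2-fc branch would instead compute `tf.tanh(proj_ctx) + tf.tanh(proj_out)` before the final projection; whether that difference matters for caption quality in practice is exactly the open question here.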

So I wonder whether this difference could cause any problems. Can anyone help?

Yuli555 commented 5 years ago

@HrsPythonix Hi! Have you solved the problem?