Visual-Behavior / detr-tensorflow

TensorFlow implementation of DETR: Object Detection with Transformers
MIT License

Why not use key_padding_mask in MultiHeadAttention? #12

Closed: qiaoran-dawnlight closed this issue 3 years ago

qiaoran-dawnlight commented 3 years ago

Hi, I noticed you are not using the key_padding_mask in the MultiHeadAttention, which means the mask is never used in the Transformer. My guess is that in the original implementation the mask carries the original image size before padding, so the transformer knows which part is real image and which part is padding. Since you use fixed-size inputs, the transformer doesn't need to worry about padding. But if that's the case, why not keep the mask anyway? Is there a special reason?

 """
    if key_padding_mask is not None:
        attn_output_weights = tf.reshape(attn_output_weights,
                            [batch_size, self.num_heads, target_len, source_len])
        key_padding_mask = tf.expand_dims(key_padding_mask, 1)
        key_padding_mask = tf.expand_dims(key_padding_mask, 2)
        key_padding_mask = tf.tile(key_padding_mask, [1, self.num_heads, target_len, 1])
        #print("before attn_output_weights", attn_output_weights.shape)
        attn_output_weights = tf.where(key_padding_mask,
                                       tf.zeros_like(attn_output_weights) + float('-inf'),
                                       attn_output_weights)
        attn_output_weights = tf.reshape(attn_output_weights,
                            [batch_size * self.num_heads, target_len, source_len])
    """
thibo73800 commented 3 years ago

Hello,

Thanks for the question. A few weeks ago I was about to remove that part, since it is never used by the training pipeline (which uses fixed-size input images). However, as mentioned in https://github.com/Visual-Behavior/detr-tensorflow/issues/10, it might be a good feature to have an alternative training pipeline with all the images padded, as in the original implementation.
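For anyone picking this up, a rough sketch of what the collate step of such a padded pipeline could look like, following the original PyTorch DETR's convention (True = padded pixel). The function name and shapes here are assumptions for illustration, not this repo's API:

    import tensorflow as tf

    def pad_batch(images):
        # Pad a list of [h, w, 3] images to the largest size in the batch and
        # build a boolean mask where True marks padded pixels (the convention
        # used by the original DETR).
        max_h = max(int(img.shape[0]) for img in images)
        max_w = max(int(img.shape[1]) for img in images)
        padded, masks = [], []
        for img in images:
            h, w = int(img.shape[0]), int(img.shape[1])
            padded.append(tf.pad(img, [[0, max_h - h], [0, max_w - w], [0, 0]]))
            valid = tf.pad(tf.ones([h, w]), [[0, max_h - h], [0, max_w - w]])
            masks.append(tf.equal(valid, 0.0))  # True where the image was padded
        return tf.stack(padded), tf.stack(masks)

    images = [tf.random.uniform([480, 640, 3]), tf.random.uniform([600, 500, 3])]
    batch, mask = pad_batch(images)  # batch: [2, 600, 640, 3], mask: [2, 600, 640]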

If the feature turns out to be useful and needs to be implemented, I will go back through that code to check that everything works properly with padded images.

Thibault

qiaoran-dawnlight commented 3 years ago

Okay, thanks for the explanation. So no special reason :) I think variable-size training is actually one of the key features of the original implementation (even if it isn't highlighted in the paper), since the authors deliberately use the mask. Supporting it would make this project more suitable for other datasets and for training from scratch.
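For completeness: in the original implementation the pixel-level mask is also downsampled to the backbone's feature-map resolution before being used as key_padding_mask. A rough sketch of that step, with assumed names and shapes:

    import tensorflow as tf

    def downsample_mask(mask, feature_hw):
        # mask: [batch, H, W] boolean, True = padding, at input resolution.
        # feature_hw: (h, w) of the backbone output, e.g. (H // 32, W // 32)
        # for a ResNet-50 with stride 32.
        m = tf.cast(mask[..., tf.newaxis], tf.float32)        # [b, H, W, 1]
        m = tf.image.resize(m, feature_hw, method='nearest')  # [b, h, w, 1]
        return tf.cast(m[..., 0], tf.bool)

    # The downsampled mask is then flattened to [batch, h * w] and passed as
    # key_padding_mask to the transformer encoder.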