allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0

How to deal with 3-dimensional attention_mask in LongformerSelfAttention #227

Open khang-nguyen2907 opened 2 years ago

khang-nguyen2907 commented 2 years ago

Hi,

I am building a project that needs Longformer. However, my attention_mask has shape [batch, seq_len, seq_len], not the usual [batch, seq_len]. I am confused about how to handle this when I look at these lines of code: https://github.com/allenai/longformer/blob/265314df581f8ec9e4cca98b27914583e8155905/longformer/longformer.py#L80-L91

I understand that when the computation reaches SelfAttention, an attention_mask of shape [batch, seq_len] is extended via [:, None, None, :], but it does not make sense to squeeze() my [batch, seq_len, seq_len] attention_mask like that. I also read the Longformer source code in Hugging Face Transformers and ran it with my attention_mask; it raises an error as well because of the mask's dimensions, from these lines of code: transformers/models/longformer/modeling_longformer.py#L587-L597

# values to pad for attention probs
remove_from_windowed_attention_mask = (attention_mask != 0)[:, :, None, None]

# cast to fp32/fp16 then replace 1's with -inf
float_mask = remove_from_windowed_attention_mask.type_as(query_vectors).masked_fill(
    remove_from_windowed_attention_mask, -10000.0
)
# diagonal mask with zeros everywhere and -inf inplace of padding
diagonal_mask = self._sliding_chunks_query_key_matmul(
    float_mask.new_ones(size=float_mask.size()), float_mask, self.one_sided_attn_window_size
)
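
To make the shape mismatch concrete, here is a minimal sketch of what I observe (toy sizes chosen by me purely for illustration):

import torch

batch, seq_len = 2, 8

# usual 2-D padding mask: [batch, seq_len] -> extended to [batch, 1, 1, seq_len]
mask_2d = torch.ones(batch, seq_len)
print(mask_2d[:, None, None, :].shape)         # torch.Size([2, 1, 1, 8])

# my 3-D mask: [batch, seq_len, seq_len] -> becomes 5-D after the same indexing
mask_3d = torch.ones(batch, seq_len, seq_len)
print((mask_3d != 0)[:, :, None, None].shape)  # torch.Size([2, 8, 1, 1, 8])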

If I use my attention_mask, remove_from_windowed_attention_mask ends up with shape [batch, seq_len, 1, 1, seq_len], and ValueError: too many values to unpack (expected 4) is raised when executing these lines of code: transformers/models/longformer/modeling_longformer.py#L802-L808

def _sliding_chunks_query_key_matmul(self, query: torch.Tensor, key: torch.Tensor, window_overlap: int):
    """
    Matrix multiplication of query and key tensors using with a sliding window attention pattern. This
    implementation splits the input into overlapping chunks of size 2w (e.g. 512 for pretrained Longformer) with an
    overlap of size window_overlap
    """
    batch_size, seq_len, num_heads, head_dim = query.size()
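
With the 3-D mask, float_mask (and the all-ones tensor built from it) is 5-D when it reaches this function, so the very first unpack fails. A minimal reproduction with the same toy sizes as above:

import torch

# stands in for float_mask.new_ones(size=float_mask.size()),
# which has shape [batch, seq_len, 1, 1, seq_len] when my mask is 3-D
query = torch.ones(2, 8, 1, 1, 8)

batch_size, seq_len, num_heads, head_dim = query.size()
# ValueError: too many values to unpack (expected 4)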

In short, in both of these parts of LongformerSelfAttention I run into trouble because of my 3-dimensional attention_mask. I would be grateful if you could help.

Thanks, Khang