Hi,

I am building a project that needs longformer. However, my attention_mask has the shape [batch, seq_len, seq_len], not the usual [batch, seq_len]. I am really confused and do not know how to handle it when I see these lines of code:
https://github.com/allenai/longformer/blob/265314df581f8ec9e4cca98b27914583e8155905/longformer/longformer.py#L80-L91
I know that when the computation reaches SelfAttention, an attention_mask of shape [batch, seq_len] is extended as [:, None, None, :], but it does not make sense to squeeze() my [batch, seq_len, seq_len] attention_mask like this. I also read the Longformer source code in HuggingFace Transformers and ran it with my attention_mask; it also raises an error because of the attention_mask's dimensions, from these lines of code:
transformers/models/longformer/modeling_longformer.py#L587-L597
# values to pad for attention probs
remove_from_windowed_attention_mask = (attention_mask != 0)[:, :, None, None]
# cast to fp32/fp16 then replace 1's with -inf
float_mask = remove_from_windowed_attention_mask.type_as(query_vectors).masked_fill(
    remove_from_windowed_attention_mask, -10000.0
)
# diagonal mask with zeros everywhere and -inf inplace of padding
diagonal_mask = self._sliding_chunks_query_key_matmul(
    float_mask.new_ones(size=float_mask.size()), float_mask, self.one_sided_attn_window_size
)
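To make the shape mismatch concrete, here is a minimal check with plain torch (the batch and sequence sizes are invented; only the indexing mirrors the quoted line):

import torch

batch, seq_len = 2, 8

# usual case: a [batch, seq_len] padding mask
mask_2d = torch.ones(batch, seq_len)
print((mask_2d != 0)[:, :, None, None].shape)  # torch.Size([2, 8, 1, 1])

# my case: a [batch, seq_len, seq_len] pairwise mask gains an extra trailing dim
mask_3d = torch.ones(batch, seq_len, seq_len)
print((mask_3d != 0)[:, :, None, None].shape)  # torch.Size([2, 8, 1, 1, 8])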
If I use my attention_mask, remove_from_windowed_attention_mask ends up with shape [batch, seq_len, 1, 1, seq_len], and ValueError: too many values to unpack (expected 4) is raised when executing these lines of code:
transformers/models/longformer/modeling_longformer.py#L802-L808
def _sliding_chunks_query_key_matmul(self, query: torch.Tensor, key: torch.Tensor, window_overlap: int):
    """
    Matrix multiplication of query and key tensors using with a sliding window attention pattern. This
    implementation splits the input into overlapping chunks of size 2w (e.g. 512 for pretrained Longformer) with an
    overlap of size window_overlap
    """
    batch_size, seq_len, num_heads, head_dim = query.size()
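That unpacking assumes a 4-D query, so any 5-D tensor reaching it fails. A tiny reproduction (shapes invented, standing in for float_mask.new_ones(size=float_mask.size())):

import torch

# float_mask.new_ones(...) inherits the 5-D shape, so query arrives as 5-D
query = torch.ones(2, 8, 1, 1, 8)
try:
    batch_size, seq_len, num_heads, head_dim = query.size()
except ValueError as e:
    print(e)  # too many values to unpack (expected 4)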
In short, in both LongformerSelfAttention implementations, I run into trouble because of my 3-dimensional attention_mask. I would be grateful if you could help.

Thanks, Khang