allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0

Confusion about attention mask for pretraining LongformerForMaskedLM #218

Closed frederikkemarin closed 2 years ago

frederikkemarin commented 2 years ago

Hello, due to differences in the documentation, I have some confusion about the attention_mask input when pretraining LongformerForMaskedLM. According to https://huggingface.co/docs/transformers/model_doc/longformer#transformers.LongformerForMaskedLM, the default when attention_mask = None is local attention (1 everywhere). However, I get very different output logits when I set:

model(input_ids, attention_mask = None)

vs when I set

model(input_ids, attention_mask=torch.ones(input_ids.shape)) # (i.e. ones everywhere for local attention). 

So what is the correct way to pretrain if I want local attention?
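For reference, here is a minimal, self-contained sketch of the comparison I am running. The checkpoint name and the toy input are just assumptions for illustration, not my actual pretraining setup:

import torch
from transformers import LongformerForMaskedLM, LongformerTokenizer

# Hypothetical checkpoint for illustration; my real setup differs.
checkpoint = "allenai/longformer-base-4096"
tokenizer = LongformerTokenizer.from_pretrained(checkpoint)
model = LongformerForMaskedLM.from_pretrained(checkpoint)
model.eval()

input_ids = tokenizer("Some example text.", return_tensors="pt").input_ids

with torch.no_grad():
    # Case 1: no attention mask passed at all
    logits_none = model(input_ids, attention_mask=None).logits
    # Case 2: explicit all-ones mask, i.e. local attention everywhere
    ones_mask = torch.ones(input_ids.shape, dtype=torch.long)
    logits_ones = model(input_ids, attention_mask=ones_mask).logits

# If the docs are right, these should agree up to numerical noise.
print(torch.allclose(logits_none, logits_ones, atol=1e-4))
print((logits_none - logits_ones).abs().max())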