Hello, due to a discrepancy in the documentation I am confused about the attention_mask input when pretraining LongformerForMaskedLM. According to https://huggingface.co/docs/transformers/model_doc/longformer#transformers.LongformerForMaskedLM, the default is local attention (1 everywhere) when attention_mask = None. However, I get very different output logits when I set:
model(input_ids, attention_mask = None)
vs when I set
model(input_ids, attention_mask=torch.ones(input_ids.shape)) # (i.e. ones everywhere for local attention).
So what is the correct way to pretrain if I want local attention?
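For reference, here is a minimal sketch of the comparison I am running (assuming the standard "allenai/longformer-base-4096" checkpoint and an example sentence of my own choosing):

```python
import torch
from transformers import LongformerTokenizer, LongformerForMaskedLM

# Assumption: the public base checkpoint; the same question applies to any Longformer MLM.
checkpoint = "allenai/longformer-base-4096"
tokenizer = LongformerTokenizer.from_pretrained(checkpoint)
model = LongformerForMaskedLM.from_pretrained(checkpoint)
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
input_ids = inputs["input_ids"]

with torch.no_grad():
    # Case 1: no attention mask passed at all.
    logits_none = model(input_ids, attention_mask=None).logits
    # Case 2: explicit all-ones mask, i.e. local attention on every token.
    logits_ones = model(input_ids, attention_mask=torch.ones_like(input_ids)).logits

# If the default (None) really means "local attention everywhere",
# these two should agree closely, but they do not for me.
print(torch.allclose(logits_none, logits_ones, atol=1e-5))
print((logits_none - logits_ones).abs().max())
```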