When training LLMs on a large corpus, I understand that the usual approach is to pack documents into each context window in the following format:

`[doc 1] <sep> [doc 2] <sep> ...`

until the context length is full. However, the intuitive refinement I see is to use something you call `reset_attention_mask`, which you have implemented here, so that attention cannot cross document boundaries.
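To make sure we are talking about the same mechanism, here is a minimal sketch of what I mean (my own illustration in PyTorch, not this repo's implementation; the function name and the `sep_id` convention are hypothetical):

```python
import torch

def packed_causal_mask(token_ids: torch.Tensor, sep_id: int, reset: bool) -> torch.Tensor:
    """Boolean attention mask for one packed sequence (True = may attend).

    reset=False: plain causal mask; tokens attend to every earlier token,
                 even those belonging to previous documents in the pack.
    reset=True:  causal mask intersected with a block-diagonal per-document
                 mask, so attention never crosses a <sep> boundary.
    """
    seq_len = token_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    if not reset:
        return causal
    # Give every position a document index; in this sketch the <sep> token
    # is grouped with the document that follows it (a convention detail).
    doc_id = torch.cumsum((token_ids == sep_id).long(), dim=0)
    same_doc = doc_id.unsqueeze(0) == doc_id.unsqueeze(1)
    return causal & same_doc

# Two packed documents joined by sep_id=0:
ids = torch.tensor([5, 6, 0, 7, 8])
print(packed_causal_mask(ids, sep_id=0, reset=False).int())  # full lower triangle
print(packed_causal_mask(ids, sep_id=0, reset=True).int())   # block-diagonal causal
```

With `reset=False`, the tokens of `[doc 2]` still attend to `[doc 1]`, which is exactly the behaviour I would have expected the flag to turn off by default.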
What I did not expect is to find this attribute set to `False` in most YAML configurations of open models. Examples:
While I understand that there might be some benefits to not masking across documents, I don't trivially see why it should be the default approach. I haven't found much on this topic online; any information would be welcome!