NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] Why is `reset_attention_mask=False` by default? #954

Closed dtamayo-nlp closed 3 weeks ago

dtamayo-nlp commented 2 months ago

Your question

When training LLMs on a large corpus, I understand that the usual approach is to pack documents into each sample in the following format: [doc 1] \<sep> [doc 2] \<sep> ... until the context length is full. However, the intuitive optimization I see is to use what you call `reset_attention_mask`, which you have implemented here.
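
For concreteness, here is a minimal sketch of what I mean by resetting the mask at document boundaries. This is plain PyTorch, not Megatron-LM's actual code; the function name and the `eod_token_id` argument are just illustrative. The usual causal mask is built first, and then attention from tokens of one document to tokens of earlier documents in the same packed sequence is zeroed out.

```python
import torch

def causal_mask_with_document_reset(token_ids: torch.Tensor, eod_token_id: int) -> torch.Tensor:
    """Build a causal attention mask for one packed sequence and, in addition,
    block attention across document boundaries marked by the separator token.

    Returns a (seq_len, seq_len) boolean mask where True means
    "query position i may attend to key position j".
    """
    seq_len = token_ids.size(0)
    # Plain left-to-right causal mask: attend to self and the past only.
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

    # Positions of the document separator in the packed sequence.
    eod_positions = (token_ids == eod_token_id).nonzero(as_tuple=True)[0]

    # For each separator at position p, tokens after p may no longer attend
    # to anything at or before p, so packed documents never see each other.
    for p in eod_positions.tolist():
        mask[p + 1:, : p + 1] = False
    return mask


# Hypothetical packed sample: [doc 1] <sep> [doc 2] <sep> [doc 3], with 0 as the separator id.
tokens = torch.tensor([11, 12, 13, 0, 21, 22, 0, 31, 32])
print(causal_mask_with_document_reset(tokens, eod_token_id=0).int())
```

With `reset_attention_mask=False`, as I understand it, only the plain causal mask above is used, so tokens of a later document can attend to tokens of earlier documents in the same packed sample.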

What I did not expect was to find this attribute set to `False` in most YAML configurations of open models. Examples:

While I understand that there may be some benefits to not masking across documents, I don't see why it should obviously be the default approach. I haven't found much on this topic online; any information would be welcome!