When training LLMs on a large corpus, I understand that the usual approach is to pack documents into each context window in the following format:

`[doc 1] <sep> [doc 2] <sep> ...`

until the context length is full. However, the intuitive refinement I see is to use something you call `reset_attention_mask`, which you have implemented here, so that attention cannot cross document boundaries.
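To make sure we are talking about the same mechanism, here is a minimal sketch of what I mean (my own illustration in PyTorch, not this repo's implementation; the function name and the `sep_id` convention are hypothetical):

```python
import torch

def packed_causal_mask(token_ids: torch.Tensor, sep_id: int, reset: bool) -> torch.Tensor:
    """Boolean attention mask for one packed sequence (True = may attend).

    reset=False: plain causal mask; tokens attend to every earlier token,
                 even those belonging to previous documents in the pack.
    reset=True:  causal mask intersected with a block-diagonal per-document
                 mask, so attention never crosses a <sep> boundary.
    """
    seq_len = token_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    if not reset:
        return causal
    # Give every position a document index; in this sketch the <sep> token
    # is grouped with the document that follows it (a convention detail).
    doc_id = torch.cumsum((token_ids == sep_id).long(), dim=0)
    same_doc = doc_id.unsqueeze(0) == doc_id.unsqueeze(1)
    return causal & same_doc

# Two packed documents joined by sep_id=0:
ids = torch.tensor([5, 6, 0, 7, 8])
print(packed_causal_mask(ids, sep_id=0, reset=False).int())  # full lower triangle
print(packed_causal_mask(ids, sep_id=0, reset=True).int())   # block-diagonal causal
```

With `reset=False`, the tokens of `[doc 2]` still attend to `[doc 1]`, which is exactly the behaviour I would have expected the flag to turn off by default.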
What I did not expect is to find this attribute set to `False` in most YAML configurations of open models. Examples:
While I understand that there might be some benefits to not masking across documents, I don't trivially see why it should be the default approach. I haven't found much on this topic online; any information would be welcome!