Closed: wj210 closed this issue 1 year ago.

Hi, thanks for this piece of work.

I would like to understand why the attention_mask is not used in the fine-tuning code. The forward pass is simply:

    with autocast_ctx:
        c_loss = model(examples, labels, images=imgs)

Would this not affect the attention paid to the padding tokens?

Secondly, would padding to the longest sequence in the batch potentially speed up training and reduce memory usage, rather than padding each input to the maximum length?

Thanks
We use causal attention, and sequences are padded on the right, so word tokens would not attend to padding tokens.
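To make this concrete, here is a minimal PyTorch sketch (not the repository's code; the sequence lengths are made up) showing that, under a causal mask with right padding, no real token can ever attend to a padding position, because all padding sits at later positions and the causal mask already blocks attention to the future. The padding positions still produce outputs, but their labels are typically masked out of the loss, so they do not matter.

```python
import torch

seq_len, n_real = 6, 4  # hypothetical: 4 real tokens followed by 2 right-padding tokens

# Causal mask: entry [i, j] is True if query position i may attend to key position j (j <= i).
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

pad_positions = torch.arange(seq_len) >= n_real   # positions 4 and 5 are padding
real_positions = ~pad_positions

# For every real query position, check whether it could attend to any padding key.
real_to_pad = causal[real_positions][:, pad_positions]
print(real_to_pad.any().item())  # False: real tokens never see padding under causal masking
```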
Would padding to the longest sequence in the batch potentially speed things up and reduce memory usage, rather than padding each input to the max length?
Yes, I think this would improve efficiency somewhat. However, this optimization is currently not a high priority for us. If you're interested, you're welcome to submit a code contribution.
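For anyone who wants to try it, below is a minimal sketch of a collate function that pads each batch only to its own longest sequence instead of a fixed maximum length. The field names ("input_ids", "labels"), the pad id, and the ignore index are assumptions for illustration, not this repository's actual API.

```python
import torch

def collate_pad_to_longest(batch, pad_id=0, label_ignore=-100):
    # Pad to the longest sequence in *this* batch, keeping right padding so
    # causal masking still prevents real tokens from attending to padding.
    max_len = max(ex["input_ids"].size(0) for ex in batch)
    input_ids = torch.full((len(batch), max_len), pad_id, dtype=torch.long)
    labels = torch.full((len(batch), max_len), label_ignore, dtype=torch.long)
    for i, ex in enumerate(batch):
        n = ex["input_ids"].size(0)
        input_ids[i, :n] = ex["input_ids"]
        labels[i, :n] = ex.get("labels", ex["input_ids"])
    return input_ids, labels

# Usage: torch.utils.data.DataLoader(dataset, batch_size=8, collate_fn=collate_pad_to_longest)
```

Sorting or bucketing examples by length before batching would reduce padding even further, at the cost of some shuffling randomness.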
thanks!