Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

attention mask not used #43

Closed wj210 closed 1 year ago

wj210 commented 1 year ago

Hi, thanks for this piece of work.

I would like to understand why the attention_mask is not used in the fine-tuning code:

with autocast_ctx:
    c_loss = model(examples, labels, images=imgs)

Would it not affect the attention to the padding tokens?

Secondly, would padding to the longest sequence in the batch, rather than padding every input to the max length, potentially speed things up and reduce memory usage?

Thanks

ChrisLiu6 commented 1 year ago

why is the attention_mask not used in the fine-tuning code?

We use causal attention, and sequences are padded on the right, so word tokens would not attend to padding tokens.
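To illustrate the point, here is a minimal sketch (not code from LLaMA2-Accessory) showing why right-side padding combined with a causal mask means the real tokens never attend to padding positions, so an explicit attention_mask adds nothing for them:

```python
# Minimal sketch (not the LLaMA2-Accessory code): with right padding and a
# causal mask, every real token only sees positions to its left, all of which
# are also real tokens.
import torch

seq_len, n_real = 6, 4            # 4 real tokens followed by 2 right-side pads
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

for i in range(n_real):
    visible = causal[i].nonzero().flatten().tolist()
    assert max(visible) < n_real   # no padding position is ever attended to
    print(f"token {i} attends to positions {visible}")

# The padding positions themselves do attend to real tokens, but their outputs
# do not matter because their labels are set to an ignore index in the loss.
```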

would padding to the longest sequence in the batch, rather than padding every input to the max length, potentially speed things up and reduce memory usage?

Yes, I think this would improve efficiency somewhat. However, such an optimization is currently not a high priority for us. If you're interested, you're welcome to submit a code contribution.
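For reference, one way such a contribution could look is a collate function that pads each batch only to its own longest sequence. The function below is a hypothetical sketch (collate_pad_to_longest, PAD_ID, and IGNORE_INDEX are illustrative names, not part of the repo):

```python
# Hypothetical collate function sketching per-batch "pad to longest"
# instead of padding every example to a global max_seq_len.
import torch

PAD_ID, IGNORE_INDEX = 0, -100

def collate_pad_to_longest(batch):
    """batch: list of 1-D LongTensors of token ids (right padding assumed)."""
    longest = max(x.size(0) for x in batch)            # per-batch length
    input_ids = torch.full((len(batch), longest), PAD_ID, dtype=torch.long)
    labels = torch.full((len(batch), longest), IGNORE_INDEX, dtype=torch.long)
    for row, ids in enumerate(batch):
        input_ids[row, : ids.size(0)] = ids
        labels[row, : ids.size(0)] = ids               # causal LM: labels mirror inputs
    return input_ids, labels

# Example: the padded width is 5 (the longest in this batch), not a global max.
examples = [torch.tensor([1, 2, 3]), torch.tensor([4, 5, 6, 7, 8])]
inputs, labels = collate_pad_to_longest(examples)
print(inputs.shape)  # torch.Size([2, 5])
```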

wj210 commented 1 year ago

thanks!