OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.

attention_mask may be None for newer versions of LLaMA? #46

Closed Alvant closed 6 months ago

Alvant commented 7 months ago

I updated some libraries recently, and today received an error at this line: https://github.com/OpenGVLab/OmniQuant/blob/main/quantize/omniquant.py#L164

attention_mask_batch = attention_mask.repeat(args.batch_size,1,1,1) if args.deactive_amp else attention_mask.repeat(args.batch_size,1,1,1).float()

which said that attention_mask was None and therefore had no repeat method.

It seems that LLaMA may operate without attention_mask, at least in some cases. Here https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1032 is the code that does not use attention_mask, and exactly this code path is taken when the model's config.json contains no information about the attention implementation (it seems SDPA attention may then be selected automatically instead of the "eager" one: https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py#L1336).
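For reference, a minimal way to check which implementation gets picked (a rough sketch; the model id below is only illustrative, and _attn_implementation / _attn_implementation_internal are private transformers attributes that may change between versions):

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained('meta-llama/Llama-2-7b-hf')  # illustrative model id
print(getattr(config, '_attn_implementation_internal', None))    # None when config.json does not specify an implementation

model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf', config=config)
print(model.config._attn_implementation)  # e.g. 'sdpa' when auto-selected, in which case LLaMA may skip building the 4D mask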

If this really is a bug, I think there are several possible ways to handle it. First, simply set attention_mask_batch to None when attention_mask is None:

if attention_mask is not None:
    attention_mask_batch = attention_mask.repeat(args.batch_size,1,1,1) if args.deactive_amp else attention_mask.repeat(args.batch_size,1,1,1).float()
else:
    attention_mask_batch = None

However, this could change the experiment results for LLaMA models, because previously attention_mask was used. Alternatively, we can make sure that eager attention is used when nothing is specified in config.json (https://github.com/OpenGVLab/OmniQuant/blob/main/models/LMClass.py#L23):

config = AutoConfig.from_pretrained(args.model)

# fall back to eager attention if config.json does not specify an implementation
if getattr(config, '_attn_implementation_internal', None) is None:
    config._attn_implementation_internal = 'eager'
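As a sanity check (a rough sketch, assuming the patched config is then passed to AutoModelForCausalLM.from_pretrained, as LMClass.py appears to do):

model = AutoModelForCausalLM.from_pretrained(args.model, config=config)
assert model.config._attn_implementation == 'eager'  # the eager path builds the 4D attention_mask, so it is not None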

P.S. I would be ready to make a PR with a fix, if there is really a need for one :slightly_smiling_face:

joseph777111 commented 6 months ago

Thanks for the quick fix. You made my night! Merry Christmas!! :)

ChenMnZ commented 6 months ago

@Alvant Thanks for your detailed explanation.

You can make a PR to fix this bug. It would improve the robustness of this codebase.

Thanks in advance for your time!

Alvant commented 6 months ago

@ChenMnZ here is a PR: https://github.com/OpenGVLab/OmniQuant/pull/49. I had to change the attn_implementation logic a bit :) More details are in the PR description.