Thanks for the quick fix. You made my night! Merry Christmas!! :)
@Alvant Thanks for your detailed explanation.
You can make a PR to fix this bug. It would improve the generality of this codebase.
Thanks in advance for your time!
@ChenMnZ here is a PR: https://github.com/OpenGVLab/OmniQuant/pull/49. I had to change the `attn_implementation` logic a bit; more details are in the PR description.
Updated some libs recently, and today got an error at https://github.com/OpenGVLab/OmniQuant/blob/main/quantize/omniquant.py#L164, which said that `attention_mask` was None and therefore had no `repeat` method. It seems that LLaMA may operate without an `attention_mask`, at least in some cases. Here https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1032 is the code path that does not use `attention_mask`, and exactly this code is executed if there is no attention implementation info in the model's config.json (it seems that SDPA attention may be selected automatically instead of the "eager" one: https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py#L1336).
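For what it's worth, the automatic switch is easy to observe on a recent transformers version. This is just a quick check, not code from this repo, and the checkpoint name is only an example:

```python
import torch
from transformers import AutoModelForCausalLM

# Load a LLaMA checkpoint whose config.json does not pin an attention
# implementation; recent transformers versions then pick SDPA on their own.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # example checkpoint, any LLaMA model works
    torch_dtype=torch.float16,
)

# On recent versions this typically prints "sdpa", i.e. the implementation
# that can leave attention_mask as None.
print(model.config._attn_implementation)
```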
If this really is a bug, I think there are several possible things one can do about it. First, just set `attention_mask_batch` to None as well if `attention_mask` is None:
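A minimal sketch of what I mean, assuming the `cache["attention_mask"]` / `attention_mask_batch` logic that currently sits around that line (the exact surrounding code, e.g. the `.float()` / deepspeed handling, may differ):

```python
# quantize/omniquant.py, around L164 (sketch only)
attention_mask = cache["attention_mask"]

if attention_mask is not None:
    # previous behaviour: expand the captured mask to the quantization batch
    attention_mask_batch = attention_mask.repeat(args.batch_size, 1, 1, 1).float()
else:
    # SDPA path: the model ran without a mask, so there is nothing to repeat
    attention_mask_batch = None
```

Passing None further down should be safe, since the decoder layers already accept `attention_mask=None`.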
However, this could change the experiment results for LLaMA models, because previously `attention_mask` was always in use. So, we can also make sure that eager attention is used if nothing is specified in config.json (https://github.com/OpenGVLab/OmniQuant/blob/main/models/LMClass.py#L23):
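Something along these lines, where the surrounding arguments are placeholders for whatever LMClass already passes (`attn_implementation` is a regular `from_pretrained` argument in recent transformers):

```python
import torch
from transformers import AutoModelForCausalLM

# models/LMClass.py, around L23 (sketch only): force the original "eager"
# attention so that attention_mask keeps being materialized as before.
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,           # placeholder: the path LMClass already receives
    torch_dtype=torch.float16,    # placeholder: keep the dtype the class already uses
    attn_implementation="eager",  # prevent the automatic switch to SDPA
)
```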
P.S. I would be ready to make a PR with a fix, if there is really a need for one :slightly_smiling_face: