casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

[BUG] Quantizing GPT NeoX raises an error #322

Open kevin3314 opened 7 months ago

kevin3314 commented 7 months ago

First of all, thank you for the great work.

System info

autoawq==0.1.8

Details

While trying to quantize a GPT NeoX model, I encountered the error below.

>>> from awq import AutoAWQForCausalLM
>>> from transformers import AutoTokenizer
>>> model_path = 'EleutherAI/gpt-neox-20b'
>>> quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
>>> model = AutoAWQForCausalLM.from_pretrained(model_path)
>>> tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
>>> model.quantize(tokenizer, quant_config=quant_config)
Generating validation split: 214670 examples [00:09, 22942.50 examples/s]
AWQ:   0%|                                                                                                                                                          | 0/44 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/awq/models/base.py", line 93, in quantize
    quantizer.quantize()
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 95, in quantize
    input_feat = self._get_input_feat(self.modules[i], named_linears)
  File "/usr/local/lib/python3.10/dist-packages/awq/quantize/quantizer.py", line 406, in _get_input_feat
    self.inps = layer(self.inps, **module_kwargs)[0]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 690, in forward
    attention_layer_outputs = self.attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 213, in forward
    attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 287, in _attn
    attn_scores = attn_scores + attention_mask
RuntimeError: The size of tensor a (512) must match the size of tensor b (60) at non-singleton dimension 2

After digging into the code, it turned out that this line is what breaks. The GPT NeoX model transforms the attention mask of shape (batch_size, seq_len) into (batch_size, 1, 1, seq_len) before it reaches GPTNeoXLayer. During quantization, the input to GPTNeoXModel does not include attention_mask, so this transformation never happens. Later, an attention mask whose shape is not compatible with GPTNeoXLayer is added in the line above, passed to the layer, and the error is raised.
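To illustrate, here is a rough sketch in plain PyTorch (illustrative only, not the actual transformers or AutoAWQ code) of the expansion that GPTNeoXModel normally applies before its layers:

```python
import torch

def expand_padding_mask(attention_mask_2d: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    """(batch_size, seq_len) padding mask -> additive (batch_size, 1, 1, seq_len) mask."""
    expanded = attention_mask_2d[:, None, None, :].to(dtype)
    # 0.0 where a token may be attended to, a large negative value where it is masked,
    # so the result can simply be added onto the attention scores.
    return (1.0 - expanded) * torch.finfo(dtype).min

batch_size, num_heads, seq_len = 1, 64, 512
attn_scores = torch.randn(batch_size, num_heads, seq_len, seq_len)  # (b, h, q_len, k_len)
mask_2d = torch.ones(batch_size, seq_len)  # the un-expanded mask the layer receives during quantization

# With the expansion, the addition inside _attn broadcasts cleanly;
# without it, the shapes are incompatible and a RuntimeError like the one above is raised.
attn_scores = attn_scores + expand_padding_mask(mask_2d, attn_scores.dtype)
```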

I think prepare_inputs_for_generation should be called before feeding inputs to modules[0], since that is what happens in model.generate() in transformers.
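A rough sketch of the idea, run standalone outside the quantizer (only prepare_inputs_for_generation is the actual transformers API here; everything else is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "EleutherAI/gpt-neox-20b"  # same checkpoint as above; any GPT NeoX model shows the idea
tokenizer = AutoTokenizer.from_pretrained(model_path)
hf_model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)

enc = tokenizer("a calibration sample", return_tensors="pt")

# Build the forward kwargs the same way model.generate() does, so attention_mask
# actually reaches GPTNeoXModel.forward and is expanded to 4D before the first
# decoder layer (modules[0]) runs.
model_inputs = hf_model.prepare_inputs_for_generation(
    enc["input_ids"], attention_mask=enc["attention_mask"]
)
with torch.no_grad():
    outputs = hf_model(**model_inputs)
```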

TianjinYellow commented 7 months ago

Hi kevin, I encountered the same problem. Did you solve it?

casper-hansen commented 7 months ago

This kind of transient issue has been popping up ever since transformers 4.36 was released. Unfortunately, the code since transformers 4.36 makes it hard to handle these issues around input arguments in general, and for now I am not sure how to solve these errors generically. One workaround is to pip install transformers<4.36.0 and find an AutoAWQ version that works with it.

CC @younesbelkada

codelion commented 7 months ago

I had the same issue today; installing transformers 4.35.2 seems to have worked.