casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

Llama-3 support #450

Closed · maziyarpanahi closed this 2 months ago

maziyarpanahi commented 2 months ago

I am not able to quantize these new Llama-3 models:

AWQ:   3%|▊         | 1/32 [00:34<18:04, 34.98s/it]
Traceback (most recent call last):
  File "/home/maziyar/anaconda3/envs/autoawq/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/maziyar/anaconda3/envs/autoawq/lib/python3.10/site-packages/awq/models/base.py", line 177, in quantize
    self.quantizer.quantize()
  File "/home/maziyar/anaconda3/envs/autoawq/lib/python3.10/site-packages/awq/quantize/quantizer.py", line 147, in quantize
    input_feat = self._get_input_feat(self.modules[i], named_linears)
  File "/home/maziyar/anaconda3/envs/autoawq/lib/python3.10/site-packages/awq/quantize/quantizer.py", line 535, in _get_input_feat
    self.inps = layer(self.inps, **module_kwargs)[0]
  File "/home/maziyar/anaconda3/envs/autoawq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/maziyar/anaconda3/envs/autoawq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/maziyar/anaconda3/envs/autoawq/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 740, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/maziyar/anaconda3/envs/autoawq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/maziyar/anaconda3/envs/autoawq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/maziyar/anaconda3/envs/autoawq/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 662, in forward
    causal_mask = causal_mask[:, :, cache_position, : key_states.shape[-2]]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)
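
(The failing line indexes causal_mask with cache_position; when the model is sharded across several GPUs, those two tensors can end up on different devices, which is enough to trigger this RuntimeError. A minimal two-GPU reproduction of the same error, assuming at least two CUDA devices are visible:)

import torch

mask = torch.zeros(1, 1, 8, 8, device="cuda:0")   # stands in for causal_mask
positions = torch.arange(8, device="cuda:1")       # stands in for cache_position
mask[:, :, positions, :]                           # RuntimeError: indices should be either on cpu or on the same device as the indexed tensor
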
catid commented 2 months ago
CUDA_VISIBLE_DEVICES=0 python quantize.py
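
The same effect can be achieved from inside the script, as long as the variable is set before torch initializes CUDA (a minimal sketch, not tied to any particular quantize script):

# Restrict the process to a single GPU; must run before torch touches CUDA.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # expected output: 1
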
casper-hansen commented 2 months ago

I quantized Llama 3 8B on a single 4090. Additionally, Llama 3 70B was quantized with multiple 48 GB GPUs. I’m not sure how to reproduce this, as I didn’t experience the same error.

casper-hansen commented 2 months ago

Closing this as Llama 3 is definitely already supported. Using examples/quantize.py worked without any modifications on the first try for both sizes of the model.

https://huggingface.co/casperhansen/llama-3-8b-instruct-awq
https://huggingface.co/casperhansen/llama-3-70b-instruct-awq
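
For anyone hitting this, the flow in examples/quantize.py is roughly the sketch below (the model id and quant_config values here are illustrative, not copied verbatim from the script):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed model id
quant_path = "llama-3-8b-instruct-awq"

# Typical 4-bit AWQ settings.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and tokenizer, run AWQ quantization, and save the checkpoint.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)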

maziyarpanahi commented 2 months ago

It's not a question of VRAM; mine failed on 4x A100 80GB. Usually, quantizing models via AWQ uses multiple GPUs, and it's quick this way. I am not sure CUDA_VISIBLE_DEVICES=0 would be as fast, but it's a workaround. Thank you.

PS: I use AWQ via Hugging Face Transformers, so it is possible there is something in there too.