casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

error when quantizing my finetuned 405b model using autoawq #571

Open Atomheart-Father opened 1 month ago

Atomheart-Father commented 1 month ago

Package versions:

- AutoAWQ: 0.2.5+cu118
- torch: 2.3.1+cu118
- transformers: 4.43.3

I was trying to quantize my finetuned Llama 3.1 405B (bf16) model to 4-bit using AutoAWQ, following the instructions on the Hugging Face page https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4. The code I used was attached as an image.
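For context, a minimal sketch of the standard AutoAWQ quantization flow from that model card (the actual script was only posted as an image, so the paths and quant_config values below are placeholders, not the exact code):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/path/to/finetuned-llama-3.1-405b"   # placeholder
quant_path = "/path/to/output-awq-int4"            # placeholder
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the bf16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run AWQ calibration + quantization (this is the call that raises the error below)
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized checkpoint
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```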

Exception:

```
Loading checkpoint shards: 100%|██████████| 191/191 [00:18<00:00, 10.47it/s]
start quant
/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by promote_options='default'.
  table = cls._concat_blocks(blocks, axis=0)
Traceback (most recent call last):
  File "/home/lantu_mp/common/Firefly/script/awq_quan.py", line 21, in <module>
    model.quantize(tokenizer, quant_config=quant_config)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/awq/models/base.py", line 170, in quantize
    self.quantizer = AwqQuantizer(
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/awq/quantize/quantizer.py", line 61, in __init__
    self.modules, self.module_kwargs, self.inps = self.init_quant()
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/awq/quantize/quantizer.py", line 482, in init_quant
    self.model(samples.to(next(self.model.parameters()).device))
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 1141, in forward
    outputs = self.model(
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 920, in forward
    position_embeddings = self.rotary_emb(hidden_states, position_ids)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 153, in forward
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
```

I have enough CPU RAM and 8 A800 GPUs, which, according to the model card page, should be enough to quantize this model.

Atomheart-Father commented 1 month ago

I posted this problem on the hugging-quants discussion page and they recommended that I open an issue here: https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4/discussions/13

casper-hansen commented 1 month ago

Try using the device_map argument when creating the model. Recently, HF made some changes to loading that are causing this issue.
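For example, roughly like this (a minimal sketch; whether "auto" is the right value depends on your setup, and you may need an explicit map as discussed below):

```python
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
    device_map="auto",   # or an explicit per-layer map
)
```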

Atomheart-Father commented 1 month ago

> Try using the device_map argument when creating the model. Recently, HF made some changes to loading that are causing this issue.

I only have 8 A800s, which is not enough to load this model, so I cannot use device_map="auto". If I set this argument to "cpu", the same exception occurs again. I want to know how I can quantize this model with a single A800, as you mention on the Hugging Face page.

casper-hansen commented 1 month ago

I would encourage you to look into how you can effectively use accelerate since AutoAWQ relies on this library to load transformers models. Specifically, you can design device_map for large models by specifying exactly where each layer should be loaded. See the bottom of Designing a device map:

https://huggingface.co/docs/accelerate/v0.33.0/en/concept_guides/big_model_inference#designing-a-device-map

You should be able to inspect the current device_map like this:

```python
AutoAWQForCausalLM.from_pretrained(...).model.hf_device_map
```

Additionally, I would encourage you to experiment with loading such large models using AutoModelForCausalLM.from_pretrained and running inference, because the AutoAWQ from_pretrained just wraps this call to load the model for quantization.
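As a rough illustration of a hand-written map (the module names follow the Llama layout in recent transformers, but the split across GPUs and CPU here is made up, not a tested 405B layout), the key point is that each decoder layer is pinned to exactly one device:

```python
# Illustrative only: spread decoder layers over the 8 GPUs, spill the rest to CPU.
num_layers = 126  # Llama 3.1 405B decoder layers
device_map = {
    "model.embed_tokens": 0,
    "model.rotary_emb": 0,
    **{f"model.layers.{i}": i // 14 for i in range(112)},          # 14 layers per GPU, GPUs 0-7
    **{f"model.layers.{i}": "cpu" for i in range(112, num_layers)},  # remaining layers on CPU
    "model.norm": "cpu",
    "lm_head": "cpu",
}

model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
    device_map=device_map,
)
```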

Atomheart-Father commented 1 month ago

> I would encourage you to look into how you can effectively use accelerate since AutoAWQ relies on this library to load transformers models. Specifically, you can design device_map for large models by specifying exactly where each layer should be loaded. See the bottom of Designing a device map:
>
> https://huggingface.co/docs/accelerate/v0.33.0/en/concept_guides/big_model_inference#designing-a-device-map
>
> You should be able to inspect the current device_map like this:
>
> AutoAWQForCausalLM.from_pretrained(...).model.hf_device_map
>
> Additionally, I would encourage you to experiment with loading such large models using AutoModelForCausalLM.from_pretrained and running inference, because the AutoAWQ from_pretrained just wraps this call to load the model for quantization.

But the 405B model needs more than 800 GB to load, and I only have 480 GB of VRAM. Do you mean I can avoid this error by manually loading some layers onto the GPUs and the other layers onto the CPU?

casper-hansen commented 1 month ago

The reason you are running into OOM is that layers should be on the same device, but device_map=auto or device_map=None loads the layers in such a way that a single layer can have multiple devices (e.g. CPU and cuda:0). To prevent this, you can specify a device_map that loads as many layers onto GPUs as possible and the rest onto CPU. That way, a single layer only has a single device which should avoid the error you are seeing above.

So to be clear, 480GB VRAM is fine if you have 400GB of system RAM.
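One way to build such a map without writing it out by hand is to let accelerate infer it from an empty model while forbidding splits inside a decoder layer. This is only a sketch: the memory budgets below are illustrative and should leave headroom for calibration activations.

```python
from accelerate import infer_auto_device_map, init_empty_weights
from awq import AutoAWQForCausalLM
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model structure without allocating any weights
config = AutoConfig.from_pretrained(model_path)
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Fill GPUs first, spill the rest to CPU, and never split a decoder layer across devices
device_map = infer_auto_device_map(
    empty_model,
    max_memory={**{i: "70GiB" for i in range(8)}, "cpu": "400GiB"},
    no_split_module_classes=["LlamaDecoderLayer"],
)

model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
    device_map=device_map,
)
```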

Atomheart-Father commented 1 month ago

I wanted to inspect my current device_map and tried to specify the loading manually.

When I tried:

```python
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
    device_map='cpu'
)
print(model.model.hf_device_map)
```

it returned:

```
Loading checkpoint shards: 100%|██████████| 191/191 [00:19<00:00, 9.70it/s]
{'': device(type='cpu')}
```

Then I removed the device_map argument and tried to run it again:

```python
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False
)
print(model.model.hf_device_map)
```

and got this exception:

```
Traceback (most recent call last):
  File "/home/lantu_mp/common/Firefly/script/awq_quan.py", line 17, in <module>
    print(model.model.hf_device_map)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LlamaForCausalLM' object has no attribute 'hf_device_map'
```

It didn't show a per-layer mapping like the example in the guide above.

So I want to know: are the layer names the same as in the model.safetensors.index.json file?
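A quick way to compare the two namespaces (a sketch that assumes the index file sits next to the weights in model_path):

```python
import json
import os

# Parameter names stored in the checkpoint index,
# e.g. "model.layers.0.self_attn.q_proj.weight"
with open(os.path.join(model_path, "model.safetensors.index.json")) as f:
    weight_names = set(json.load(f)["weight_map"])

# Module names on the loaded model, e.g. "model.layers.0.self_attn.q_proj";
# a device_map is keyed on module names (or prefixes of them), not weight names
module_names = {name for name, _ in model.model.named_modules() if name}

print(sorted(weight_names)[:5])
print(sorted(module_names)[:5])
```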