Open Atomheart-Father opened 1 month ago
I posted this problem on the hugging-quants discussion page and they recommended that I open an issue here: https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4/discussions/13
Try to use the device_map argument when creating the model. Recently, HF made some changes to loading that are causing this issue.
I only have 8 A800 GPUs, which is not enough to load this model, so I cannot use device_map='auto'. If I set this argument to 'cpu', the exception occurs again. I want to know how I can quantize this model with a single A800, as mentioned on the Hugging Face page.
I would encourage you to look into how you can effectively use accelerate, since AutoAWQ relies on this library to load transformers models. Specifically, you can design a device_map for large models by specifying exactly where each layer should be loaded; see the bottom of "Designing a device map".

You should be able to inspect the current device_map like this:

AutoAWQForCausalLM.from_pretrained(...).model.hf_device_map

Additionally, I would also encourage you to experiment with loading such large models using AutoModelForCausalLM.from_pretrained and running inference, because the AutoAWQ from_pretrained just wraps this call to load the model in for quantization.
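As a rough starting point, such a device_map can be derived with accelerate's infer_auto_device_map. The sketch below is illustrative only; the model path, GPU count, and memory budgets are assumptions that need adjusting to the actual hardware:

# Illustrative sketch: build a device_map that keeps each decoder layer on a
# single device, spilling whatever does not fit on the GPUs to CPU RAM.
# The path and memory budgets below are assumptions, not values from this issue.
from accelerate import init_empty_weights, infer_auto_device_map
from transformers import AutoConfig, AutoModelForCausalLM

model_path = "/path/to/llama-3.1-405b"  # placeholder

config = AutoConfig.from_pretrained(model_path)
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Leave headroom on each GPU so accelerate assigns the remaining layers to CPU
# instead of splitting a single layer across devices.
max_memory = {i: "70GiB" for i in range(8)}
max_memory["cpu"] = "400GiB"

device_map = infer_auto_device_map(
    empty_model,
    max_memory=max_memory,
    no_split_module_classes=["LlamaDecoderLayer"],
)
print(device_map)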
But the 405B model needs more than 800G of VRAM to load, and I only have 480G. Do you mean that I can avoid this error by manually loading some layers onto the GPUs and the other layers onto the CPU?
The reason you are running into OOM is that layers should be on the same device, but device_map=auto or device_map=None loads the layers in such a way that a single layer can have multiple devices (e.g. CPU and cuda:0). To prevent this, you can specify a device_map that loads as many layers onto GPUs as possible and the rest onto CPU. That way, a single layer only has a single device, which should avoid the error you are seeing above.
So to be clear, 480GB VRAM is fine if you have 400GB of system RAM.
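A minimal sketch of what that could look like in practice, assuming a device_map dict like the one built above (AutoAWQ forwards these keyword arguments to transformers, as the device_map='cpu' run further down also suggests):

# Minimal sketch (unverified against this exact setup): pass an explicit
# per-module device_map so every layer ends up on exactly one device.
from awq import AutoAWQForCausalLM

model_path = "/path/to/llama-3.1-405b"  # placeholder

model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
    device_map=device_map,  # dict from infer_auto_device_map: GPUs first, then "cpu"
)
print(model.model.hf_device_map)  # should now list one device per module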
I wanted to inspect my current device_map and tried to manually specify the loading.
When I tried
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
    device_map='cpu'
)
print(model.model.hf_device_map)
It returned
Loading checkpoint shards: 100%|██████████| 191/191 [00:19<00:00, 9.70it/s]
{'': device(type='cpu')}
Then I removed the device_map argument and ran it again:
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False
)
print(model.model.hf_device_map)
and got this exception:
Traceback (most recent call last):
  File "/home/lantu_mp/common/Firefly/script/awq_quan.py", line 17, in <module>
    print(model.model.hf_device_map)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LlamaForCausalLM' object has no attribute 'hf_device_map'
It didn't show the layers individually. So I want to know: are the layer names the same as in the model.safetensors.index.json file?
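For what it's worth, transformers only attaches hf_device_map when a device_map is actually passed to from_pretrained, which is why the attribute is missing in the second run. Also, device_map keys are module names (e.g. model.layers.0), whereas model.safetensors.index.json lists full parameter names (e.g. model.layers.0.self_attn.q_proj.weight). A small sketch, assuming a standard HF sharded checkpoint, of deriving the module prefixes from the index file:

# Sketch (assumes the standard HF sharded checkpoint layout): list the module
# prefixes a device_map would use, derived from the index file.
import json

with open("model.safetensors.index.json") as f:  # path relative to the checkpoint dir
    weight_map = json.load(f)["weight_map"]

def module_prefix(param_name: str) -> str:
    # "model.layers.0.self_attn.q_proj.weight" -> "model.layers.0"
    parts = param_name.split(".")
    if parts[:2] == ["model", "layers"]:
        return ".".join(parts[:3])
    # "model.embed_tokens.weight" -> "model.embed_tokens", "lm_head.weight" -> "lm_head"
    return ".".join(parts[:-1])

prefixes = sorted({module_prefix(name) for name in weight_map})
print(prefixes)

# A hand-written device_map then maps these prefixes to devices, e.g.
# {"model.embed_tokens": 0, "model.layers.0": 0, ..., "model.norm": "cpu", "lm_head": "cpu"}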
Package versions:
AutoAWQ: 0.2.5+cu118
torch: 2.3.1+cu118
transformers: 4.43.3
I was trying to quantize my finetuned Llama 3.1 405B (bf16) model to 4-bit using AutoAWQ, following the instructions on https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4. The code I used:
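(The snippet below is only a sketch following that model card's recipe; the paths are placeholders and it may differ from the exact script that produced the traceback.)

# Sketch of the model-card-style AWQ quantization recipe; paths are placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/path/to/finetuned-llama-3.1-405b"   # placeholder
quant_path = "/path/to/llama-3.1-405b-awq-int4"    # placeholder
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

print("start quant")
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)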
Exception:
Loading checkpoint shards: 100%|██████████| 191/191 [00:18<00:00, 10.47it/s]
start quant
/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by promote_options='default'.
  table = cls._concat_blocks(blocks, axis=0)
Traceback (most recent call last):
  File "/home/lantu_mp/common/Firefly/script/awq_quan.py", line 21, in <module>
    model.quantize(tokenizer, quant_config=quant_config)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/awq/models/base.py", line 170, in quantize
    self.quantizer = AwqQuantizer(
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/awq/quantize/quantizer.py", line 61, in __init__
    self.modules, self.module_kwargs, self.inps = self.init_quant()
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/awq/quantize/quantizer.py", line 482, in init_quant
    self.model(samples.to(next(self.model.parameters()).device))
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 1141, in forward
    outputs = self.model(
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 920, in forward
    position_embeddings = self.rotary_emb(hidden_states, position_ids)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/lantu_mp/.conda/envs/py3.9/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 153, in forward
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
I have enough CPU RAM and 8 A800 GPUs, which, according to the model card page, should be enough to quantize this model.