Hi, I am working on some quantization with AWQ. Quantizing Llama with AutoAWQ works fine, but I ran into a problem when quantizing Falcon 40B. The quantization code is the same for Llama and Falcon, following the example provided in the README. The issue seems to occur with Falcon at inference time, when loading the quantized model.
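For reference, my script is essentially the README quantization example; roughly this (the Falcon paths and the exact quant_config values shown here are placeholders for what I actually ran):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Placeholder paths; the same script is used for both Llama and Falcon.
model_path = "tiiuae/falcon-40b"
quant_path = "falcon-40b-awq"
quant_config = {"zero_point": True, "q_group_size": 64, "w_bit": 4}

# Load the FP16 model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize with AWQ and save the quantized checkpoint
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# Reload for inference -- this is the step that fails for Falcon (see traceback below)
model = AutoAWQForCausalLM.from_quantized(quant_path)
```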
Replacing layers...: 100%|████████████████████████████████████████████████████████████████████████████| 60/60 [00:03<00:00, 16.93it/s]
Traceback (most recent call last):
  File "test.py", line 59, in <module>
    model = AutoAWQForCausalLM.from_quantized(model_path)
  File "/root/miniconda3/lib/python3.8/site-packages/awq/models/auto.py", line 52, in from_quantized
    return AWQ_CAUSAL_LM_MODEL_MAP[model_type].from_quantized(
  File "/root/miniconda3/lib/python3.8/site-packages/awq/models/base.py", line 171, in from_quantized
    load_checkpoint_and_dispatch(
  File "/root/miniconda3/lib/python3.8/site-packages/accelerate/big_modeling.py", line 556, in load_checkpoint_and_dispatch
    return dispatch_model(
  File "/root/miniconda3/lib/python3.8/site-packages/accelerate/big_modeling.py", line 396, in dispatch_model
    attach_align_device_hook_on_blocks(
  File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 547, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 547, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 547, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 517, in attach_align_device_hook_on_blocks
    add_hook_to_module(module, hook)
  File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 156, in add_hook_to_module
    module = hook.init_hook(module)
  File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 254, in init_hook
    set_module_tensor_to_device(module, name, self.execution_device)
  File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 281, in set_module_tensor_to_device
    raise ValueError(f"{tensor_name} is on the meta device, we need a `value` to put in on {device}.")
ValueError: scales is on the meta device, we need a `value` to put in on 0.