casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

Error when quantizing the Qwen2-7B model #574

Open XiaoYu2022 opened 1 month ago

XiaoYu2022 commented 1 month ago

When I quantized the (non-fine-tuned) Qwen2-7B model with the quantization code below, I got the following error.

Quantization code

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch
model_path = '/user/Qwen2-7B'
quant_path = '/user/Qwen2-7B-4bit'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
# Modified per https://github.com/casper-hansen/AutoAWQ/issues/498 by adding torch_dtype=torch.bfloat16, but the error persisted, so I removed it
model = AutoAWQForCausalLM.from_pretrained(model_path)
# Save the model weights in SafeTensors format for storage safety and loading efficiency
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, safetensors=True)

# Quantize
# Does not include a calibration process
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
# shard_size="4GB" splits the quantized model into shards of at most 4 GB each, which simplifies storage and transfer and keeps every file under single-file size limits of the filesystem
model.save_quantized(quant_path, safetensors=True, shard_size="4GB")
tokenizer.save_pretrained(quant_path)
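
For reference, the bfloat16 variant from issue #498 that the comment above refers to only changes the loading line. Treat this as the variant I tried, not a confirmed fix; it did not resolve the error for me:

from awq import AutoAWQForCausalLM
import torch

# Variant from https://github.com/casper-hansen/AutoAWQ/issues/498: load the
# weights in bfloat16 instead of the default dtype, since fp16 overflow during
# the scale search is a commonly cited cause of this failure. Did not help here.
model = AutoAWQForCausalLM.from_pretrained(
    '/user/Qwen2-7B',
    torch_dtype=torch.bfloat16,
)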

Error

/AutoAWQ/awq/modules/linear/gemv_fast.py:10: UserWarning: AutoAWQ could not load GEMVFast kernels extension. Details: No module named 'awq_v2_ext'
  warnings.warn(f"AutoAWQ could not load GEMVFast kernels extension. Details: {ex}")

Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.07it/s]
Using the latest cached version of the dataset since mit-han-lab/pile-val-backup couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /root/.cache/huggingface/datasets/mit-han-lab___pile-val-backup/default/0.0.0/2f5e46ae6a69cf0dce4b12f78241c408936ca0e4 (last modified on Wed Jul 31 09:21:55 2024).
Token indices sequence length is longer than the specified maximum sequence length for this model (57053 > 32768). Running this sequence through the model will result in indexing errors
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)

AWQ:   0%|          | 0/28 [00:00<?, ?it/s]
AWQ:  96%|█████████▋| 27/28 [42:50<01:35, 95.22s/it]
Traceback (most recent call last):
  File "/user/AutoAWQ/Quantalize4bit.py", line 15, in <module>
    # Does not include a calibration process
  File "/root/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/user/AutoAWQ/awq/models/base.py", line 231, in quantize
    self.quantizer.quantize()
  File "/user/AutoAWQ/awq/quantize/quantizer.py", line 166, in quantize
    scales_list = [
  File "/user/AutoAWQ/awq/quantize/quantizer.py", line 167, in <listcomp>
    self._search_best_scale(self.modules[i], **layer)
  File "/root/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/user/AutoAWQ/awq/quantize/quantizer.py", line 330, in _search_best_scale
    best_scales = self._compute_best_scale(
  File "/user/AutoAWQ/awq/quantize/quantizer.py", line 409, in _compute_best_scale
    raise Exception
Exception
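
For context, and this is my reading of the AutoAWQ source rather than anything documented: _compute_best_scale grid-searches a per-layer scaling ratio and keeps the candidate with the lowest loss; if every candidate loss is inf/NaN (typically from fp16 overflow during calibration), no best ratio is ever recorded and the bare raise Exception above fires. A simplified, illustrative sketch of that failure mode (not the actual source):

import math

def compute_best_scale_sketch(losses):
    # Keep the ratio index with the lowest finite loss; inf/NaN losses
    # never win the `<` comparison, so best_ratio stays at -1.
    best_ratio, best_error = -1, math.inf
    for i, loss in enumerate(losses):
        if loss < best_error:
            best_error, best_ratio = loss, i
    if best_ratio == -1:
        # Corresponds to the bare `raise Exception` in the traceback above
        raise Exception
    return best_ratio

try:
    compute_best_scale_sketch([float("nan")] * 20)
except Exception:
    print("no finite loss found -> bare Exception, as in the run above")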

I kept searching through the issues and came to suspect a version mismatch, so I downloaded and installed the wheel built for CUDA 11.8 (autoawq-0.2.6-cp310-cp310-linux_x86_64.whl; my driver supports CUDA 11.7, and my Python version is 3.10) and re-ran the quantization code, but the progress bar has stayed at 0%.
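
A quick way to check whether the installed PyTorch build and the driver actually line up (standard PyTorch calls, nothing AutoAWQ-specific):

import torch

print(torch.__version__)          # the installed PyTorch build
print(torch.version.cuda)         # CUDA toolkit the wheel was compiled against
print(torch.cuda.is_available())  # False when the driver is older than that toolkit

A cu118 wheel needs a driver that supports CUDA 11.8; with a driver capped at 11.7, torch.cuda.is_available() returns False, which matches the warning in the log below.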

/AutoAWQ/awq/modules/linear/exllama.py:12: UserWarning: AutoAWQ could not load ExLlama kernels extension. Details: libcudart.so.11.0: cannot open shared object file: No such file or directory
  warnings.warn(f"AutoAWQ could not load ExLlama kernels extension. Details: {ex}")
/AutoAWQ/awq/modules/linear/exllamav2.py:13: UserWarning: AutoAWQ could not load ExLlamaV2 kernels extension. Details: libcudart.so.11.0: cannot open shared object file: No such file or directory
  warnings.warn(f"AutoAWQ could not load ExLlamaV2 kernels extension. Details: {ex}")
/AutoAWQ/awq/modules/linear/gemm.py:14: UserWarning: AutoAWQ could not load GEMM kernels extension. Details: libcudart.so.11.0: cannot open shared object file: No such file or directory
  warnings.warn(f"AutoAWQ could not load GEMM kernels extension. Details: {ex}")
/AutoAWQ/awq/modules/linear/gemv.py:11: UserWarning: AutoAWQ could not load GEMV kernels extension. Details: libcudart.so.11.0: cannot open shared object file: No such file or directory
  warnings.warn(f"AutoAWQ could not load GEMV kernels extension. Details: {ex}")
/AutoAWQ/awq/modules/linear/gemv_fast.py:10: UserWarning: AutoAWQ could not load GEMVFast kernels extension. Details: No module named 'awq_v2_ext'
  warnings.warn(f"AutoAWQ could not load GEMVFast kernels extension. Details: {ex}")
/root/anaconda3/envs/qwen/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/root/anaconda3/envs/qwen/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.02it/s]
Using the latest cached version of the dataset since mit-han-lab/pile-val-backup couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /root/.cache/huggingface/datasets/mit-han-lab___pile-val-backup/default/0.0.0/2f5e46ae6a69cf0dce4b12f78241c408936ca0e4 (last modified on Wed Jul 31 09:21:55 2024).
Token indices sequence length is longer than the specified maximum sequence length for this model (57053 > 32768). Running this sequence through the model will result in indexing errors
/root/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/cuda/__init__.py:118: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11070). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
AWQ:   0%|                                                                                                                                                    | 0/28 [00:00<?, ?it/s]
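
My guess (an assumption, not something I have confirmed): with torch.cuda.is_available() returning False, the calibration forward passes fall back to CPU, which would explain the bar sitting at 0% instead of crashing. A guard like this at the top of the script would make that failure explicit:

import torch

# Hypothetical guard, not part of AutoAWQ: fail fast instead of silently
# calibrating on CPU when the CUDA wheel cannot initialize the driver.
if not torch.cuda.is_available():
    raise RuntimeError(
        "CUDA unavailable: the cu118 wheels require a driver that supports "
        "CUDA 11.8, but this driver only supports 11.7."
    )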

Could I request your assistance? I would be very grateful.

XiaoYu2022 commented 1 month ago

🫰💖🫰💖🥺🥺🥺