casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/

ImportError: libcudart.so.12: cannot open shared object file: No such file or directory #284

Closed: andysingal closed this issue 5 months ago

andysingal commented 9 months ago

While running the following code:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'mistralai/Mixtral-8x7B-Instruct-v0.1'
quant_path = 'mixtral-instruct-awq'
modules_to_not_convert = ["gate"]  # keep the MoE routing gate unquantized
quant_config = {
    "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM",
    "modules_to_not_convert": modules_to_not_convert
}

# Load model
# NOTE: pass safetensors=True to load safetensors
model = AutoAWQForCausalLM.from_pretrained(
    model_path, safetensors=True, **{"low_cpu_mem_usage": True}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(
    tokenizer,
    quant_config=quant_config,
    modules_to_not_convert=modules_to_not_convert
)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')

I get the following error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[3], line 1
----> 1 from awq import AutoAWQForCausalLM
      2 from transformers import AutoTokenizer
      4 model_path = 'mistralai/Mixtral-8x7B-Instruct-v0.1'

File /usr/local/lib/python3.10/dist-packages/awq/__init__.py:2
      1 __version__ = "0.1.8"
----> 2 from awq.models.auto import AutoAWQForCausalLM

File /usr/local/lib/python3.10/dist-packages/awq/models/__init__.py:1
----> 1 from .mpt import MptAWQForCausalLM
      2 from .llama import LlamaAWQForCausalLM
      3 from .opt import OptAWQForCausalLM

File /usr/local/lib/python3.10/dist-packages/awq/models/mpt.py:1
----> 1 from .base import BaseAWQForCausalLM
      2 from transformers.models.mpt.modeling_mpt import MptBlock as OldMptBlock, MptForCausalLM
      4 class MptAWQForCausalLM(BaseAWQForCausalLM):

File /usr/local/lib/python3.10/dist-packages/awq/models/base.py:16
     13 import transformers
     14 from transformers.modeling_utils import shard_checkpoint
---> 16 from awq.modules.linear import WQLinear_GEMM, WQLinear_GEMV
     17 from awq.utils.module import (
     18     get_named_linears,
     19     set_op_by_name,
     20     exclude_layers_to_not_quantize,
     21 )
     22 from transformers import (
     23     AutoConfig,
     24     PreTrainedModel,
   (...)
     27     CLIPImageProcessor,
     28 )

File /usr/local/lib/python3.10/dist-packages/awq/modules/linear.py:4
      2 import torch
      3 import torch.nn as nn
----> 4 import awq_inference_engine  # with CUDA kernels
      7 def make_divisible(c, divisor):
      8     return (c + divisor - 1) // divisor

ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
casper-hansen commented 9 months ago

I am guessing you have CUDA 11.8 installed and used pip install autoawq, which requires CUDA 12.1. Instead, you can install the CUDA 11.8 build of AutoAWQ (Python 3.10):

pip install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.8/autoawq-0.1.8+cu118-cp310-cp310-linux_x86_64.whl
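
If you want to confirm the mismatch first, here is a minimal diagnostic sketch (assuming a standard PyTorch install; the library names are the usual CUDA runtime sonames). It prints the CUDA version your PyTorch build targets and checks whether the CUDA 11 or CUDA 12 runtime can actually be loaded on this machine:

# Sketch: check which CUDA runtime PyTorch was built against and whether
# the CUDA 11/12 runtime libraries are loadable on this system.
import ctypes
import torch

print("PyTorch CUDA build:", torch.version.cuda)  # e.g. "11.8" or "12.1"

for lib in ("libcudart.so.11.0", "libcudart.so.12"):
    try:
        ctypes.CDLL(lib)
        print(f"{lib}: loadable")
    except OSError:
        print(f"{lib}: not found")

If only libcudart.so.11.0 loads, the cu118 wheel above is the one that matches your system.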
exceedzhang commented 9 months ago

I ran Mixtral-8x7B-Instruct-v0.1 AWQ on an RTX 4090 (24GB) and got out-of-memory errors! @casper-hansen Does it need more VRAM? Qwen-72B-Chat ran fine. (screenshot attached)

casper-hansen commented 9 months ago

I have only been able to quantize Mixtral on 48GB VRAM.
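
As a rough pre-check before starting a quantization run, a minimal sketch (illustrative only; it just reports device memory and does not predict the peak usage of the run):

import torch

# Report free and total memory on the first CUDA device, in GB.
free, total = torch.cuda.mem_get_info(0)
print(f"GPU 0: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB total")

If the total is well below the ~48 GB reported above, quantizing Mixtral locally is unlikely to fit.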