ModelCloud / GPTQModel

Apache License 2.0

GPTQModel

An easy-to-use LLM quantization and inference toolkit based on the GPTQ algorithm (weight-only quantization).


News

Mission Statement

We want GPTQModel to stay highly focused on GPTQ-based quantization and to target inference compatibility with HF Transformers, vLLM, and SGLang.

How is GPTQModel different from AutoGPTQ?

GPTQModel is an opinionated fork/refactor of AutoGPTQ with the latest bug fixes, more model support, faster quantized inference, faster quantization, and better quants (as measured in PPL). It also carries a pledge from the ModelCloud team that we, along with the open-source ML community, will make every effort to keep the library up to date with the latest advancements, model support, and bug fixes.

We will backport bug fixes to AutoGPTQ on a case-by-case basis.

Major Changes (Advantages) vs AutoGPTQ

Roadmap (Target Date: July 2024):

Model Support ( 🚀 GPTQModel only )

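| Model          |    |                  |    |           |    |            |    |
| -------------- | -- | ---------------- | -- | --------- | -- | ---------- | -- |
| Baichuan       | ✅ | DeepSeek-V2-Lite | 🚀 | Llama     | ✅ | Phi/Phi-3  | 🚀 |
| Bloom          | ✅ | Falcon           | ✅ | LongLLaMA | ✅ | Qwen       | ✅ |
| ChatGLM        | 🚀 | Gemma 2          | 🚀 | MiniCPM   | 🚀 | Qwen2MoE   | 🚀 |
| CodeGen        | ✅ | GPTBigCode       | ✅ | Mistral   | ✅ | RefinedWeb | ✅ |
| Cohere         | ✅ | GPTNeoX          | ✅ | Mixtral   | ✅ | StableLM   | ✅ |
| DBRX Converted | 🚀 | GPT-2            | ✅ | MOSS      | ✅ | StarCoder2 | ✅ |
| Deci           | ✅ | GPT-J            | ✅ | MPT       | ✅ | XVERSE     | ✅ |
| DeepSeek-V2    | 🚀 | InternLM         | ✅ | OPT       | ✅ | Yi         | ✅ |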

Compatibility

We aim for 100% compatibility with models quantized by AutoGPTQ <= 0.7.1 and will consider syncing future compatibility on a case-by-case basis.

Platform/GPU Requirements

GPTQModel is currently Linux-only and requires an Nvidia GPU with CUDA compute capability >= 6.0.

WSL on Windows should work as well.

ROCm/AMD support will be re-added in a future version after everything on ROCm has been validated. Only fully validated features will be re-added from the original AutoGPTQ repo.
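
The requirements above can be checked before attempting a build. The following is a minimal sketch and not part of GPTQModel: `capability_ok`, `platform_ok`, and `check_environment` are hypothetical helper names, and `check_environment` assumes PyTorch is already installed so that `torch.cuda.get_device_capability` can report the GPU's compute capability.

```python
import sys
from typing import Optional, Tuple

# Minimum CUDA compute capability stated in this README.
MIN_CUDA_CAPABILITY = (6, 0)


def capability_ok(capability: Tuple[int, int],
                  minimum: Tuple[int, int] = MIN_CUDA_CAPABILITY) -> bool:
    """Return True if a (major, minor) compute capability meets the minimum."""
    return tuple(capability) >= tuple(minimum)


def platform_ok(platform: Optional[str] = None) -> bool:
    """Return True on Linux; WSL also reports a Linux platform string."""
    platform = sys.platform if platform is None else platform
    return platform.startswith("linux")


def check_environment() -> bool:
    """Hypothetical pre-install check; requires PyTorch to be importable."""
    import torch  # imported lazily so the pure helpers work without it

    if not platform_ok():
        return False
    if not torch.cuda.is_available():
        return False
    # get_device_capability returns a (major, minor) tuple for the device.
    return capability_ok(torch.cuda.get_device_capability(0))
```

Calling `check_environment()` on the target machine returns `False` when the platform, the CUDA runtime, or the GPU generation falls short of the stated requirements.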

Install

Install from source

# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel

# compile and install
pip install -vvv --no-build-isolation .

# If you have `uv` package version 0.1.16 or higher, you can use `uv pip` for potentially better dependency management
uv pip install -vvv --no-build-isolation .

Script installation

bash install.sh

PIP (PENDING RELEASE)

pip install gptq-model --no-build-isolation

Quantization and Inference

Warning: this is only a showcase of the basic APIs in GPTQModel. It uses a single sample to quantize a very small model, and a model quantized with so few samples may be of poor quality.

Below is the simplest example of using gptqmodel to quantize a model and run inference after quantization:

from transformers import AutoTokenizer
from gptqmodel import GPTQModel, QuantizeConfig

pretrained_model_dir = "facebook/opt-125m"
quant_output_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
calibration_dataset = [
    tokenizer(
        "The world is a wonderful place full of beauty and love."
    )
]

quant_config = QuantizeConfig(
    bits=4,  # 4-bit
    group_size=128,  # 128 is a good balance between quality and performance
)

# load the un-quantized model; by default it is always loaded into CPU memory
model = GPTQModel.from_pretrained(pretrained_model_dir, quant_config)

# quantize the model; calibration_dataset should be a list of dicts whose only keys are "input_ids" and "attention_mask"
model.quantize(calibration_dataset)

# save quantized model
model.save_quantized(quant_output_dir)

# load quantized model to the first GPU
model = GPTQModel.from_quantized(quant_output_dir)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("gptqmodel is", return_tensors="pt").to(model.device))[0]))
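
Because a single calibration sample is only enough for a demonstration, a real run would pass many samples. The helper below is a sketch under stated assumptions: `build_calibration_dataset` is a hypothetical name rather than a GPTQModel API, and `max_length` is an illustrative cap. It accepts any tokenizer-like callable that returns a mapping and keeps only the `input_ids` and `attention_mask` keys that `model.quantize` expects.

```python
from typing import Callable, Dict, Iterable, List, Mapping

# The only keys a calibration sample may contain, per the comment above.
ALLOWED_KEYS = ("input_ids", "attention_mask")


def build_calibration_dataset(
    texts: Iterable[str],
    tokenize: Callable[[str], Mapping[str, list]],
    max_length: int = 512,
) -> List[Dict[str, list]]:
    """Tokenize texts into a list of dicts suitable for calibration."""
    dataset = []
    for text in texts:
        text = text.strip()
        if not text:
            continue  # skip empty lines, they carry no calibration signal
        encoded = tokenize(text)
        # Drop any extra keys (e.g. token_type_ids) and truncate long samples.
        sample = {key: list(encoded[key])[:max_length] for key in ALLOWED_KEYS}
        if sample["input_ids"]:
            dataset.append(sample)
    return dataset
```

With the `tokenizer` from the example above and your own list of strings `texts`, this would be used as `calibration_dataset = build_calibration_dataset(texts, tokenizer)` followed by `model.quantize(calibration_dataset)`.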

For more advanced features of model quantization, please refer to this script
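
To make the `bits` and `group_size` settings in the example above more concrete, the sketch below quantizes a row of weights group by group, giving each group its own scale and offset. It is plain round-to-nearest quantization, not the GPTQ algorithm itself, which additionally adjusts the remaining weights to compensate for rounding error. All function names here are hypothetical and none of this is GPTQModel code.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float, float]:
    """Round-to-nearest asymmetric quantization of one group of weights."""
    qmax = (1 << bits) - 1  # 15 for 4-bit codes
    w_min, w_max = min(weights), max(weights)
    # A constant group has zero range; any non-zero scale works for it.
    scale = (w_max - w_min) / qmax if w_max > w_min else 1.0
    codes = [min(qmax, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes: List[int], scale: float, w_min: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [w_min + c * scale for c in codes]


def quantize_row(row: List[float], bits: int = 4, group_size: int = 128) -> List[float]:
    """Quantize and restore a row, one group of `group_size` weights at a time."""
    restored: List[float] = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        restored.extend(dequantize_group(*quantize_group(group, bits)))
    return restored
```

A smaller `group_size` means more scales and offsets to store but a tighter value range per group, which is the quality versus size trade-off the `group_size=128` comment refers to.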

How to Add Support for a New Model

Read the code in gptqmodel/models/llama.py, which explains in detail, via comments, how model support is defined. Use it as a guide when opening a PR for a new model. Most models follow the same pattern.

Evaluation on Downstream Tasks

You can use the tasks defined in gptqmodel.eval_tasks to evaluate a model's performance on a specific downstream task before and after quantization.

The predefined tasks support all causal-language-models implemented in 🤗 transformers and in this project.

Below is an example that evaluates `EleutherAI/gpt-j-6b` on a sequence-classification task using the `cardiffnlp/tweet_sentiment_multilingual` dataset:

```python
from functools import partial

import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

from gptqmodel import GPTQModel, QuantizeConfig
from gptqmodel.eval_tasks import SequenceClassificationTask

MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
    0: "negative",
    1: "neutral",
    2: "positive"
}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    text_data = samples["text"]
    label_data = samples["label"]

    new_samples = {"prompt": [], "label": []}
    for text, label in zip(text_data, label_data):
        prompt = TEMPLATE.format(labels=LABELS, text=text)
        new_samples["prompt"].append(prompt)
        new_samples["label"].append(ID2LABEL[label])

    return new_samples


# model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = GPTQModel.from_pretrained(MODEL, QuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)

task = SequenceClassificationTask(
    model=model,
    tokenizer=tokenizer,
    classes=LABELS,
    data_name_or_path=DATASET,
    prompt_col_name="prompt",
    label_col_name="label",
    **{
        "num_samples": 1000,  # how many samples will be sampled to evaluation
        "sample_max_len": 1024,  # max tokens for each sample
        "block_max_len": 2048,  # max tokens for each data block
        # function to load dataset, one must only accept data_name_or_path as input
        # and return datasets.Dataset
        "load_fn": partial(datasets.load_dataset, name="english"),
        # function to preprocess dataset, which is used for datasets.Dataset.map,
        # must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
        "preprocess_fn": ds_refactor_fn,
        # truncate label when sample's length exceed sample_max_len
        "truncate_prompt": False
    }
)

# note that max_new_tokens will be automatically specified internally based on given classes
print(task.run())

# self-consistency
print(
    task.run(
        generation_config=GenerationConfig(
            num_beams=3,
            num_return_sequences=3,
            do_sample=True
        )
    )
)
```

Learn More

Tutorials provide step-by-step guidance for integrating gptqmodel into your own project, along with some best-practice principles.

Examples provide plenty of scripts that use gptqmodel in different ways.

Supported Evaluation Tasks

Currently, gptqmodel supports: LanguageModelingTask, SequenceClassificationTask and TextSummarizationTask; more Tasks will come soon!

Which kernel is used by default?

GPTQModel will use the Marlin, Exllama v2, and Triton kernels, in that order of preference, for maximum inference performance.
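
The preference order above amounts to a first-match fallback. The sketch below only illustrates that idea and is not GPTQModel's actual selection code: `KERNEL_PRIORITY`, `select_kernel`, and the `is_supported` callback are hypothetical names, and the kernel identifiers are plain strings chosen for the example.

```python
from typing import Callable, Sequence

# Preference order as described in this README, fastest candidate first.
KERNEL_PRIORITY = ("marlin", "exllama_v2", "triton")


def select_kernel(
    is_supported: Callable[[str], bool],
    priority: Sequence[str] = KERNEL_PRIORITY,
) -> str:
    """Return the first kernel in priority order that the caller reports as usable."""
    for name in priority:
        if is_supported(name):
            return name
    raise RuntimeError("no supported kernel among: " + ", ".join(priority))
```

For example, `select_kernel(lambda name: name != "marlin")` falls through to `"exllama_v2"` when the first choice is unavailable for a given model or GPU.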

Acknowledgements