EETQ

中文README (Chinese README)

Easy & Efficient Quantization for Transformers

Table of Contents

- Features
- Getting started
- Usage
- Examples
- Performance

Getting started

Environment

The required environment is a minimum configuration; newer versions are recommended.

Installation

Using the provided Dockerfile is recommended.

$ git clone https://github.com/NetEase-FuXi/EETQ.git
$ cd EETQ/
$ git submodule update --init --recursive
$ pip install .

If your machine has less than 96GB of RAM and many CPU cores, ninja might run too many parallel compilation jobs and exhaust the available RAM. To limit the number of parallel compilation jobs, set the environment variable MAX_JOBS:

$ MAX_JOBS=4 pip install .
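
After installation, a quick sanity check is to import the helpers used in the usage examples below; this is a minimal smoke test, assuming the CUDA extension built successfully:

```python
# Post-install smoke test: these imports fail if the EETQ extension did not build.
# The helper names are the ones used in the Usage section below.
from eetq.utils import eet_quantize, eet_accelerator

print("EETQ helpers imported successfully")
```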

Usage

  1. Use EETQ in transformers.

    from transformers import AutoModelForCausalLM, EetqConfig
    path = "/path/to/model"
    quantization_config = EetqConfig("int8")
    model = AutoModelForCausalLM.from_pretrained(path, device_map="auto", quantization_config=quantization_config)

    A quantized model can be saved via `save_pretrained` and loaded again later via `from_pretrained`; an end-to-end sketch follows this list.

    quant_path = "/path/to/save/quantized/model"
    model.save_pretrained(quant_path)
    model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
  2. Quantize a torch model.

    from eetq.utils import eet_quantize
    eet_quantize(torch_model)
  3. Quantize a torch model and optimize it with fused (flash) attention.

    import torch
    from transformers import AutoModelForCausalLM
    from eetq.utils import eet_accelerator

    ...
    model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16)
    eet_accelerator(model, quantize=True, fused_attn=True, dev="cuda:0")
    model.to("cuda:0")

    # inference
    res = model.generate(...)

  4. Use EETQ in [TGI](https://github.com/huggingface/text-generation-inference). See [this PR](https://github.com/huggingface/text-generation-inference/pull/1068).

    text-generation-launcher --model-id mistralai/Mistral-7B-v0.1 --quantize eetq ...
  5. Use EETQ in LoRAX. See the LoRAX docs.

    lorax-launcher --model-id mistralai/Mistral-7B-v0.1 --quantize eetq ...
  6. Load a quantized model in vLLM (work in progress; see the "Support vllm" PR). A sketch of querying the launched server follows this list.

    python -m vllm.entrypoints.openai.api_server --model /path/to/quantized/model --quantization eetq --trust-remote-code
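
The per-step snippets above can be combined into one script. The following is a minimal end-to-end sketch of the transformers path (item 1), not an official example: the model paths, prompt, and generation settings are placeholders, and it assumes a CUDA GPU is available.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, EetqConfig

path = "/path/to/model"                        # placeholder: original fp16 checkpoint
quant_path = "/path/to/save/quantized/model"   # placeholder: output directory

# Quantize to int8 while loading, then persist the quantized weights.
quantization_config = EetqConfig("int8")
model = AutoModelForCausalLM.from_pretrained(
    path, device_map="auto", quantization_config=quantization_config
)
model.save_pretrained(quant_path)

# Reload the quantized checkpoint and run generation.
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```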
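
Once the vLLM server from item 6 is running, it serves vLLM's standard OpenAI-compatible HTTP API. Below is a hedged query example using the requests library; the host, the default port 8000, and the model path mirror the launch command above and are assumptions about vLLM defaults, not EETQ-specific behavior.

```python
import requests

# vLLM's OpenAI-compatible server listens on port 8000 by default.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "/path/to/quantized/model",  # must match the --model argument above
        "prompt": "Hello, my name is",
        "max_tokens": 32,
    },
)
print(response.json()["choices"][0]["text"])
```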

Examples

Performance