SqueezeBits / QUICK

QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
MIT License

Introducing QUICK, a collection of novel optimized CUDA kernels designed for faster inference of quantized Large Language Models (LLMs). QUICK addresses the shared memory write-back bank conflict issue in state-of-the-art mixed precision General Matrix Multiplication (GEMM) kernels.

Figure: Computation overview of the original kernel and QUICK.

🏎️ Why QUICK?

📝 About QUICK

QUICK eliminates shared memory write-back bank conflicts introduced in previous mixed precision GEMM kernels.

The bank conflicts arise when the dequantized weights are written back to shared memory for subsequent computations. Consequently, bank conflicts induce a significant number of stalls, thereby deteriorating the overall throughput of mixed precision GEMM, especially for workloads with large batches.

QUICK rearranges the quantized weight matrix offline to remove the bank conflicts effectively. This rearrangement aligns with the load and computation pattern of Tensor Cores in NVIDIA GPUs without the need for shared memory write-back.
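To make the notion of a bank conflict concrete, the following sketch models shared memory as 32 banks of 4-byte words (the layout on current NVIDIA GPUs) and counts how many distinct words one warp sends to the same bank. The two access patterns and the helper name `conflict_degree` are illustrative assumptions; they are not the actual access patterns of the AWQ or QUICK kernels.

```python
from collections import defaultdict

NUM_BANKS = 32   # shared memory banks per SM on current NVIDIA GPUs
BANK_WIDTH = 4   # bytes per bank word


def conflict_degree(byte_addresses):
    """Worst-case number of distinct words that one warp maps to a single bank.

    A result of 1 means the access is conflict-free; a result of n means the
    hardware has to serialize the access into n passes.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


# Hypothetical pattern 1: 32 threads write consecutive 4-byte words.
contiguous = [tid * BANK_WIDTH for tid in range(32)]

# Hypothetical pattern 2: 32 threads write with a stride of 8 words.
strided = [tid * 8 * BANK_WIDTH for tid in range(32)]

print(conflict_degree(contiguous))  # 1 (conflict-free)
print(conflict_degree(strided))     # 8 (8-way bank conflict)
```

In this model, the strided write takes eight serialized passes where the contiguous one takes a single pass, which is the kind of stall described above.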

🚀 Install

📖 Prerequisites

🏗️ Build from source

🔍 Usage

  1. Quantization: Run AWQ with either our kernel (QUICK) or the original AWQ kernel (GEMM).

    python examples/basic_quant.py --model_path </path/to/hf-model> --quant_path </path/to/save/quant-model> --version <QUICK or GEMM>

  2. Evaluation: Evaluate the quantized model on one or more tasks (we tested on the 'wikitext' dataset).

    python examples/eval.py --model_path </path/to/quant-model> --tasks <tasks_to_evaluate>

  3. Benchmark: Reproduce the end-to-end benchmark results reported below on your own machine.

    python examples/benchmark.py --model_path </path/to/quant-model> --batch_size N

Below is a minimal example that uses auto_awq with QUICK to quantize a model and then run inference with it:

Quantization & Inference

Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.

```python
from quick.awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

model_path = 'mistralai/Mistral-7B-v0.1'
quant_path = 'Mistral-7B-v0.1-awq-QUICK'
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "QUICK"}

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# Convert prompt to tokens
prompt_template = "[INST] {prompt} [/INST]"

# prompt = "Explain quantum physics to a five-year-old using only metaphors."
# Note the trailing space so the two string pieces do not run together.
prompt = (
    "What is the birth year of Albert Einstein? "
    "and what famous equation is Albert Einstein known for?"
)

tokens = tokenizer(
    prompt_template.format(prompt=prompt),
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```

📊 Benchmarks

These benchmarks show the improvement in both single mixed precision GEMM throughput and end-to-end inference throughput of weight-only quantized LLMs. Kernel results are reported in Tera Operations per Second (TOPS) for single matrix multiplications of shape (M x K x N) across various M sizes, measured on multiple GPU devices. End-to-end results report the token generation throughput of representative weight-only quantized LLMs across several GPU devices. For the end-to-end benchmarks, we fixed the prefill length and decode length to 128 each so that various batch sizes could be tested on a single GPU.

To ensure a fair comparison, we used the same benchmark script as AutoAWQ. We also observed that the perplexity of quantized LLMs remains consistent with AutoAWQ when using QUICK. Note that benchmark results may vary across GPUs, CPUs, and inference frameworks.
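As a rough illustration of how a tera-operations-per-second figure can be derived for a single (M x K x N) matrix multiplication, the sketch below assumes the common convention of counting 2 * M * K * N operations (one multiply and one add per inner-product term). The function name `gemm_tops`, the shape, and the latency are hypothetical values chosen for the example; they are not taken from the benchmark script.

```python
def gemm_tops(m, k, n, seconds):
    """Throughput of one (M x K) @ (K x N) matmul in tera-operations per second.

    Assumes 2 * M * K * N operations per matmul (multiply + add).
    """
    ops = 2 * m * k * n
    return ops / seconds / 1e12


# Hypothetical example: M=128, K=N=8192, measured latency of 1 ms.
print(f"{gemm_tops(128, 8192, 8192, 1e-3):.2f} TOPS")  # 17.18 TOPS
```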

📈 Kernel benchmarks

Kernel Benchmark

| Device   | Batch Size | FP16 (TFLOPS) | AWQ (TFLOPS) | QUICK (TFLOPS) | Speedup-FP16 | Speedup-AWQ |
|----------|------------|---------------|--------------|----------------|--------------|-------------|
| RTX 4090 | 1          | 0.8           | 3.11         | 3.12           | 290%         | 0%          |
|          | 64         | 56.4          | 91.39        | 111.99         | 99%          | 23%         |
|          | 128        | 138.2         | 104.36       | 138.59         | 0%           | 33%         |
| A6000    | 1          | 0.7           | 2.21         | 2.28           | 226%         | 3%          |
|          | 64         | 39.4          | 44.85        | 80.42          | 104%         | 79%         |
|          | 128        | 81.7          | 46.05        | 83.62          | 2%           | 82%         |
| L40      | 1          | 0.7           | 2.41         | 2.51           | 259%         | 4%          |
|          | 64         | 44.4          | 72.9         | 97.73          | 120%         | 34%         |
|          | 128        | 148           | 64.28        | 107.67         | -27%         | 68%         |
| A100     | 1          | 1.4           | 2.94         | 3.52           | 151%         | 20%         |
|          | 16         | 22.6          | 34.23        | 48.14          | 113%         | 41%         |
|          | 32         | 46.4          | 48.83        | 69.03          | 49%          | 41%         |
|          | 64         | 91.9          | 57.43        | 76.31          | -17%         | 33%         |
|          | 128        | 157.4         | 58.46        | 94.03          | -40%         | 61%         |

"Speedup-FP16/AWQ" is the relative throughput gain of QUICK over the FP16/AWQ kernel.

📈 End-to-end benchmarks

E2E Benchmark

| Model       | Device   | Batch Size | FP16 (tok/s) | AWQ (tok/s) | QUICK (tok/s) | Speedup-FP16 | Speedup-AWQ |
|-------------|----------|------------|--------------|-------------|---------------|--------------|-------------|
| Mistral-7B  | RTX 4090 | 1          | 52.8         | 154.0       | 137.3         | 160%         | -11%        |
|             |          | 64         | 2985.6       | 4465.9      | 4539.8        | 52%          | 2%          |
|             |          | 256        | OOM          | 5156.9      | 7316.9        | -            | 42%         |
| Vicuna-13B  | A6000    | 1          | 23.6         | 65.0        | 68.5          | 191%         | 5%          |
|             |          | 64         | 1194.0       | 1241.3      | 2023.4        | 69%          | 63%         |
|             |          | 256        | OOM          | 1332.1      | 2330.2        | -            | 75%         |
| Llama-2-13B | L40      | 1          | 23.5         | 70.2        | 72.5          | 208%         | 3%          |
|             |          | 64         | 1315.2       | 1580.4      | 2262.4        | 72%          | 43%         |
|             |          | 256        | OOM          | 1611.3      | 3122.4        | -            | 94%         |
| Llama-30B   | A100     | 1          | 20.0         | 36.7        | 31.1          | 55%          | -15%        |
|             |          | 64         | OOM          | 695.2       | 1207.9        | -            | 74%         |
|             |          | 128        | OOM          | 759.4       | 1165.2        | -            | 53%         |

📈 vLLM benchmarks

We are actively working on integrating QUICK into widely used LLM frameworks. This section presents throughput benchmark results for our initial integration of QUICK into vLLM.

| Model       | FP16 (tok/s) | AWQ (tok/s) | QUICK (tok/s) | Speedup-FP16 | Speedup-AWQ |
|-------------|--------------|-------------|---------------|--------------|-------------|
| Vicuna-13B  | 985.2        | 1030.4      | 1308.6        | 33%          | 27%         |
| Llama-2-70B | OOM          | 224.3       | 290.2         | -            | 29%         |
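The Speedup-FP16 and Speedup-AWQ columns in the tables above appear to be the relative throughput gain of QUICK over each baseline, rounded to a whole percent. The sketch below states that reading explicitly; the formula and the helper name `speedup_percent` are our assumption, checked only against the rows shown in the code, and a few table entries may differ by a percent because the published throughputs are themselves rounded.

```python
def speedup_percent(quick, baseline):
    """Relative throughput gain of QUICK over a baseline, in percent.

    Assumed formula: (quick / baseline - 1) * 100. Negative values mean
    QUICK is slower than the baseline.
    """
    return (quick / baseline - 1.0) * 100.0


# Kernel benchmark, RTX 4090, batch size 64 (TFLOPS).
print(round(speedup_percent(111.99, 56.4)))    # 99  (vs FP16)
print(round(speedup_percent(111.99, 91.39)))   # 23  (vs AWQ)

# Kernel benchmark, L40, batch size 128: QUICK is slower than FP16 here.
print(round(speedup_percent(107.67, 148)))     # -27 (vs FP16)

# vLLM benchmark, Vicuna-13B (tok/s).
print(round(speedup_percent(1308.6, 985.2)))   # 33  (vs FP16)
print(round(speedup_percent(1308.6, 1030.4)))  # 27  (vs AWQ)
```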

🙋‍♂️ Frequently Asked Questions

Inference fails due to the absence of the layernorm kernel.

- QUICK currently relies on the AutoAWQ_kernels package for the optimized CUDA kernels of layers such as layernorm and fused multi-head attention. Please install that package using the provided link.

📚 Cite

If you find our code or QUICK useful for your research, please consider citing:

@misc{kim2024quick,
      title={QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference}, 
      author={Taesu Kim and Jongho Lee and Daehyun Ahn and Sarang Kim and Jiwoong Choi and Minkyu Kim and Hyungjun Kim},
      year={2024},
      eprint={2402.10076},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}