NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Quantizing lm_head for gemma (and others) #1394

Closed dasistwo closed 5 months ago

dasistwo commented 7 months ago

System Info

TL;DR:

  1. The lm_head was only fake-quantized, at least with the int4_awq and int8_sq configurations. The models were Gemma-2B, Gemma-7B, and Llama-2-7B. How can I make it "really" quantized so that the weights are actually compressed (e.g., stored as int4)?
  2. Using lm_head quantization with the Gemma models broke the output, which did not happen with the Llama-2 model.

Environment

Who can help?

@Tracin

Information

Tasks

Reproduction

Fake-quantized lm_head

  1. Modify the preset in the quantize_by_ammo.py file in the package to enable lm_head quantization.
    
```python
# /usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_ammo.py:92
def quant_cfg_choices():
    import ammo.torch.quantization as atq
    QUANT_CFG_CHOICES = {
        "int8_sq": atq.INT8_SMOOTHQUANT_CFG,
        "fp8": atq.FP8_DEFAULT_CFG,
        "w4a8_awq": atq.W4A8_AWQ_BETA_CFG,
        "int8_wo": EMPTY_CFG,
        "int4_wo": EMPTY_CFG,
        "full_prec": EMPTY_CFG,
        "int4_awq": {  # Customized
            "quant_cfg": {
                "weight_quantizer": {"num_bits": 4, "block_sizes": {-1: 128}, "enable": True},
                "input_quantizer": {"enable": False},
                "lm_head": {"enable": True},
                "output_layer": {"enable": False},
                "default": {"enable": False},
            },
            "algorithm": {"method": "awq_lite", "alpha_step": 0.1},
        },
    }
```


You can reproduce the same result by modifying the `int8_sq` configuration in the same way (a sketch of that change follows step 2).

2. Quantize through the `examples/quantization/quantize.py` file.

```bash
python3 ../quantization/quantize.py --model_dir ${HF_MODEL_PATH} \
    --dtype float16 --qformat int4_awq --output_dir ${UNIFIED_CKPT_PATH} --tp_size 1
```
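For reference, here is a minimal sketch of the analogous `int8_sq` change from step 1. This is my own reconstruction, not code from the thread; it assumes `atq.INT8_SMOOTHQUANT_CFG` is a plain dict with a `quant_cfg` section and a wildcard `*lm_head*` entry, which may differ across AMMO versions.

```python
# Hypothetical sketch: enable lm_head quantization for the int8_sq preset by
# copying the default SmoothQuant config and flipping its lm_head entry.
import copy

import ammo.torch.quantization as atq

int8_sq_lm_head_cfg = copy.deepcopy(atq.INT8_SMOOTHQUANT_CFG)
int8_sq_lm_head_cfg["quant_cfg"]["*lm_head*"] = {"enable": True}

# ...then point QUANT_CFG_CHOICES["int8_sq"] in quant_cfg_choices() at this config.
```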
I found that the lm_head.weight was not quantized. For example, in the Gemma 7B model,

```
cd ${UNIFIED_CKPT_PATH}
python
>>> from safetensors import safe_open
>>> f = safe_open("rank0.safetensors", framework="pt", device=0)
>>> f.get_tensor("lm_head.weight").size()
torch.Size([256000, 3072])
>>> f.get_tensor("lm_head.weight").dtype
torch.float16
>>> for k in f.keys():
...   if 'lm_head' in k:
...     print(k)
... 
lm_head.activation_scaling_factor
lm_head.prequant_scaling_factor
lm_head.weight
lm_head.weights_scaling_factor
```

The result was the same for the Gemma-2B and Llama-2-7B models.
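For scale, here is a rough back-of-the-envelope check of what real compression would save on this tensor. These are my own numbers, assuming 2 bytes per fp16 weight and 0.5 bytes per packed int4 weight, ignoring scaling factors.

```python
# Rough size estimate for the Gemma-7B lm_head (256000 x 3072) inspected above.
# Assumption: fp16 stores 2 bytes/weight; a truly packed int4 tensor would need
# about 0.5 bytes/weight (scaling factors ignored).
rows, cols = 256_000, 3072
fp16_gib = rows * cols * 2 / 2**30    # ~1.46 GiB, as currently stored
int4_gib = rows * cols * 0.5 / 2**30  # ~0.37 GiB if the weights were really packed
print(f"fp16 lm_head: {fp16_gib:.2f} GiB, packed int4: {int4_gib:.2f} GiB")
```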

Result of quantized lm_head

  1. Build the engine

    trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} --gemm_plugin float16 \
    --gpt_attention_plugin float16 --lookup_plugin float16 --max_batch_size 8 \
    --max_input_len 256 --max_output_len 256 --gather_all_token_logits \
    --enable_xqa enable --context_fmha enable --output_dir ${ENGINE_PATH}
  2. Test it with summarize.py

    python3 ../summarize.py --test_trt_llm --engine_dir ${ENGINE_PATH} \
    --max_input_length 256 --batch_size 8 --max_ite 5 --eval_ppl \
    --vocab_file ${VOCAB_FILE_PATH}

The result from the Llama-2 model was acceptable.

```
# lm_head quantized int4-AWQ llama-2 7B result
[04/02/2024-19:11:32] [TRT-LLM] [I] TensorRT-LLM (total latency: 5.553373098373413 sec)
[04/02/2024-19:11:32] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 4000)
[04/02/2024-19:11:32] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 720.2829576085213)
[04/02/2024-19:11:32] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[04/02/2024-19:11:32] [TRT-LLM] [I]   rouge1 : 8.275318677554012
[04/02/2024-19:11:32] [TRT-LLM] [I]   rouge2 : 0.04273504273504273
[04/02/2024-19:11:32] [TRT-LLM] [I]   rougeL : 6.7506633186696865
[04/02/2024-19:11:32] [TRT-LLM] [I]   rougeLsum : 7.960074361569913
[04/02/2024-19:11:32] [TRT-LLM] [I]   Per-token perplexity: 6.3687220275402066

# lm_head not quantized int4-AWQ llama-2 7B result
[04/02/2024-19:11:02] [TRT-LLM] [I] TensorRT-LLM (total latency: 5.527586936950684 sec)
[04/02/2024-19:11:02] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 4000)
[04/02/2024-19:11:02] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 723.6430734830951)
[04/02/2024-19:11:02] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[04/02/2024-19:11:02] [TRT-LLM] [I]   rouge1 : 5.290267273128518
[04/02/2024-19:11:02] [TRT-LLM] [I]   rouge2 : 0.6819251087961309
[04/02/2024-19:11:02] [TRT-LLM] [I]   rougeL : 4.785798335241817
[04/02/2024-19:11:02] [TRT-LLM] [I]   rougeLsum : 5.001144738190409
[04/02/2024-19:11:02] [TRT-LLM] [I]   Per-token perplexity: 5.8846661925315855
```

But the Gemma-7B engine could not generate even a single proper token.

```
# lm_head quantized int4-AWQ Gemma 7B result
[04/02/2024-15:09:58] [TRT-LLM] [I] TensorRT-LLM (total latency: 5.9773688316345215 sec)
[04/02/2024-15:09:58] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 4000)
[04/02/2024-15:09:58] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 669.1907614652237)
[04/02/2024-15:09:58] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[04/02/2024-15:09:58] [TRT-LLM] [I]   rouge1 : 0.0
[04/02/2024-15:09:58] [TRT-LLM] [I]   rouge2 : 0.0
[04/02/2024-15:09:58] [TRT-LLM] [I]   rougeL : 0.0
[04/02/2024-15:09:58] [TRT-LLM] [I]   rougeLsum : 0.0
[04/02/2024-15:09:58] [TRT-LLM] [I]   Per-token perplexity: inf

# lm_head not quantized int4-AWQ Gemma 7B result
[04/02/2024-15:09:23] [TRT-LLM] [I] TensorRT-LLM (total latency: 5.9766669273376465 sec)
[04/02/2024-15:09:23] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 3820)
[04/02/2024-15:09:23] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 639.152230907679)
[04/02/2024-15:09:23] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[04/02/2024-15:09:23] [TRT-LLM] [I]   rouge1 : 15.085234543827761
[04/02/2024-15:09:23] [TRT-LLM] [I]   rouge2 : 3.085591159231036
[04/02/2024-15:09:23] [TRT-LLM] [I]   rougeL : 10.889472087599273
[04/02/2024-15:09:23] [TRT-LLM] [I]   rougeLsum : 12.451584690196892
[04/02/2024-15:09:23] [TRT-LLM] [I]   Per-token perplexity: 7.5976266145706175
```

Expected behavior

Mentioned above

Actual behavior

Mentioned above

Additional notes

  1. When I applied quantization to the lm_head, it was only fake-quantized, at least with the int4 weight-only AWQ and int8 SmoothQuant configurations. The models tested were Gemma-2B, Gemma-7B, and Llama-2-7B.
  2. Using lm_head quantization with the Gemma models broke the output, which did not happen with the Llama-2 model. Also, how do lm_head and output_layer differ in the quantization configuration? What does output_layer refer to here?
Tracin commented 7 months ago

@dasistwo If `*lm_head*: enable` doesn't work, I think AMMO has probably removed the support. @RalphMao Do we have any approach to quantize lm_head now?

dasistwo commented 6 months ago

I think quantizing lm_head would free up some memory, especially for small models. As far as I can see, the trend in recently released models is toward huge vocabulary sizes, including Llama-3. @Tracin Why was the feature removed? Was a significant accuracy drop expected?
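As a rough illustration of that point (my own estimate, not from the thread; it assumes a vocabulary of 256000, Gemma-2B's hidden size of 2048, roughly 2.5B total parameters, and an untied lm_head tensor as in the converted checkpoint):

```python
# Approximate share of Gemma-2B parameters sitting in the lm_head alone.
# Assumptions (hypothetical, not from the thread): vocab 256000, hidden 2048,
# ~2.5e9 total parameters, lm_head kept as a separate (untied) fp16 tensor.
vocab, hidden, total_params = 256_000, 2048, 2.5e9
lm_head_params = vocab * hidden                                              # ~0.52B weights
print(f"lm_head share of parameters: {lm_head_params / total_params:.0%}")   # ~21%
print(f"fp16 lm_head size: {lm_head_params * 2 / 2**30:.2f} GiB")            # ~0.98 GiB
```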

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.