NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Quantizing lm_head for gemma (and others) #1394

Closed dasistwo closed 5 months ago

dasistwo commented 7 months ago

System Info

TL;DR:

  1. The lm_head was only fake-quantized, at least with the int4_awq and int8_sq configurations. The models were Gemma-2B, Gemma-7B, and Llama-2-7B. How can I make it "really" quantized so that the weights are actually compressed (e.g., stored as int4)?
  2. Using lm_head quantization with the Gemma models broke the output, which did not happen with the Llama-2 model.

Environment

Who can help?

@Tracin

Information

Tasks

Reproduction

Fake-quantized lm_head

  1. Modify the preset in the quantize_by_ammo.py file in the package to enable lm_head quantization.
    
```python
# /usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_ammo.py:92
def quant_cfg_choices():
    import ammo.torch.quantization as atq
    QUANT_CFG_CHOICES = {
        "int8_sq": atq.INT8_SMOOTHQUANT_CFG,
        "fp8": atq.FP8_DEFAULT_CFG,
        "w4a8_awq": atq.W4A8_AWQ_BETA_CFG,
        "int8_wo": EMPTY_CFG,
        "int4_wo": EMPTY_CFG,
        "full_prec": EMPTY_CFG,
        "int4_awq": {  # Customized
            "quant_cfg": {
                "weight_quantizer": {"num_bits": 4, "block_sizes": {-1: 128}, "enable": True},
                "input_quantizer": {"enable": False},
                "lm_head": {"enable": True},
                "output_layer": {"enable": False},
                "default": {"enable": False},
            },
            "algorithm": {"method": "awq_lite", "alpha_step": 0.1},
        },
    }
```


You can reproduce the same result by modifying the `int8_sq` configuration in the same way (a sketch of that change follows step 2).

2. Quantize through the `examples/quantization/quantize.py` file.

```bash
python3 ../quantization/quantize.py --model_dir ${HF_MODEL_PATH} \
    --dtype float16 --qformat int4_awq --output_dir ${UNIFIED_CKPT_PATH} --tp_size 1
```
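For reference, here is a minimal sketch of the analogous `int8_sq` change from step 1. This is my own reconstruction, not code from the thread; it assumes `atq.INT8_SMOOTHQUANT_CFG` is a plain dict with a `quant_cfg` section and a wildcard `*lm_head*` entry, which may differ across AMMO versions.

```python
# Hypothetical sketch: enable lm_head quantization for the int8_sq preset by
# copying the default SmoothQuant config and flipping its lm_head entry.
import copy

import ammo.torch.quantization as atq

int8_sq_lm_head_cfg = copy.deepcopy(atq.INT8_SMOOTHQUANT_CFG)
int8_sq_lm_head_cfg["quant_cfg"]["*lm_head*"] = {"enable": True}

# ...then point QUANT_CFG_CHOICES["int8_sq"] in quant_cfg_choices() at this config.
```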
I found that the lm_head.weight was not quantized. For example, in the Gemma 7B model,

```
cd ${UNIFIED_CKPT_PATH}
python
>>> from safetensors import safe_open
>>> f = safe_open("rank0.safetensors", framework="pt", device=0)
>>> f.get_tensor("lm_head.weight").size()
torch.Size([256000, 3072])
>>> f.get_tensor("lm_head.weight").dtype
torch.float16
>>> for k in f.keys():
...   if 'lm_head' in k:
...     print(k)
... 
lm_head.activation_scaling_factor
lm_head.prequant_scaling_factor
lm_head.weight
lm_head.weights_scaling_factor
```

The result was the same for the Gemma-2B and Llama-2-7B models.
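For scale, here is a rough back-of-the-envelope check of what real compression would save on this tensor. These are my own numbers, assuming 2 bytes per fp16 weight and 0.5 bytes per packed int4 weight, ignoring scaling factors.

```python
# Rough size estimate for the Gemma-7B lm_head (256000 x 3072) inspected above.
# Assumption: fp16 stores 2 bytes/weight; a truly packed int4 tensor would need
# about 0.5 bytes/weight (scaling factors ignored).
rows, cols = 256_000, 3072
fp16_gib = rows * cols * 2 / 2**30    # ~1.46 GiB, as currently stored
int4_gib = rows * cols * 0.5 / 2**30  # ~0.37 GiB if the weights were really packed
print(f"fp16 lm_head: {fp16_gib:.2f} GiB, packed int4: {int4_gib:.2f} GiB")
```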

Result of quantized lm_head

  1. Build the engine

    trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} --gemm_plugin float16 \
    --gpt_attention_plugin float16 --lookup_plugin float16 --max_batch_size 8 \
    --max_input_len 256 --max_output_len 256 --gather_all_token_logits \
    --enable_xqa enable --context_fmha enable --output_dir ${ENGINE_PATH}
  2. Test it with summarize.py

    python3 ../summarize.py --test_trt_llm --engine_dir ${ENGINE_PATH} \
    --max_input_length 256 --batch_size 8 --max_ite 5 --eval_ppl \
    --vocab_file ${VOCAB_FILE_PATH}

The result from the Llama-2 model was acceptable.

```
# lm_head quantized int4-AWQ llama-2 7B result
[04/02/2024-19:11:32] [TRT-LLM] [I] TensorRT-LLM (total latency: 5.553373098373413 sec)
[04/02/2024-19:11:32] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 4000)
[04/02/2024-19:11:32] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 720.2829576085213)
[04/02/2024-19:11:32] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[04/02/2024-19:11:32] [TRT-LLM] [I]   rouge1 : 8.275318677554012
[04/02/2024-19:11:32] [TRT-LLM] [I]   rouge2 : 0.04273504273504273
[04/02/2024-19:11:32] [TRT-LLM] [I]   rougeL : 6.7506633186696865
[04/02/2024-19:11:32] [TRT-LLM] [I]   rougeLsum : 7.960074361569913
[04/02/2024-19:11:32] [TRT-LLM] [I]   Per-token perplexity: 6.3687220275402066

# lm_head not quantized int4-AWQ llama-2 7B result
[04/02/2024-19:11:02] [TRT-LLM] [I] TensorRT-LLM (total latency: 5.527586936950684 sec)
[04/02/2024-19:11:02] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 4000)
[04/02/2024-19:11:02] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 723.6430734830951)
[04/02/2024-19:11:02] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[04/02/2024-19:11:02] [TRT-LLM] [I]   rouge1 : 5.290267273128518
[04/02/2024-19:11:02] [TRT-LLM] [I]   rouge2 : 0.6819251087961309
[04/02/2024-19:11:02] [TRT-LLM] [I]   rougeL : 4.785798335241817
[04/02/2024-19:11:02] [TRT-LLM] [I]   rougeLsum : 5.001144738190409
[04/02/2024-19:11:02] [TRT-LLM] [I]   Per-token perplexity: 5.8846661925315855
```

But the Gemma-7B engine could not generate even a single proper token.

```
# lm_head quantized int4-AWQ Gemma 7B result
[04/02/2024-15:09:58] [TRT-LLM] [I] TensorRT-LLM (total latency: 5.9773688316345215 sec)
[04/02/2024-15:09:58] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 4000)
[04/02/2024-15:09:58] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 669.1907614652237)
[04/02/2024-15:09:58] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[04/02/2024-15:09:58] [TRT-LLM] [I]   rouge1 : 0.0
[04/02/2024-15:09:58] [TRT-LLM] [I]   rouge2 : 0.0
[04/02/2024-15:09:58] [TRT-LLM] [I]   rougeL : 0.0
[04/02/2024-15:09:58] [TRT-LLM] [I]   rougeLsum : 0.0
[04/02/2024-15:09:58] [TRT-LLM] [I]   Per-token perplexity: inf

# lm_head not quantized int4-AWQ Gemma 7B result
[04/02/2024-15:09:23] [TRT-LLM] [I] TensorRT-LLM (total latency: 5.9766669273376465 sec)
[04/02/2024-15:09:23] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 3820)
[04/02/2024-15:09:23] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 639.152230907679)
[04/02/2024-15:09:23] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[04/02/2024-15:09:23] [TRT-LLM] [I]   rouge1 : 15.085234543827761
[04/02/2024-15:09:23] [TRT-LLM] [I]   rouge2 : 3.085591159231036
[04/02/2024-15:09:23] [TRT-LLM] [I]   rougeL : 10.889472087599273
[04/02/2024-15:09:23] [TRT-LLM] [I]   rougeLsum : 12.451584690196892
[04/02/2024-15:09:23] [TRT-LLM] [I]   Per-token perplexity: 7.5976266145706175
```

Expected behavior

Mentioned above

Actual behavior

Mentioned above

Additional notes

  1. When I applied quantization to the lm_head, it was only fake-quantized, at least with the int4 weight-only AWQ and int8 SmoothQuant configurations. The models tested were Gemma-2B, Gemma-7B, and Llama-2-7B.
  2. Using lm_head quantization with the Gemma models broke the output, which did not happen with the Llama-2 model. Also, how do lm_head and output_layer differ in the quantization configuration? What does output_layer refer to here?
Tracin commented 7 months ago

@dasistwo If `*lm_head*: enable` doesn't work, I think AMMO has probably removed the support. @RalphMao Do we have any approach to quantize lm_head now?

dasistwo commented 6 months ago

I think quantizing lm_head would free up some memory, especially for small models. As far as I can see, the trend in recently released models is toward huge vocabulary sizes, including Llama-3. @Tracin Why was the feature removed? Was a significant accuracy drop expected?
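As a rough illustration of that point (my own estimate, not from the thread; it assumes a vocabulary of 256000, Gemma-2B's hidden size of 2048, roughly 2.5B total parameters, and an untied lm_head tensor as in the converted checkpoint):

```python
# Approximate share of Gemma-2B parameters sitting in the lm_head alone.
# Assumptions (hypothetical, not from the thread): vocab 256000, hidden 2048,
# ~2.5e9 total parameters, lm_head kept as a separate (untied) fp16 tensor.
vocab, hidden, total_params = 256_000, 2048, 2.5e9
lm_head_params = vocab * hidden                                              # ~0.52B weights
print(f"lm_head share of parameters: {lm_head_params / total_params:.0%}")   # ~21%
print(f"fp16 lm_head size: {lm_head_params * 2 / 2**30:.2f} GiB")            # ~0.98 GiB
```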

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.