Non-blank output in the `text_output` field. Example from another TensorRT-LLM engine with a different quantization:

```json
{
  "context_logits": 0.0,
  "cum_log_probs": 0.0,
  "generation_logits": 0.0,
  "model_name": "ensemble_meta_llama_3_1_70B_instruct",
  "model_version": "1",
  "output_log_probs": [0.0, 0.0, 0.0, 0.0, 0.0],
  "sequence_end": false,
  "sequence_id": 0,
  "sequence_start": false,
  "text_output": "assistant\n\nMachine learning is a subfield of artificial intelligence (AI) that involves the use of algorithms and statistical models to enable machines to perform a specific task without using explicit instructions, instead relying on patterns and inference. In traditional programming, a computer is given a set of rules and data, and it follows those rules to produce a result. In contrast, machine learning involves training a model on data, so it can learn the rules and make predictions or decisions on its own.\n\nMachine learning is based on the idea that machines can learn from data and improve their performance on a task over time, without being explicitly programmed for that task. This"
}
```
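A quick way to confirm the symptom is to inspect the `text_output` field of the generate response programmatically. A minimal sketch (field names taken from the response above; the response body here is abridged for illustration):

```python
import json

# Abridged example response body, as returned by the Triton generate endpoint.
# A healthy engine returns non-empty generated text; the buggy w4a8_awq
# engine returns an empty string in "text_output".
response_body = '''
{"model_name": "ensemble_meta_llama_3_1_70B_instruct",
 "model_version": "1",
 "text_output": "assistant\\n\\nMachine learning is a subfield of artificial intelligence..."}
'''

resp = json.loads(response_body)
text = resp["text_output"]
print("blank" if not text.strip() else "ok")
```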
### System Info

- `tensorrt_llm` 0.12.0.dev2024073000
- CUDA 12.4
- H100-PCIe
### Who can help?

@Tracin @byshiue
### Information

### Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

### Reproduction
Spawned a Triton server and made a curl request:
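The exact curl command is not included in the report; a typical request to Triton's generate endpoint looks roughly like the following sketch (the model name is taken from the response above; the prompt, port, and parameter values are assumptions):

```shell
# Hypothetical payload; field names follow Triton's generate extension
PAYLOAD='{"text_input": "What is machine learning?", "max_tokens": 128}'

# The actual request would look like this (commented out, since no server
# is running here):
# curl -s -X POST localhost:8000/v2/models/ensemble_meta_llama_3_1_70B_instruct/generate \
#      -d "$PAYLOAD"

# Sanity-check that the payload is well-formed JSON
echo "$PAYLOAD" | python3 -c 'import json, sys; json.load(sys.stdin); print("payload ok")'
```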
### Expected behavior

Non-blank output in the `text_output` field (see the example above from another TensorRT-LLM engine with a different quantization).
### actual behavior

Blank output in the `text_output` field.
### additional notes

I see the same issue with `w4a8_awq` quantization + FP16 KV cache. However, a model with `int4_awq` quantization + FP16 KV cache works, so the problem appears to be specific to `w4a8_awq` quantization.