NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Empty outputs with TRT engine built with W4A8, FP8 KV cache #2133


dhruvmullick commented 3 weeks ago

System Info

tensorrt_llm: 0.12.0.dev2024073000
CUDA: 12.4
GPU: H100-PCIe

Who can help?

@Tracin @byshiue

Reproduction

python3 quantize.py --model_dir meta_llama_3_1_70B_instruct/fp16 \
  --dtype float16 \
  --qformat w4a8_awq \
  --kv_cache_dtype fp8 \
  --awq_block_size 128 \
  --output_dir /tmp/trt_checkpoint \
  --batch_size 8 \
  --calib_size 32
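
Before building, it can be worth confirming what quantize.py actually wrote. A minimal sanity check, assuming the usual TensorRT-LLM checkpoint layout in which config.json records the quantization settings (path taken from the command above):

python3 -c 'import json; print(json.load(open("/tmp/trt_checkpoint/config.json"))["quantization"])'
# expected to report quant_algo W4A8_AWQ and kv_cache_quant_algo FP8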
trtllm-build --checkpoint_dir /tmp/trt_checkpoint \
        --gemm_plugin auto \
        --gpt_attention_plugin auto \
        --paged_kv_cache enable \
        --remove_input_padding enable \
        --context_fmha enable \
        --use_fused_mlp \
        --max_seq_len 16000 \
        --max_num_tokens 16384 \
        --max_batch_size 8 \
        --output_dir w4a8_kvfp8 \
        --log_level verbose
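
To separate the engine from the Triton setup, the engine can also be queried directly with the example runner shipped in the TensorRT-LLM repository. A sketch, assuming examples/run.py is available and reusing the tokenizer from the original model directory:

python3 examples/run.py \
  --engine_dir w4a8_kvfp8 \
  --tokenizer_dir meta_llama_3_1_70B_instruct/fp16 \
  --input_text "What is machine learning?" \
  --max_output_len 128

If this also returns empty text, the problem is in the engine itself rather than in the Triton ensemble.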

Spawned a Triton server and made a curl request:

curl -X POST localhost:8000/v2/models/ensemble_meta_llama_3_1_70B_instruct/generate -d '{"text_input": "What is machine learning?", "max_tokens": 128, "bad_words": "", "stop_words": "", "pad_id": 128004, "end_id": 128009, "beam_width": 1}'

Expected behavior

A non-blank "text_output" field. For example, here is the response from another TensorRT-LLM engine built with a different quantization:

{"context_logits":0.0,
"cum_log_probs":0.0,
"generation_logits":0.0,
"model_name":"ensemble_meta_llama_3_1_70B_instruct",
"model_version":"1",
"output_log_probs":[0.0,0.0,0.0,0.0,0.0],
"sequence_end":false,
"sequence_id":0,
"sequence_start":false,
"text_output":"assistant\n\nMachine learning is a subfield of artificial intelligence (AI) that involves the use of algorithms and statistical models to enable machines to perform a specific task without using explicit instructions, instead relying on patterns and inference. In traditional programming, a computer is given a set of rules and data, and it follows those rules to produce a result. In contrast, machine learning involves training a model on data, so it can learn the rules and make predictions or decisions on its own.\n\nMachine learning is based on the idea that machines can learn from data and improve their performance on a task over time, without being explicitly programmed for that task. This"}

Actual behavior

Blank "text_output" field in the response:

{"context_logits":0.0,
"cum_log_probs":0.0,
"generation_logits":0.0,
"model_name":"ensemble_meta_llama_3_1_70B_instruct",
"model_version":"1",
"output_log_probs":[0.0,0.0,0.0,0.0,0.0],
"sequence_end":false,
"sequence_id":0,
"sequence_start":false,
"text_output":""}

Additional notes

I see the same issue with w4a8_awq quantization + an fp16 KV cache.

However, an engine built with int4_awq quantization + an fp16 KV cache works.

So the problem appears to be specific to the w4a8_awq quantization path rather than to the FP8 KV cache; a sketch of the working control run is below.
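
For reference, a sketch of that control run (int4_awq weights; the KV cache stays fp16 because --kv_cache_dtype is omitted; the output path is hypothetical):

python3 quantize.py --model_dir meta_llama_3_1_70B_instruct/fp16 \
  --dtype float16 \
  --qformat int4_awq \
  --awq_block_size 128 \
  --output_dir /tmp/trt_checkpoint_int4 \
  --batch_size 8 \
  --calib_size 32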

Barry-Delaney commented 1 week ago

Thanks for the feedback @dhruvmullick. This is a known issue in ModelOpt (which quantize.py calls under the hood). Once it is fixed, we will update this thread.