Non-blank output in the `text_output` field. Example from another TensorRT-LLM engine with a different quantization:

```json
{
  "context_logits": 0.0,
  "cum_log_probs": 0.0,
  "generation_logits": 0.0,
  "model_name": "ensemble_meta_llama_3_1_70B_instruct",
  "model_version": "1",
  "output_log_probs": [0.0, 0.0, 0.0, 0.0, 0.0],
  "sequence_end": false,
  "sequence_id": 0,
  "sequence_start": false,
  "text_output": "assistant\n\nMachine learning is a subfield of artificial intelligence (AI) that involves the use of algorithms and statistical models to enable machines to perform a specific task without using explicit instructions, instead relying on patterns and inference. In traditional programming, a computer is given a set of rules and data, and it follows those rules to produce a result. In contrast, machine learning involves training a model on data, so it can learn the rules and make predictions or decisions on its own.\n\nMachine learning is based on the idea that machines can learn from data and improve their performance on a task over time, without being explicitly programmed for that task. This"
}
```
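A quick way to confirm the symptom is to inspect the `text_output` field of the generate response programmatically. A minimal sketch (field names taken from the response above; the response body here is abridged for illustration):

```python
import json

# Abridged example response body, as returned by the Triton generate endpoint.
# A healthy engine returns non-empty generated text; the buggy w4a8_awq
# engine returns an empty string in "text_output".
response_body = '''
{"model_name": "ensemble_meta_llama_3_1_70B_instruct",
 "model_version": "1",
 "text_output": "assistant\\n\\nMachine learning is a subfield of artificial intelligence..."}
'''

resp = json.loads(response_body)
text = resp["text_output"]
print("blank" if not text.strip() else "ok")
```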
### System Info

- `tensorrt_llm` 0.12.0.dev2024073000
- CUDA 12.4
- H100-PCIe
### Who can help?

@Tracin @byshiue
### Information

### Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

### Reproduction
Spawned a Triton server and made a curl request:
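The exact curl command is not included in the report; a typical request to Triton's generate endpoint looks roughly like the following sketch (the model name is taken from the response above; the prompt, port, and parameter values are assumptions):

```shell
# Hypothetical payload; field names follow Triton's generate extension
PAYLOAD='{"text_input": "What is machine learning?", "max_tokens": 128}'

# The actual request would look like this (commented out, since no server
# is running here):
# curl -s -X POST localhost:8000/v2/models/ensemble_meta_llama_3_1_70B_instruct/generate \
#      -d "$PAYLOAD"

# Sanity-check that the payload is well-formed JSON
echo "$PAYLOAD" | python3 -c 'import json, sys; json.load(sys.stdin); print("payload ok")'
```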
### Expected behavior

Non-blank output in the `text_output` field (see the example above from another TensorRT-LLM engine with a different quantization).
### actual behavior

Blank output in the `text_output` field.
### additional notes

I see the same issue with `w4a8_awq` quantization + FP16 KV cache. However, a model with `int4_awq` quantization + FP16 KV cache works, so the problem appears to be specific to `w4a8_awq` quantization.