NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
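As a quick illustration of the Python API described above, here is a minimal sketch using the high-level LLM entry point that ships with recent TensorRT-LLM releases. The model name and prompt are placeholders, and the exact API surface depends on the installed version.

# Sketch of the high-level TensorRT-LLM Python API (recent releases only).
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen-7B")  # placeholder model; builds or loads a TensorRT engine internally
params = SamplingParams(max_tokens=50)
for out in llm.generate(["Hello, my name is"], params):
    print(out.outputs[0].text)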

Qwen14B model result for a long prompt differs from the HF result #958

Open Lzhang-hub opened 7 months ago

Lzhang-hub commented 7 months ago

System Info

GPU: RTX 8000
Driver version: 525.85.05
CUDA version: 12.0
System: Ubuntu 20.04

Who can help?

No response

Information

Tasks

Reproduction

1. Build the Qwen engine:

python build.py --hf_model_dir ./tmp/Qwen/7B/ \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir ./tmp/Qwen/7B/trt_engines/fp16/1-gpu/

2. Run inference with a long prompt:

python3 ../run.py --input_text "long ............................." \
                  --max_output_len=50 \
                  --tokenizer_dir ./tmp/Qwen/7B/ \
                  --engine_dir=./tmp/Qwen/7B/trt_engines/int8_kv_cache_weight_only/1-gpu

The results are very different from the Hugging Face (HF) results.
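For reference, a minimal Hugging Face baseline with greedy decoding might look like the sketch below. It assumes ./tmp/Qwen/7B/ is a standard Qwen HF checkpoint and reuses the (truncated) long prompt from step 2.

# Sketch: Hugging Face reference generation with greedy decoding,
# for comparison against the run.py output above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./tmp/Qwen/7B/"  # same checkpoint used to build the engine
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.float16, trust_remote_code=True
).cuda().eval()

prompt = "long ............................."  # the long prompt from step 2
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))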

Expected behavior

The results should be close to the HF results.

Actual behavior


Additional notes

I found a related issue: https://github.com/NVIDIA/TensorRT-LLM/issues/836. Looking forward to your team's attention and progress.

kisseternity commented 7 months ago

Recently I have also found there's a pretty much difference between huggingface outputs and trt_llm outputs of llama2 13B, both in fp16 precision. It's pretty hard to locate where the difference originates from. So many factors, including the sampling algorithm, transformer architecture optimization, paged attention, etc. However, I observe there's some degradation in the quality of outputs in many cases in the greedy search decoding strategy.