NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
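As a quick illustration of the Python API described above, here is a minimal sketch using the high-level LLM entry point that ships with recent TensorRT-LLM releases. The model name and prompt are placeholders, and the exact API surface depends on the installed version.

# Sketch of the high-level TensorRT-LLM Python API (recent releases only).
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen-7B")  # placeholder model; builds or loads a TensorRT engine internally
params = SamplingParams(max_tokens=50)
for out in llm.generate(["Hello, my name is"], params):
    print(out.outputs[0].text)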

Qwen14B model result for a long prompt differs from the HF result #958

Open Lzhang-hub opened 7 months ago

Lzhang-hub commented 7 months ago

System Info

GPU: RTX 8000
Driver version: 525.85.05
CUDA version: 12.0
System: Ubuntu 20.04

Who can help?

No response

Information

Tasks

Reproduction

1. Build the Qwen engine:

python build.py --hf_model_dir ./tmp/Qwen/7B/ \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir ./tmp/Qwen/7B/trt_engines/fp16/1-gpu/

2. Run inference with a long prompt:

python3 ../run.py --input_text "long ............................." \
                  --max_output_len=50 \
                  --tokenizer_dir ./tmp/Qwen/7B/ \
                  --engine_dir=./tmp/Qwen/7B/trt_engines/int8_kv_cache_weight_only/1-gpu

The results are very different from the Hugging Face (HF) results.
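For reference, a minimal Hugging Face baseline with greedy decoding might look like the sketch below. It assumes ./tmp/Qwen/7B/ is a standard Qwen HF checkpoint and reuses the (truncated) long prompt from step 2.

# Sketch: Hugging Face reference generation with greedy decoding,
# for comparison against the run.py output above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./tmp/Qwen/7B/"  # same checkpoint used to build the engine
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.float16, trust_remote_code=True
).cuda().eval()

prompt = "long ............................."  # the long prompt from step 2
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))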

Expected behavior

The results should be close to the HF results.

Actual behavior


Additional notes

I found a related issue: https://github.com/NVIDIA/TensorRT-LLM/issues/836. Looking forward to your team's attention and progress.

kisseternity commented 7 months ago

Recently I have also found there's a pretty much difference between huggingface outputs and trt_llm outputs of llama2 13B, both in fp16 precision. It's pretty hard to locate where the difference originates from. So many factors, including the sampling algorithm, transformer architecture optimization, paged attention, etc. However, I observe there's some degradation in the quality of outputs in many cases in the greedy search decoding strategy.