NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
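As a minimal illustration of the Python API described above, a sketch along these lines should work with recent releases that expose the high-level `LLM` entry point (the exact import path and constructor arguments have varied across versions, so treat this as an assumption, not a guaranteed recipe):

```python
# Minimal sketch of the high-level API (assumes a recent TensorRT-LLM release
# exporting LLM and SamplingParams from the top-level package).
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-7B-Instruct")        # builds or loads a TensorRT engine
params = SamplingParams(max_tokens=32, top_k=1)  # greedy-like decoding
for out in llm.generate(["Hello, world"], params):
    print(out.outputs[0].text)
```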

There are differences in the results of Qwen2-7B-Instruct #2032

Closed · skyCreateXian closed this issue 1 month ago

skyCreateXian commented 3 months ago

System Info

- GPU: L20
- TensorRT-LLM: v0.11.0
- transformers: 4.42.0

Who can help?

@ncomly-nvidia @kaiyux

The test prompt is '你好，请介绍一下喜马拉雅山的详细信息' ("Hello, please give a detailed introduction to the Himalayas").

1. transformers

Generation parameters:

```python
generation_config = GenerationConfig(
    top_k=1,
    temperature=1,
    max_length=2048,
    max_new_tokens=80,
    repetition_penalty=1.0,
    early_stopping=True,
    do_sample=True,
    num_beams=1,
    top_p=1,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
```

transformers result (original Chinese output, truncated at max_new_tokens):

```
喜马拉雅山(Himalayas)是地球上最高的山脉，位于亚洲南部，横跨中国、印度、尼泊尔、不丹、巴基斯坦和阿富汗等国家。以下是关于喜马拉雅山的一些详细信息：

地理位置与范围

喜马拉雅山脉从中国西藏的喜马拉雅山脉开始，向南延伸至印度的喜马拉雅山脉，
```

English translation: "The Himalayas are the highest mountain range on Earth, located in southern Asia and spanning China, India, Nepal, Bhutan, Pakistan, Afghanistan, and other countries. Here is some detailed information about the Himalayas: Geographic location and extent. The Himalayan range begins in the Himalayas of China's Tibet and extends south to the Himalayas of India,"
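For context, a self-contained sketch of this transformers baseline might look like the following; it folds in the input construction from step 3 and assumes the local model path used in the build step below:

```python
# Hedged sketch of the transformers baseline, assuming the model lives at
# /mnt/qwen2/Qwen2-7B-Instruct as in the engine-build step (step 4).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_dir = "/mnt/qwen2/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.float16, device_map="cuda"
)

messages = [{"role": "user", "content": "你好，请介绍一下喜马拉雅山的详细信息"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer(text, return_tensors="pt", add_special_tokens=False)["input_ids"]

generation_config = GenerationConfig(
    top_k=1, temperature=1, max_length=2048, max_new_tokens=80,
    repetition_penalty=1.0, early_stopping=True, do_sample=True,
    num_beams=1, top_p=1,
    pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id,
)
out = model.generate(input_ids.to(model.device), generation_config=generation_config)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][input_ids.shape[1]:], skip_special_tokens=True))
```

Note that `do_sample=True` with `top_k=1` is effectively greedy decoding, which is why the comparison against TensorRT-LLM's `top_k=1` is meaningful at all.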

2. TensorRT-LLM

Generation parameters:

```python
batch_input_ids=input_ids,
max_new_tokens=80,
end_id=tokenizer.eos_token_id,
pad_id=tokenizer.pad_token_id,
top_k=1,
```

TensorRT-LLM result (original Chinese output, truncated at max_new_tokens):

```
你好!喜马拉雅山(Himalayas)是地球上最壮观的山脉之一，位于亚洲南部，横跨中国、印度、尼泊尔、不丹、巴基斯坦和阿富汗等国家。以下是关于喜马拉雅山的一些详细信息：

地理位置与范围

喜马拉雅山脉从中国西藏的喜马拉雅山脉开始，向南延伸至印度的
```

English translation: "Hello! The Himalayas are one of the most spectacular mountain ranges on Earth, located in southern Asia and spanning China, India, Nepal, Bhutan, Pakistan, Afghanistan, and other countries. Here is some detailed information about the Himalayas: Geographic location and extent. The Himalayan range begins in the Himalayas of China's Tibet and extends south to India's"

The divergence starts at the very first tokens: transformers opens with "喜马拉雅山(Himalayas)是地球上最高的山脉" ("the highest mountain range"), while TensorRT-LLM opens with "你好!……最壮观的山脉之一" ("Hello! ... one of the most spectacular mountain ranges").
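For comparison, the TensorRT-LLM side plausibly follows the pattern of the repo's examples/run.py, roughly like this (ModelRunner lives in tensorrt_llm.runtime in the v0.11 era; exact kwargs and return shapes may differ slightly between releases):

```python
# Plausible sketch modeled on examples/run.py from the TensorRT-LLM repo
# (v0.11-era API; ./fp16 is the engine directory built in step 4 below).
import torch
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir(engine_dir="./fp16")
with torch.no_grad():
    output_ids = runner.generate(
        batch_input_ids=[input_ids[0]],   # list of 1-D token tensors
        max_new_tokens=80,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.pad_token_id,
        top_k=1,
    )
# output_ids has shape [batch, num_beams, seq_len]; strip the prompt before decoding.
gen = output_ids[0][0][input_ids.shape[1]:]
print(tokenizer.decode(gen, skip_special_tokens=True))
```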

3. How the input_ids are created

```python
prompt = '你好，请介绍一下喜马拉雅山的详细信息'
messages = [{"role": "user", "content": prompt}]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer(prompt, truncation=True, return_tensors="pt", add_special_tokens=False)['input_ids']
```
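One sanity check worth running here (my suggestion, not part of the original report) is to confirm that both runtimes really consume identical token IDs before comparing any outputs:

```python
# Suggested sanity check: both frameworks must see byte-identical token IDs,
# otherwise any output comparison is meaningless.
ids = input_ids[0].tolist()
print(f"{len(ids)} prompt tokens, first 10: {ids[:10]}")
# Round-trip decode; with Qwen2's chat template the text should start with <|im_start|>.
print(tokenizer.decode(ids, skip_special_tokens=False)[:80])
```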

4、build Qwen2-7B engine

` python convert_checkpoint.py --model_dir /mnt/qwen2/Qwen2-7B-Instruct \ --output_dir checkpoint \ --dtype float16

trtllm-build --checkpoint_dir ./checkpoint \ --output_dir ./fp16 \ --gemm_plugin float16 `
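Alternatively, the stock example script can drive the built engine directly, which rules out discrepancies introduced by hand-rolled runner code (the paths below are assumptions matching the build step above):

```bash
# Reproduce with the repo's example runner instead of custom code.
python3 examples/run.py \
    --engine_dir ./fp16 \
    --tokenizer_dir /mnt/qwen2/Qwen2-7B-Instruct \
    --max_output_len 80 \
    --input_text "你好，请介绍一下喜马拉雅山的详细信息"
```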


Reproduction

  1. Run transformers and TensorRT-LLM separately on the same input.
  2. Compare the generated tokens for the same prompt; differences will appear (a small comparison helper is sketched below).
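A helper like the following (hypothetical, not from the original report) makes the comparison concrete by reporting the first position where the two generations disagree:

```python
# Hypothetical comparison helper: given the generated token IDs from each
# framework, report where they first diverge.
def first_divergence(hf_tokens: list[int], trt_tokens: list[int]) -> int | None:
    """Return the index of the first differing token, or None if fully aligned."""
    for i, (a, b) in enumerate(zip(hf_tokens, trt_tokens)):
        if a != b:
            return i
    if len(hf_tokens) != len(trt_tokens):
        return min(len(hf_tokens), len(trt_tokens))
    return None

# Example usage with the generated IDs from steps 1 and 2:
# idx = first_divergence(hf_out_ids, trt_out_ids)
# if idx is not None:
#     print(f"first mismatch at generated token {idx}")
```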

Expected behavior

I expect the Qwen2 outputs from TensorRT-LLM and transformers to be perfectly aligned.

Actual behavior

1. There are some differences in the results.
2. Across many test cases, approximately 5-10% of outputs are not fully aligned.

Additional notes

None.

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] commented 1 month ago

This issue was closed because it has been stalled for 15 days with no activity.