Open · Lzhang-hub opened 10 months ago
Recently I have also found a substantial difference between the HuggingFace outputs and the TensorRT-LLM outputs of Llama 2 13B, both in fp16 precision. It is hard to locate where the difference originates, since many factors are involved: the sampling algorithm, transformer architecture optimizations, paged attention, etc. However, in many cases I observe some degradation in output quality even with the greedy-search decoding strategy.
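One way to narrow down where the divergence starts is to compare the two greedy token streams position by position. Below is a minimal sketch, assuming the TensorRT-LLM token IDs have already been collected separately (e.g., dumped from the example run script); the model name, prompt, and the empty `trtllm_new` list are placeholders, not part of the original report:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders -- substitute the checkpoint and prompt you are testing with.
HF_MODEL = "meta-llama/Llama-2-13b-hf"
PROMPT = "The capital of France is"

tokenizer = AutoTokenizer.from_pretrained(HF_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    HF_MODEL, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer(PROMPT, return_tensors="pt").to(model.device)
# Greedy decoding (do_sample=False) removes sampling as a source of divergence.
out = model.generate(**inputs, do_sample=False, max_new_tokens=64)
hf_new = out[0, inputs["input_ids"].shape[1]:].tolist()

# Token IDs produced by TensorRT-LLM for the same prompt, collected separately.
# Hypothetical placeholder -- fill in with your dumped IDs.
trtllm_new: list = []

# Report the first decoding step at which the two greedy outputs disagree.
for step, (h, t) in enumerate(zip(hf_new, trtllm_new)):
    if h != t:
        print(f"first divergence at step {step}: hf={h} trt_llm={t}")
        break
else:
    print("no divergence on the compared prefix")
```

Once the first divergent step is known, comparing the logits at that step usually tells whether the gap comes from numerics (close logits, different argmax) or from a real implementation difference.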
Would you please try our latest code base to see whether the issue still exists?
Do you still have any further issues or questions? If not, we'll close this soon.
System Info
GPU: RTX 8000
Driver version: 525.85.05
CUDA version: 12.0
System: Ubuntu 20.04
Who can help?
No response
Information

Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
1. Build the Qwen engine.
2. Run inference: the results are very different from those of HF (a sketch for pinning down the HF side follows below).
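When comparing the two stacks, it helps to first pin the HF side to greedy decoding. A minimal sketch, assuming a Qwen chat checkpoint (the model name and prompt below are placeholders); note that Qwen checkpoints commonly ship a generation config with sampling enabled, which by itself makes an unpinned comparison differ run to run, an assumption worth verifying for your checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "Qwen/Qwen-7B-Chat" is a placeholder -- use the same checkpoint the
# TensorRT-LLM engine was built from.
MODEL = "Qwen/Qwen-7B-Chat"

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)
# Force greedy decoding so the HF reference is deterministic; otherwise the
# checkpoint's own generation config may silently enable sampling.
out = model.generate(**inputs, do_sample=False, max_new_tokens=64)
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

If both sides are run with greedy decoding on the same prompt and still diverge, the remaining gap points to the engine rather than to the decoding strategy.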
Expected behavior
The TensorRT-LLM results should be close to the HuggingFace results.
actual behavior
The TensorRT-LLM outputs for Qwen differ significantly from the HuggingFace outputs.
additional notes
I found a related issue: https://github.com/NVIDIA/TensorRT-LLM/issues/836. Looking forward to your team's attention and progress.