Open whitley0 opened 4 months ago
I guess it may be because tokens_per_block is too large; maybe you can try 64.
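A rough way to see why a smaller tokens_per_block might help is sketched below. It assumes that KV-cache reuse works on whole blocks only, so a partially filled block at the end of a shared prefix is recomputed. The function name and the 100-token prefix length are hypothetical and chosen only for illustration.

```python
def reusable_prefix_tokens(shared_prefix_len: int, tokens_per_block: int) -> int:
    """Return how many tokens of a shared prefix fall inside complete blocks.

    Assumption: only completely filled blocks can be reused, so the
    remainder of the prefix has to be recomputed on the next request.
    """
    if tokens_per_block <= 0:
        raise ValueError("tokens_per_block must be positive")
    return (shared_prefix_len // tokens_per_block) * tokens_per_block


# Hypothetical 100-token shared prefix, compared across block sizes.
for block_size in (128, 64, 32):
    print(block_size, reusable_prefix_tokens(100, block_size))
```

Under that assumption, a prefix shorter than one block gives no reuse at all, which is one reason to test a smaller block size.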
Thanks. I've found that the second request being slower is caused by a problem with KV-cache reuse. When the system prompt contains two duplicate segments, the output is incorrect.
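curl -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{ "text_input": "<|im_start|>system\n请用二次元可爱语气和我说话\n请用二次元可爱语气和我说话\n<|im_end|>\n<|im_start|>user\n用户说:这个月月底发工资的,应该有个2000多,我现在手上凑到了2000多。<|im_end|>\n<|im_start|>assistant\n", "max_tokens": 32, "bad_words": "", "stop_words": "", "end_id": [ 151645 ], "pad_id": [ 151645 ] }'

(The system prompt repeats the line "Please talk to me in a cute anime-style tone" twice. The user message says: "I get paid at the end of this month, it should be a bit over 2000, and I have scraped together a bit over 2000 so far.")

This request gets the output

"maté9的 and and and的 burn是是是是是及的的的的的的是的是的 structure的=的的的的的"

but it should be

"哎呀呀,月底就要迎来小钱包鼓鼓的时刻啦,大概会有2000多的新朋友来陪你哦!你现在也攒到了"

(roughly: "Oh my, the end of the month will bring the moment your little wallet fills up, with about 2000-plus new friends coming to keep you company! You have also saved up...")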
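To compare the first and second request, the same payload can be sent twice and timed. The following is a minimal sketch using only the Python standard library. The URL and payload fields are taken from the curl command above; the helper names (`build_payload`, `post_generate`, `time_repeated`, `main`) are hypothetical, and reading the result from a `text_output` field is an assumption about the ensemble's response format.

```python
import json
import time
import urllib.request

# Endpoint taken from the curl command in this thread.
URL = "http://localhost:8000/v2/models/ensemble/generate"

PROMPT = (
    "<|im_start|>system\n请用二次元可爱语气和我说话\n请用二次元可爱语气和我说话\n<|im_end|>\n"
    "<|im_start|>user\n用户说:这个月月底发工资的,应该有个2000多,我现在手上凑到了2000多。<|im_end|>\n"
    "<|im_start|>assistant\n"
)


def build_payload(text_input, max_tokens=32):
    """Build the same JSON body as the curl command."""
    return {
        "text_input": text_input,
        "max_tokens": max_tokens,
        "bad_words": "",
        "stop_words": "",
        "end_id": [151645],
        "pad_id": [151645],
    }


def post_generate(payload, url=URL):
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def time_repeated(payload, send=post_generate, repeats=2):
    """Send the identical payload `repeats` times; return (seconds, response) pairs."""
    results = []
    for _ in range(repeats):
        start = time.perf_counter()
        response = send(payload)
        results.append((time.perf_counter() - start, response))
    return results


def main():
    """Print latency and output for each identical request (needs a running server)."""
    for index, (seconds, response) in enumerate(time_repeated(build_payload(PROMPT)), 1):
        # "text_output" is assumed to be the ensemble's output field name.
        print(f"request {index}: {seconds:.3f}s -> {response.get('text_output')!r}")
```

Calling `main()` with the server running prints one line per request, so a slower or garbled second response shows up directly next to the first.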
@whitley0 If you have no further questions, we will close this issue in one week.
System Info
- GPU: NVIDIA A100-SXM4-40GB
- tensorrt_llm: v0.10.0
- tensorrtllm_backend: v0.10.0
- docker: nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
none
Actual behavior
none
Additional notes
The second request is much slower than the first.