NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

kv-cache reuse in Qwen-72B-Chat is slower when tp>1 #1915

Open whitley0 opened 2 weeks ago

whitley0 commented 2 weeks ago

System Info

- GPU: NVIDIA A100-SXM4-40GB
- tensorrt_llm: v0.10.0
- tensorrtllm_backend: v0.10.0
- Docker image: nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3

Who can help?

No response

Reproduction

  1. python3 convert_checkpoint.py --model_dir ${path_to_Qwen_72B} --output_dir ${path_to_trt_checkpoint} --dtype float16 --tp_size 8
  2. trtllm-build --checkpoint_dir ${path_to_trt_checkpoint} \
       --gemm_plugin float16 \
       --max_input_len 4096 \
       --max_output_len 1024 \
       --output_dir ${path_to_engine_dir} \
       --paged_kv_cache enable \
       --remove_input_padding enable \
       --use_paged_context_fmha enable \
       --tokens_per_block 1024
  3. Launch the Triton server from tensorrtllm_backend.
  4. curl -X POST localhost:9000/v2/models/ensemble/generate \ -d '{"text_input": "<|im_start|>system\nI will provide some topics related to the study of philosophy, and it will be your job to explain these concepts in an easy-to-understand manner. This could include providing examples, posing questions or breaking down complex ideas into smaller pieces that are easier to comprehend. My first request is \"I need help understanding how different philosophical theories can be applied in everyday life.\"I want you to act as a math teacher. I will provide some mathematical equations or concepts, and it will be your job to explain them in easy-to-understand terms. This could include providing step-by-step instructions for solving a problem, demonstrating various techniques with visuals or suggesting online resources for further study. My first request is \"I need help understanding how probability works.\"I want you to act as a math teacher. I will provide some mathematical equations or concepts, and it will be your job to explain them in easy-to-understand terms. This could include providing step-by-step instructions for solving a problem, demonstrating various techniques with visuals or suggesting online resources for further study.I want you to act as a philosophy teacher. I will provide some topics related to the study of philosophy, and it will be your job to explain these concepts in an easy-to-understand manner. This could include providing examples, posing questions or breaking down complex ideas into smaller pieces that are easier to comprehend. My first request is \"I need help understanding how different philosophical theories can be applied in everyday life.\"I want you to act as a math teacher. I will provide some mathematical equations or concepts, and it will be your job to explain them in easy-to-understand terms. 
This could include providing step-by-step instructions for solving a problem, demonstrating various techniques with visuals or suggesting online resources for further study. My first request is \"I need help understanding how probability works.\"I want you to act as a math teacher. I will provide some mathematical equations or concepts, and it will be your job to explain them in easy-to-understand terms. This could include providing step-by-step instructions for solving a problem, demonstrating various techniques with visuals or suggesting online resources for further study. My first request is \"I need help understanding how probability works.\".<|im_end|>\n<|im_start|>user\n你好,你叫什么?<|im_end|>\n<|im_start|>assistant\n", "max_tokens": 100, "bad_words": "", "stop_words": "", "end_id": [151645], "pad_id": [151645]}'

     (The prompt text is meaningless; it is only there for testing. The user turn 你好,你叫什么? means "Hello, what is your name?")

Expected behavior

none

Actual behavior

none

Additional notes

With KV-cache reuse enabled, the second request is much slower than the first.
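To put numbers on the slowdown, the same payload can be sent twice and each call timed. The following is only a minimal sketch, not part of the original report: the helper names `post_json` and `time_repeated_requests` are made up for illustration, and the URL assumes the port used in the curl command from step 4.

```python
import json
import time
import urllib.request

# Assumption: same host, port and route as the curl command in step 4.
URL = "http://localhost:9000/v2/models/ensemble/generate"


def post_json(url, payload, timeout=300):
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def time_repeated_requests(send, payload, repeats=2):
    """Send the same payload `repeats` times.

    `send` is any callable that takes the payload and returns a response.
    Returns (latencies_in_seconds, responses), both in request order.
    """
    latencies, responses = [], []
    for _ in range(repeats):
        start = time.perf_counter()
        responses.append(send(payload))
        latencies.append(time.perf_counter() - start)
    return latencies, responses
```

With the server running, `time_repeated_requests(lambda p: post_json(URL, p), payload)` would return the latency of the first and second request, where `payload` is the JSON body from step 4 as a Python dict.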

RobotGF commented 2 weeks ago

I guess it may be because tokens_per_block is too large; maybe you can try 64.
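A rough way to see why the block size could matter, assuming that cache reuse only applies to completely filled blocks of a shared prefix (that is my understanding of block-based reuse, not something confirmed in this thread). The function name and the 700-token prefix length are hypothetical and only for illustration.

```python
def reusable_prefix_tokens(shared_prefix_len, tokens_per_block):
    """Tokens of a shared prefix that sit in completely filled blocks.

    Assumption: a partially filled block cannot be reused, so only
    whole multiples of tokens_per_block count.
    """
    if tokens_per_block <= 0:
        raise ValueError("tokens_per_block must be positive")
    return (shared_prefix_len // tokens_per_block) * tokens_per_block


# Hypothetical 700-token shared prompt prefix.
for block_size in (1024, 64):
    print(block_size, reusable_prefix_tokens(700, block_size))
# prints:
# 1024 0
# 64 640
```

Under that assumption, a prompt shorter than 1024 tokens would never fill a single block with `--tokens_per_block 1024`, so nothing could be reused, while a block size of 64 would cover most of the prefix.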

whitley0 commented 2 weeks ago

> I guess it may be because tokens_per_block is too large; maybe you can try 64.

Thanks. I've found that the second request is slower because of a problem with KV-cache reuse. When the system prompt contains two duplicate segments, the output is incorrect.

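curl -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{ "text_input": "<|im_start|>system\n请用二次元可爱语气和我说话\n请用二次元可爱语气和我说话\n<|im_end|>\n<|im_start|>user\n用户说:这个月月底发工资的,应该有个2000多,我现在手上凑到了2000多。<|im_end|>\n<|im_start|>assistant\n", "max_tokens": 32, "bad_words": "", "stop_words": "", "end_id": [ 151645 ], "pad_id": [ 151645 ] }'

(The system prompt repeats the line "Please talk to me in a cute anime-style voice" twice. The user turn reads: "The user says: I get paid at the end of this month, it should be a bit over 2000, and I have now scraped together a bit over 2000.")

This request returns the output

"maté9的 and and and的 burn是是是是是及的的的的的的是的是的 structure的=的的的的的"

but it should be

"哎呀呀,月底就要迎来小钱包鼓鼓的时刻啦,大概会有2000多的新朋友来陪你哦!你现在也攒到了"

(roughly: "Oh my, the end of the month is when your little wallet gets nice and full, and about 2000-something new friends will come to keep you company! You have also saved up", cut off at 32 tokens).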
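For anyone who wants to script this check, here is a small sketch that builds the same request body in Python (which avoids shell quoting problems) and finds the first response whose text differs from the first one. It is not from the original report: `build_payload` and `first_mismatch` are hypothetical helper names, and reading the generated text from a `text_output` field is an assumption about the response format. Identical outputs are only expected if decoding is deterministic.

```python
SYSTEM_LINE = "请用二次元可爱语气和我说话"  # the duplicated system-prompt segment


def build_payload(system_lines, user_text, max_tokens=32, end_id=151645):
    """Build the request body in the same chat format as the curl command."""
    system = "".join(line + "\n" for line in system_lines)
    text_input = (
        "<|im_start|>system\n" + system + "<|im_end|>\n"
        "<|im_start|>user\n" + user_text + "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    return {
        "text_input": text_input,
        "max_tokens": max_tokens,
        "bad_words": "",
        "stop_words": "",
        "end_id": [end_id],
        "pad_id": [end_id],
    }


def first_mismatch(responses, key="text_output"):
    """Index of the first response whose text differs from response 0.

    Returns None if all responses agree (or the list is empty).
    Assumption: the generated text is stored under `key`.
    """
    if not responses:
        return None
    reference = responses[0].get(key)
    for index, response in enumerate(responses[1:], start=1):
        if response.get(key) != reference:
            return index
    return None
```

Calling `build_payload([SYSTEM_LINE, SYSTEM_LINE], user_text)` gives the duplicated-segment case, and `build_payload([SYSTEM_LINE], user_text)` gives a comparison case without the repeat.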