Can you try turning off in-flight batching? This may be a known bug in in-flight batching that is fixed in the latest commit.
I already turned off in-flight batching by setting `batching_strategy:v1` in tensorrt_llm/config.pbtxt. Or do you mean changing parameters while building the TensorRT engine? By the way, I saw someone else get correct answers using almost the same command lines as mine, except he used version 0.8.0 of the TensorRT-LLM & tensorrtllm_backend repositories while mine is 0.9.0.
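For reference, a minimal sketch of how that value is usually substituted with the backend's fill_template.py helper, assuming the standard tensorrtllm_backend layout (the config path is a placeholder, and other template variables are omitted):

```bash
# Hypothetical substitution; the config.pbtxt path and any additional
# template variables depend on your model repository layout.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "batching_strategy:v1"
```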
Thanks. Please try disabling these options; I think one of them may be causing the problem:

```
--context_fmha enable --paged_kv_cache enable \
--streamingllm enable \
--use_paged_context_fmha enable --enable_chunked_context \
--use_context_fmha_for_generation enable \
```

In addition, I recommend compiling TRT-LLM from the main branch, because it contains the latest bug fixes.
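A hypothetical trtllm-build invocation with those options turned off might look like the following; the checkpoint and output paths are placeholders, and --enable_chunked_context is simply omitted since it is a standalone switch:

```bash
# Sketch only: paths are placeholders and flag availability may vary by version.
trtllm-build --checkpoint_dir ./ckpt \
    --output_dir ./engine \
    --context_fmha disable \
    --paged_kv_cache disable \
    --streamingllm disable \
    --use_paged_context_fmha disable \
    --use_context_fmha_for_generation disable
```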
Thank you for your guidance! After disabling all the options you mentioned, I can use it normally under v1 mode. However, I still have two questions:
1. Which specific parameter caused the bug, and has it been reported?
2. Does everyone else encounter this particular bug? If not, what should I do in version 0.9.0 to enable in-flight mode? I ask because I ran into many version-adaptation problems when deploying the latest version.

Many thanks.
We will update the main branch code soon (possibly today), including the Triton server, so you can try the new version then; please give feedback if you still see the same issue. For the root cause, try turning these options off one by one, which also keeps the performance regression to a minimum. The older v0.9 release of TRT-LLM is not recommended; the main branch always contains the latest bug fixes.
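To bisect the culprit, one could rebuild the engine several times with a single suspect option disabled per pass. Below is a rough sketch under a few assumptions: paths are placeholders, the flags with cross-dependencies (context_fmha, paged_kv_cache) stay enabled, and a later occurrence of a flag overrides an earlier one, as with argparse-style parsing:

```bash
# Hypothetical bisection loop: build one engine per disabled option,
# then test each engine under in-flight batching to find the culprit.
for opt in streamingllm use_paged_context_fmha use_context_fmha_for_generation; do
  trtllm-build --checkpoint_dir ./ckpt \
      --output_dir "./engine_no_${opt}" \
      --context_fmha enable --paged_kv_cache enable \
      --streamingllm enable --use_paged_context_fmha enable \
      --use_context_fmha_for_generation enable \
      "--${opt}" disable  # assumes the later flag wins over the earlier "enable"
done
```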
Thanks, I'll try the main branch. I turned the options back on one by one and finally found that the culprit was the streamingllm parameter, which supports long contexts by always keeping the first S tokens in the attention window. After turning it off, in-flight batching worked. However, when I turn streamingllm on and send the parameters it requires (max_attention_window_size & sink_attention_size) in the HTTP POST, Triton gives back repeated answers again. So that's another story, about the streamingllm feature :) Thanks for the help again. I'll close this issue.
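For context, the failing request shape looks roughly like this; this is a sketch assuming the common ensemble generate route of tensorrtllm_backend, and every field except the two parameters named above is illustrative (the exact parameter names accepted depend on the deployed model configuration):

```bash
# Hypothetical request; the endpoint and most field names are assumptions based
# on the usual tensorrtllm_backend ensemble model, not taken from this thread.
curl -s -X POST localhost:8000/v2/models/ensemble/generate -d '{
  "text_input": "What is machine learning?",
  "max_tokens": 64,
  "max_attention_window_size": 2048,
  "sink_attention_size": 4
}'
```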
Issues about streamingllm feature.
Environment
- CPU architecture: x86_64
- CPU/Host memory size: 32 GB
- GPU properties: SM86
- GPU name: NVIDIA A10
- GPU memory size: 24 GB
- Clock frequencies used: 1695 MHz
- Libraries
  - TensorRT-LLM: v0.9.0
  - TensorRT: 9.3.0.post12.dev1 (`dpkg -l | grep nvinfer` reports 8.6.3)
  - CUDA: 12.3
- Container used: 24.04-trtllm-python-py3
- NVIDIA driver version: 535.161.08
- OS: Ubuntu 22.04
Reproduction Steps
Expected Behavior
A normal answer from the model.
Actual Behavior
![image](https://github.com/NVIDIA/TensorRT-LLM/assets/29043558/266db8c8-d67f-40a3-9bba-cea9cb08d16f)
Additional Notes
When I simply run TensorRT-LLM locally, via

the model can answer normally.
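The command used for the local run was not captured above; a typical invocation, assuming the standard examples/run.py script shipped with the TensorRT-LLM repository (engine and tokenizer paths are placeholders), looks like this:

```bash
# Sketch of a local sanity check; paths and the prompt are placeholders, and
# examples/run.py is the usual TensorRT-LLM demo script, not a flag-for-flag
# reproduction of the reporter's command.
python3 examples/run.py \
    --engine_dir ./engine \
    --tokenizer_dir ./hf_model \
    --max_output_len 64 \
    --input_text "What is machine learning?"
```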