Open · siddhatiwari opened this issue 5 months ago
Thank you for finding the issue. I have started troubleshooting.
Any updates on this? I am also facing the same buffer size issue.
The crash seems to be located in tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TokenPtr tensorrt_llm::batch_manager::TrtGptModelInflightBatching::decoderStepAsync(tensorrt_llm::batch_manager::RequestTable&, const ReqIdsVec&, const ReqIdsVec&), but that code may be closed source.
Any updates on this? It would be great to see the full speedup from this feature https://github.com/NVIDIA/TensorRT-LLM/issues/317#issuecomment-1810841752
System Info
- CPU architecture: x86_64
- Host RAM: 1TB
- GPU: 8x H100 SXM
- Container: manually built container with TRT 9.3 via Dockerfile.trt_llm_backend (nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3 doesn't work for TRT LLM main branch?)
- TRT LLM: v0.9 main branch (https://github.com/NVIDIA/TensorRT-LLM/commit/850b6fa1e710d25769f2b560d897d2bd424a645e)
- Driver Version: 535.161.07
- CUDA Version: 12.2
- OS: Ubuntu 22.04
Who can help?
@byshiue @Shixiaowei02
Reproduction
Running the engine at high queries per second causes errors and incomplete output. With a very low max_num_tokens, outputs are incomplete even at very low queries per second.

Build a llama engine with use_paged_context_fmha=enable and run it with enable_chunked_context=True. The issue occurs similarly with llama models of different sizes and with different max_num_tokens values.
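For reference, a minimal sketch of a build command matching the settings above (the checkpoint/engine paths and size values are placeholders, not the ones used here, and exact flag names can differ between TensorRT-LLM versions):

```bash
# Sketch only: paths and size limits below are placeholders.
python3 examples/llama/convert_checkpoint.py \
    --model_dir ./llama-hf \
    --output_dir ./llama-ckpt \
    --dtype float16

# Chunked context needs paged context FMHA (and a paged KV cache) in the engine.
trtllm-build \
    --checkpoint_dir ./llama-ckpt \
    --output_dir ./llama-engine \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable \
    --max_num_tokens 4096
```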
Expected behavior

All outputs should be complete/not truncated, even under high load. Completion latency should increase under high load, but outputs shouldn't be affected.
actual behavior
Many requests error out during inference and return incomplete/truncated outputs:
additional notes
I tried this with multiple llama-based models and got the same error. enable_kv_cache_reuse=True seems to make the errors happen more frequently.
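For completeness, this is roughly where the runtime options mentioned above are set when serving through the Triton tensorrtllm_backend, using its fill_template.py helper. It is only a sketch: the engine path and batch size are placeholders, only the parameters relevant here are shown, and the exact key names and required fields may vary by backend version.

```bash
# Sketch only: fills just a subset of the template keys in config.pbtxt;
# other placeholders in the file are left untouched.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    triton_max_batch_size:64,batching_strategy:inflight_fused_batching,engine_dir:/engines/llama-engine,enable_chunked_context:True,enable_kv_cache_reuse:True
```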