NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Chunked context incomplete outputs #1377

Open siddhatiwari opened 5 months ago

siddhatiwari commented 5 months ago

System Info

- CPU architecture: x86_64
- Host RAM: 1 TB
- GPU: 8x H100 SXM
- Container: manually built with TRT 9.3 from Dockerfile.trt_llm_backend (nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3 doesn't work for the TRT-LLM main branch?)
- TensorRT-LLM: v0.9 main branch (https://github.com/NVIDIA/TensorRT-LLM/commit/850b6fa1e710d25769f2b560d897d2bd424a645e)
- Driver Version: 535.161.07
- CUDA Version: 12.2
- OS: Ubuntu 22.04
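For reference, a minimal sketch of the manual container build mentioned above, assuming the standard triton-inference-server/tensorrtllm_backend repository layout and its Dockerfile.trt_llm_backend; paths and the image tag are placeholders, not taken from this issue:

# Clone the backend repo and pull the TensorRT-LLM submodule at the commit under test
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive
# Build the Triton + TRT-LLM image from source instead of using the NGC image
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm \
  -f dockerfile/Dockerfile.trt_llm_backend .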

Who can help?

@byshiue @Shixiaowei02


Reproduction

Running the engine at high queries per second causes errors and incomplete outputs. With a very low max_num_tokens, outputs are incomplete even at a very low query rate.

Build a Llama engine with use_paged_context_fmha=enable and run it with enable_chunked_context=True (a serving-side sketch follows the build commands below). The issue occurs in the same way with different-sized Llama models and different max_num_tokens values.

python3 convert_checkpoint.py \
  --model_dir ./llama-70b \
  --output_dir ./llama-70b_tp2 \
  --dtype float16 \
  --tp_size 2

trtllm-build \
  --checkpoint_dir ./llama-70b_tp2 \
  --output_dir engines/llama-70b-1 \
  --gemm_plugin float16 \
  --max_batch_size 256 \
  --max_input_len 2048 \
  --max_output_len 512 \
  --gpt_attention_plugin float16 \
  --paged_kv_cache enable \
  --remove_input_padding enable \
  --multi_block_mode disable \
  --max_num_tokens 8192 \
  --context_fmha enable \
  --use_paged_context_fmha enable \
  --context_fmha_fp32_acc enable \
  --use_fused_mlp \
  --enable_xqa enable \
  --use_custom_all_reduce enable
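For completeness, a sketch of how chunked context can be enabled on the serving side. This assumes the Triton tensorrtllm_backend model repository, its tools/fill_template.py helper and scripts/launch_triton_server.py, and that the tensorrt_llm model's config.pbtxt template exposes enable_chunked_context and enable_kv_cache_reuse parameters; the repository path and engine path are placeholders:

# Fill in the tensorrt_llm model config (other required template variables omitted for brevity).
# enable_chunked_context needs an engine built with --use_paged_context_fmha enable, as above.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
  "triton_max_batch_size:256,engine_dir:/engines/llama-70b-1,batching_strategy:inflight_fused_batching,enable_chunked_context:true,enable_kv_cache_reuse:false"

# Launch Triton with world size 2 to match tp_size 2
python3 scripts/launch_triton_server.py --world_size 2 --model_repo triton_model_repo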

Expected behavior

All outputs should be complete/not truncated, even under high load. Completion latency should increase under high load, but outputs shouldn't be affected.
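As a reference for what "high load" means here, a minimal sketch of driving many concurrent requests at the server; the /v2/models/ensemble/generate endpoint and the text_input/max_tokens/bad_words/stop_words field names assume Triton's generate endpoint with the standard tensorrtllm_backend ensemble, and are not taken from this issue:

# Fire 256 concurrent generate requests and check that none come back truncated
for i in $(seq 1 256); do
  curl -s -X POST http://localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "Summarize the history of GPUs.", "max_tokens": 512, "bad_words": "", "stop_words": ""}' &
done
wait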

Actual behavior

Many requests error out during inference and return incomplete/truncated outputs:

[TensorRT-LLM][ERROR] Encountered error for requestId 7692: Encountered an error in forward function: slice 2044 exceeds buffer size 420
{"asctime": "2024-03-27 17:28:07,904", "levelname": "ERROR", "message": "Exception while reading stream response: {\"status\": \"error\", \"message\": \"in ensemble 'ensemble', Encountered error for requestId 7657: Encountered an error in forward function: slice 2044 exceeds buffer size 420\"}", "exc_info": "Traceback (most recent call last):\n  File \"/app/model_wrapper.py\", line 257, in write_response_to_queue\n    async for chunk in generator:\n  File \"/app/model/model.py\", line 116, in generate\n    async for i in result_iterator:\n  File \"/packages/client.py\", line 181, in infer\n    raise Exception(error_message)\nException: {\"status\": \"error\", \"message\": \"in ensemble 'ensemble', Encountered error for requestId 7657: Encountered an error in forward function: slice 2044 exceeds buffer size 420\"}"}

Additional notes

I tried this with multiple Llama-based models and got the same error. Setting enable_kv_cache_reuse=True seems to make the errors happen more frequently.

Shixiaowei02 commented 5 months ago

Thank you for finding the issue. I have started troubleshooting.

Tushar-ml commented 4 months ago

Any updates on this? I am also facing the same buffer-size issue.

skyCreateXian commented 4 months ago

I traced the crash to tensorrt_llm::batch_manager::TrtGptModelInflightBatching::decoderStepAsync(tensorrt_llm::batch_manager::RequestTable&, const ReqIdsVec&, const ReqIdsVec&) (returning TokenPtr), but that code may be closed source.

siddhatiwari commented 3 months ago

Any updates on this? It would be great to see the full speedup from this feature https://github.com/NVIDIA/TensorRT-LLM/issues/317#issuecomment-1810841752

byshiue commented 3 months ago

Could you try this guide, which uses chunked context to run long context?