System Info
TensorRT-LLM branch or tag / commit: not specified
TensorRT-LLM version: 0.14.0.dev2024092400
Versions of TensorRT, AMMO, CUDA, cuBLAS, etc. used: CUDA 12.4, TensorRT 10.3.0
Container used: nvcr.io/nvidia/tensorrt:24.08-py3
NVIDIA driver version: 550.107.02
OS: Ubuntu 22.04
Who can help?
No response
Information
[x] The official example scripts
[ ] My own modified scripts
Tasks
[x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)
Reproduction
Tried to run inference with a TinyLlama engine using the C++ runtime based on the Executor API:
root@dbedfb3d0654:/workspace/TensorRT-LLM/examples/cpp/executor/build# ./executorExampleBasic ../../../llama/tinyllama-engine/
[TensorRT-LLM][INFO] ckpt0
[TensorRT-LLM][INFO] Engine version 0.14.0.dev2024092400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2048) * 22
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 2103 MiB
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: sizeof(*this) <= buffer_size (/workspace/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplCommon.h:118)
1 0x7fd1368a3c66 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7fd136a5c227 tensorrt_llm::kernels::jit::CubinObjRegistryTemplate<tensorrt_llm::kernels::XQAKernelFullHashKey, tensorrt_llm::kernels::XQAKernelFullHasher>::CubinObjRegistryTemplate(void const*, unsigned long) + 1047
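For context, executorExampleBasic follows the basic Executor API pattern. Below is a minimal sketch of that pattern, reconstructed from memory of the examples in examples/cpp/executor; the exact signatures (e.g. initTrtLlmPlugins, the maxNewTokens argument of Request, the layout of Result) are assumptions and may differ in this version:

```cpp
#include <tensorrt_llm/executor/executor.h>
#include <tensorrt_llm/plugins/api/tllmPlugin.h>
#include <iostream>

namespace tle = tensorrt_llm::executor;

int main(int argc, char* argv[])
{
    // Register the TensorRT-LLM plugins before deserializing the engine
    initTrtLlmPlugins();

    // Create the executor from the engine directory passed on the command line
    auto config = tle::ExecutorConfig(1); // beam width 1, matching maxBeamWidth above
    auto executor = tle::Executor(argv[1], tle::ModelType::kDECODER_ONLY, config);

    // Enqueue a request with hard-coded input token ids and wait for the output.
    // The assertion reported above fires before this point, during engine load,
    // inside the XQA kernel cubin registry deserialization.
    auto request = tle::Request({1, 2, 3, 4}, 10); // input tokens, maxNewTokens
    auto requestId = executor.enqueueRequest(request);
    for (auto const& response : executor.awaitResponses(requestId))
    {
        auto const& outputTokens = response.getResult().outputTokenIds.at(0);
        for (auto token : outputTokens)
        {
            std::cout << token << " ";
        }
        std::cout << std::endl;
    }
    return 0;
}
```

Note that the crash happens in the Executor constructor (engine load), so the enqueue/await code is never reached.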
Expected behavior
In contrast, inference succeeds in the Python runtime with the same engine file:
root@dbedfb3d0654:/workspace/TensorRT-LLM/examples# python3 run.py --engine_dir ./llama/tinyllama-engine/ --max_output_len 100 --tokenizer_dir ./llama/TinyLlama/TinyLlama_v1_1/ --input_text "How do I count to nine in French?"
[TensorRT-LLM] TensorRT-LLM version: 0.14.0.dev2024092400
[TensorRT-LLM][INFO] Engine version 0.14.0.dev2024092400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.14.0.dev2024092400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.14.0.dev2024092400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2048) * 22
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 2103 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 360.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2098 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 346.15 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.41 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 5.79 GiB, available: 1.43 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 960
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 32
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.29 GiB for max tokens in paged KV cache (61440).
[10/04/2024-05:05:27] [TRT-LLM] [I] Load engine takes: 1.5954880714416504 sec
Input [Text 0]: "<s> How do I count to nine in French?"
Output [Text 0 Beam 0]: "How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How"
[TensorRT-LLM][INFO] Refreshed the MPI local session
actual behavior
The C++ runtime aborts during engine load with the assertion sizeof(*this) <= buffer_size in decoderXQAImplCommon.h:118, while the Python runtime runs the same engine successfully. Something therefore appears to be wrong with the C++ runtime.
additional notes