System Info
TensorRT-LLM branch or tag / commit: not specified
TensorRT-LLM version: 0.14.0.dev2024092400
Versions of TensorRT, AMMO, CUDA, cuBLAS, etc. used: CUDA 12.4, TensorRT 10.3.0
Container used: nvcr.io/nvidia/tensorrt:24.08-py3
NVIDIA driver version: 550.107.02
OS: Ubuntu 22.04
Who can help?
No response
Information
[x] The official example scripts
[ ] My own modified scripts
Tasks
[x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)
Reproduction
Tried to run inference with a TinyLlama engine using the C++ runtime based on the Executor API:
root@dbedfb3d0654:/workspace/TensorRT-LLM/examples/cpp/executor/build# ./executorExampleBasic ../../../llama/tinyllama-engine/
[TensorRT-LLM][INFO] ckpt0
[TensorRT-LLM][INFO] Engine version 0.14.0.dev2024092400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2048) * 22
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 2103 MiB
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: sizeof(*this) <= buffer_size (/workspace/TensorRT-LLM/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplCommon.h:118)
1 0x7fd1368a3c66 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7fd136a5c227 tensorrt_llm::kernels::jit::CubinObjRegistryTemplate<tensorrt_llm::kernels::XQAKernelFullHashKey, tensorrt_llm::kernels::XQAKernelFullHasher>::CubinObjRegistryTemplate(void const*, unsigned long) + 1047
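For context, executorExampleBasic follows the basic Executor API pattern. Below is a minimal sketch of that pattern, reconstructed from memory of the examples in examples/cpp/executor; the exact signatures (e.g. initTrtLlmPlugins, the maxNewTokens argument of Request, the layout of Result) are assumptions and may differ in this version:

```cpp
#include <tensorrt_llm/executor/executor.h>
#include <tensorrt_llm/plugins/api/tllmPlugin.h>
#include <iostream>

namespace tle = tensorrt_llm::executor;

int main(int argc, char* argv[])
{
    // Register the TensorRT-LLM plugins before deserializing the engine
    initTrtLlmPlugins();

    // Create the executor from the engine directory passed on the command line
    auto config = tle::ExecutorConfig(1); // beam width 1, matching maxBeamWidth above
    auto executor = tle::Executor(argv[1], tle::ModelType::kDECODER_ONLY, config);

    // Enqueue a request with hard-coded input token ids and wait for the output.
    // The assertion reported above fires before this point, during engine load,
    // inside the XQA kernel cubin registry deserialization.
    auto request = tle::Request({1, 2, 3, 4}, 10); // input tokens, maxNewTokens
    auto requestId = executor.enqueueRequest(request);
    for (auto const& response : executor.awaitResponses(requestId))
    {
        auto const& outputTokens = response.getResult().outputTokenIds.at(0);
        for (auto token : outputTokens)
        {
            std::cout << token << " ";
        }
        std::cout << std::endl;
    }
    return 0;
}
```

Note that the crash happens in the Executor constructor (engine load), so the enqueue/await code is never reached.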
Expected behavior
In contrast, inference succeeds in the Python runtime with the same engine file:
root@dbedfb3d0654:/workspace/TensorRT-LLM/examples# python3 run.py --engine_dir ./llama/tinyllama-engine/ --max_output_len 100 --tokenizer_dir ./llama/TinyLlama/TinyLlama_v1_1/ --input_text "How do I count to nine in French?"
[TensorRT-LLM] TensorRT-LLM version: 0.14.0.dev2024092400
[TensorRT-LLM][INFO] Engine version 0.14.0.dev2024092400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.14.0.dev2024092400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.14.0.dev2024092400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2048) * 22
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 2103 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 360.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2098 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 346.15 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.41 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 5.79 GiB, available: 1.43 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 960
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 32
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.29 GiB for max tokens in paged KV cache (61440).
[10/04/2024-05:05:27] [TRT-LLM] [I] Load engine takes: 1.5954880714416504 sec
Input [Text 0]: "<s> How do I count to nine in French?"
Output [Text 0 Beam 0]: "How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How do I count to nine in French? How"
[TensorRT-LLM][INFO] Refreshed the MPI local session
actual behavior
The C++ runtime aborts during engine load with the assertion sizeof(*this) <= buffer_size in decoderXQAImplCommon.h:118, while the Python runtime runs the same engine successfully. Something therefore appears to be wrong with the C++ runtime.
additional notes