Assertion failed: Failed to deserialize cuda engine.

System Info

Ubuntu 22.04
NVIDIA driver 550.67
CUDA version 12.4
NVIDIA RTX 4090

Who can help?

@kaiyux @byshiue

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

# 1. Build TensorRT engine
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/llama
pip install -r requirements.txt
python convert_checkpoint.py --model_dir ./Llama3-ChatQA-1.5-8B \
                             --output_dir ./Llama3-ChatQA-1.5-8B-TensorRT/ \
                             --dtype float16 \
                             --weight_only_precision int8

trtllm-build --checkpoint_dir ./Llama3-ChatQA-1.5-8B-TensorRT/ \
            --output_dir ./Llama3-ChatQA-1.5-8B-compiled/ \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16 \
            --max_input_len 128000

docker run --gpus=1 --rm --net=host -v .:/models nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 tritonserver --model-repository=/models/inflight-batch-llm

# 2. Build Triton server
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive

DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .

Expected behavior

I would expect the TensorRT engine to work with the Triton Inference Server.
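
For context, the TensorRT error in the log under "actual behavior" ("expecting library version 10.0.1.6 got 10.1.0.27, please rebuild") points at a version mismatch between the environment that built the engine and the TensorRT library inside the container. Here is a minimal sketch for pulling the two versions out of that log line. The helper name `parse_trt_mismatch` and the regex are hypothetical and written only for illustration. Which number belongs to the runtime library and which to the engine file is my reading of the message, not something I have confirmed.

```python
import re

# Error line copied verbatim from the Triton log in this report.
LOG_LINE = (
    "[TensorRT-LLM][ERROR] 6: The engine plan file is not compatible with this "
    "version of TensorRT, expecting library version 10.0.1.6 got 10.1.0.27, "
    "please rebuild."
)

# Hypothetical pattern: captures the two dotted version strings in the message.
_MISMATCH = re.compile(
    r"expecting library version (\d+(?:\.\d+)*) got (\d+(?:\.\d+)*)"
)


def parse_trt_mismatch(line):
    """Return (expected_version, found_version), or None if the line does not match."""
    match = _MISMATCH.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


result = parse_trt_mismatch(LOG_LINE)
if result is not None:
    expected, found = result
    # Assumption: "expected" is the TensorRT library in the container and
    # "found" is the version recorded in the serialized engine.
    print(f"library in container: {expected}, engine built with: {found}")
    print("versions match" if expected == found else "versions differ")
```

If the two versions differ, as they do here, the message itself asks for the engine to be rebuilt, presumably with a TensorRT version matching the one in the serving container.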

Actual behavior

triton-models-1  | =============================
triton-models-1  | == Triton Inference Server ==
triton-models-1  | =============================
triton-models-1  | 
triton-models-1  | NVIDIA Release 24.04 (build 90085237)
triton-models-1  | Triton Server Version 2.45.0
triton-models-1  | 
triton-models-1  | Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
triton-models-1  | 
triton-models-1  | Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
triton-models-1  | 
triton-models-1  | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
triton-models-1  | By pulling and using the container, you accept the terms and conditions of this license:
triton-models-1  | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
triton-models-1  | 
triton-models-1  | I0625 13:46:27.239906 1 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7dc288000000' with size 268435456
triton-models-1  | I0625 13:46:27.240032 1 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
triton-models-1  | I0625 13:46:27.242694 1 model_lifecycle.cc:469] loading: preprocessing:1
triton-models-1  | I0625 13:46:27.242713 1 model_lifecycle.cc:469] loading: tensorrt_llm:1
triton-models-1  | I0625 13:46:27.242721 1 model_lifecycle.cc:469] loading: postprocessing:1
triton-models-1  | [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
triton-models-1  | [TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
triton-models-1  | [TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
triton-models-1  | [TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
triton-models-1  | [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
triton-models-1  | [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
triton-models-1  | [TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
triton-models-1  | [TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
triton-models-1  | [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
triton-models-1  | [TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
triton-models-1  | [TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
triton-models-1  | [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
triton-models-1  | [TensorRT-LLM][WARNING] Decoupled mode with a batch scheduler policy other than guaranteed_no_evict requires building the model with use_paged_context_fmha and setting enable_chunked_context to true. The batch scheduler policy will be set to guaranteed_no_evict since enable_chunked_context is false.
triton-models-1  | [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
triton-models-1  | [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
triton-models-1  | [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
triton-models-1  | [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
triton-models-1  | [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
triton-models-1  | [TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
triton-models-1  | [TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
triton-models-1  | [TensorRT-LLM][INFO] Engine version 0.11.0.dev2024061800 found in the config file, assuming engine(s) built by new builder API.
triton-models-1  | [TensorRT-LLM][INFO] Parameter layer_types cannot be read from json:
triton-models-1  | [TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'layer_types' not found
triton-models-1  | [TensorRT-LLM][INFO] Parameter has_position_embedding cannot be read from json:
triton-models-1  | [TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'has_position_embedding' not found
triton-models-1  | [TensorRT-LLM][INFO] Parameter has_token_type_embedding cannot be read from json:
triton-models-1  | [TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'has_token_type_embedding' not found
triton-models-1  | [TensorRT-LLM][INFO] [json.exception.type_error.302] type must be string, but is null
triton-models-1  | [TensorRT-LLM][INFO] Optional value for parameter quant_algo will not be set.
triton-models-1  | [TensorRT-LLM][INFO] [json.exception.type_error.302] type must be string, but is null
triton-models-1  | [TensorRT-LLM][INFO] Optional value for parameter kv_cache_quant_algo will not be set.
triton-models-1  | [TensorRT-LLM][INFO] Initializing MPI with thread mode 3
triton-models-1  | [TensorRT-LLM][INFO] Initialized MPI
triton-models-1  | [TensorRT-LLM][INFO] MPI size: 1, rank: 0
triton-models-1  | [TensorRT-LLM][INFO] Rank 0 is using GPU 0
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
triton-models-1  | [TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
triton-models-1  | [TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
triton-models-1  | I0625 13:46:29.580926 1 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
triton-models-1  | I0625 13:46:29.956292 1 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
triton-models-1  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
triton-models-1  | I0625 13:46:31.095733 1 model_lifecycle.cc:835] successfully loaded 'preprocessing'
triton-models-1  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
triton-models-1  | I0625 13:46:31.215500 1 model_lifecycle.cc:835] successfully loaded 'postprocessing'
triton-models-1  | [TensorRT-LLM][INFO] Loaded engine size: 15324 MiB
triton-models-1  | [TensorRT-LLM][ERROR] 6: The engine plan file is not compatible with this version of TensorRT, expecting library version 10.0.1.6 got 10.1.0.27, please rebuild.
triton-models-1  | [TensorRT-LLM][ERROR] 2: [engine.cpp::deserializeEngine::1312] Error Code 2: Internal Error (Assertion engine->deserialize(start, size, allocator, runtime) failed. )
triton-models-1  | E0625 13:46:36.561651 1 backend_model.cc:691] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine. (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:129)
triton-models-1  | 1       0x7dc2c46d5110 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
triton-models-1  | 2       0x7dc1fb1c9fa2 /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x73cfa2) [0x7dc1fb1c9fa2]
triton-models-1  | 3       0x7dc1fd18cce2 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 962
triton-models-1  | 4       0x7dc1fd1afe84 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 420
triton-models-1  | 5       0x7dc1fd1b0718 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1304
triton-models-1  | 6       0x7dc1fd1b5e94 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1764
triton-models-1  | 7       0x7dc1fd1aae60 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64
triton-models-1  | 8       0x7dc2c46e0182 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 1538
triton-models-1  | 9       0x7dc2c46e0782 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66
triton-models-1  | 10      0x7dc2d8fd38f5 TRITONBACKEND_ModelInstanceInitialize + 101
triton-models-1  | 11      0x7dc2d7524086 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af086) [0x7dc2d7524086]
triton-models-1  | 12      0x7dc2d75252c6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02c6) [0x7dc2d75252c6]
triton-models-1  | 13      0x7dc2d75078d5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928d5) [0x7dc2d75078d5]
triton-models-1  | 14      0x7dc2d7507f16 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f16) [0x7dc2d7507f16]
triton-models-1  | 15      0x7dc2d751480d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f80d) [0x7dc2d751480d]
triton-models-1  | 16      0x7dc2d6b75ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7dc2d6b75ee8]
triton-models-1  | 17      0x7dc2d74fe64b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18964b) [0x7dc2d74fe64b]
triton-models-1  | 18      0x7dc2d750f4f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a4f5) [0x7dc2d750f4f5]
triton-models-1  | 19      0x7dc2d7513c2e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec2e) [0x7dc2d7513c2e]
triton-models-1  | 20      0x7dc2d7608318 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293318) [0x7dc2d7608318]
triton-models-1  | 21      0x7dc2d760bbfc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296bfc) [0x7dc2d760bbfc]
triton-models-1  | 22      0x7dc2d7767a02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f2a02) [0x7dc2d7767a02]
triton-models-1  | 23      0x7dc2d6de1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7dc2d6de1253]
triton-models-1  | 24      0x7dc2d6b70ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7dc2d6b70ac3]
triton-models-1  | 25      0x7dc2d6c01a04 clone + 68
triton-models-1  | E0625 13:46:36.561729 1 model_lifecycle.cc:638] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine. (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:129)
triton-models-1  | 1       0x7dc2c46d5110 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
triton-models-1  | 2       0x7dc1fb1c9fa2 /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x73cfa2) [0x7dc1fb1c9fa2]
triton-models-1  | 3       0x7dc1fd18cce2 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 962
triton-models-1  | 4       0x7dc1fd1afe84 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 420
triton-models-1  | 5       0x7dc1fd1b0718 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1304
triton-models-1  | 6       0x7dc1fd1b5e94 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1764
triton-models-1  | 7       0x7dc1fd1aae60 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64
triton-models-1  | 8       0x7dc2c46e0182 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 1538
triton-models-1  | 9       0x7dc2c46e0782 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66
triton-models-1  | 10      0x7dc2d8fd38f5 TRITONBACKEND_ModelInstanceInitialize + 101
triton-models-1  | 11      0x7dc2d7524086 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af086) [0x7dc2d7524086]
triton-models-1  | 12      0x7dc2d75252c6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02c6) [0x7dc2d75252c6]
triton-models-1  | 13      0x7dc2d75078d5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928d5) [0x7dc2d75078d5]
triton-models-1  | 14      0x7dc2d7507f16 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f16) [0x7dc2d7507f16]
triton-models-1  | 15      0x7dc2d751480d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f80d) [0x7dc2d751480d]
triton-models-1  | 16      0x7dc2d6b75ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7dc2d6b75ee8]
triton-models-1  | 17      0x7dc2d74fe64b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18964b) [0x7dc2d74fe64b]
triton-models-1  | 18      0x7dc2d750f4f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a4f5) [0x7dc2d750f4f5]
triton-models-1  | 19      0x7dc2d7513c2e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec2e) [0x7dc2d7513c2e]
triton-models-1  | 20      0x7dc2d7608318 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293318) [0x7dc2d7608318]
triton-models-1  | 21      0x7dc2d760bbfc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296bfc) [0x7dc2d760bbfc]
triton-models-1  | 22      0x7dc2d7767a02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f2a02) [0x7dc2d7767a02]
triton-models-1  | 23      0x7dc2d6de1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7dc2d6de1253]
triton-models-1  | 24      0x7dc2d6b70ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7dc2d6b70ac3]
triton-models-1  | 25      0x7dc2d6c01a04 clone + 68
triton-models-1  | I0625 13:46:36.561751 1 model_lifecycle.cc:773] failed to load 'tensorrt_llm'
triton-models-1  | E0625 13:46:36.561897 1 model_repository_manager.cc:579] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine. (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:129)
triton-models-1  | 1       0x7dc2c46d5110 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
triton-models-1  | 2       0x7dc1fb1c9fa2 /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x73cfa2) [0x7dc1fb1c9fa2]
triton-models-1  | 3       0x7dc1fd18cce2 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 962
triton-models-1  | 4       0x7dc1fd1afe84 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 420
triton-models-1  | 5       0x7dc1fd1b0718 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1304
triton-models-1  | 6       0x7dc1fd1b5e94 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1764
triton-models-1  | 7       0x7dc1fd1aae60 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64
triton-models-1  | 8       0x7dc2c46e0182 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 1538
triton-models-1  | 9       0x7dc2c46e0782 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66
triton-models-1  | 10      0x7dc2d8fd38f5 TRITONBACKEND_ModelInstanceInitialize + 101
triton-models-1  | 11      0x7dc2d7524086 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af086) [0x7dc2d7524086]
triton-models-1  | 12      0x7dc2d75252c6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02c6) [0x7dc2d75252c6]
triton-models-1  | 13      0x7dc2d75078d5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928d5) [0x7dc2d75078d5]
triton-models-1  | 14      0x7dc2d7507f16 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f16) [0x7dc2d7507f16]
triton-models-1  | 15      0x7dc2d751480d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f80d) [0x7dc2d751480d]
triton-models-1  | 16      0x7dc2d6b75ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7dc2d6b75ee8]
triton-models-1  | 17      0x7dc2d74fe64b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18964b) [0x7dc2d74fe64b]
triton-models-1  | 18      0x7dc2d750f4f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a4f5) [0x7dc2d750f4f5]
triton-models-1  | 19      0x7dc2d7513c2e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec2e) [0x7dc2d7513c2e]
triton-models-1  | 20      0x7dc2d7608318 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293318) [0x7dc2d7608318]
triton-models-1  | 21      0x7dc2d760bbfc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296bfc) [0x7dc2d760bbfc]
triton-models-1  | 22      0x7dc2d7767a02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f2a02) [0x7dc2d7767a02]
triton-models-1  | 23      0x7dc2d6de1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7dc2d6de1253]
triton-models-1  | 24      0x7dc2d6b70ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7dc2d6b70ac3]
triton-models-1  | 25      0x7dc2d6c01a04 clone + 68;
triton-models-1  | I0625 13:46:36.561930 1 server.cc:607] 
triton-models-1  | +------------------+------+
triton-models-1  | | Repository Agent | Path |
triton-models-1  | +------------------+------+
triton-models-1  | +------------------+------+
triton-models-1  | 
triton-models-1  | I0625 13:46:36.561945 1 server.cc:634] 
triton-models-1  | +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | Backend     | Path                                                            | Config                                                                                                                                                        |
triton-models-1  | +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
triton-models-1  | | tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
triton-models-1  | +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0625 13:46:36.561994 1 server.cc:677] 
triton-models-1  | +----------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | Model          | Version | Status                                                                                                                                                                                                                                                                                                                                                     |
triton-models-1  | +----------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | postprocessing | 1       | READY                                                                                                                                                                                                                                                                                                                                                      |
triton-models-1  | | preprocessing  | 1       | READY                                                                                                                                                                                                                                                                                                                                                      |
triton-models-1  | | tensorrt_llm   | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine. (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:129)                                                                                                                                      |
triton-models-1  | |                |         | 1       0x7dc2c46d5110 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102                                                                                                                                                                                                                                                 |
triton-models-1  | |                |         | 2       0x7dc1fb1c9fa2 /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x73cfa2) [0x7dc1fb1c9fa2]                                                                                                                                                                                                                                             |
triton-models-1  | |                |         | 3       0x7dc1fd18cce2 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 962 |
triton-models-1  | |                |         | 4       0x7dc1fd1afe84 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 420                                                                                             |
triton-models-1  | |                |         | 5       0x7dc1fd1b0718 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1304                                          |
triton-models-1  | |                |         | 6       0x7dc1fd1b5e94 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1764                                                                                                                    |
triton-models-1  | |                |         | 7       0x7dc1fd1aae60 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64                                                                                                                                                                     |
triton-models-1  | |                |         | 8       0x7dc2c46e0182 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 1538                                                                                                                                                              |
triton-models-1  | |                |         | 9       0x7dc2c46e0782 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66                                                                                                               |
triton-models-1  | |                |         | 10      0x7dc2d8fd38f5 TRITONBACKEND_ModelInstanceInitialize + 101                                                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 11      0x7dc2d7524086 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af086) [0x7dc2d7524086]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 12      0x7dc2d75252c6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02c6) [0x7dc2d75252c6]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 13      0x7dc2d75078d5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928d5) [0x7dc2d75078d5]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 14      0x7dc2d7507f16 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f16) [0x7dc2d7507f16]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 15      0x7dc2d751480d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f80d) [0x7dc2d751480d]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 16      0x7dc2d6b75ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7dc2d6b75ee8]                                                                                                                                                                                                                                                                      |
triton-models-1  | |                |         | 17      0x7dc2d74fe64b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18964b) [0x7dc2d74fe64b]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 18      0x7dc2d750f4f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a4f5) [0x7dc2d750f4f5]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 19      0x7dc2d7513c2e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec2e) [0x7dc2d7513c2e]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 20      0x7dc2d7608318 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293318) [0x7dc2d7608318]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 21      0x7dc2d760bbfc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296bfc) [0x7dc2d760bbfc]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 22      0x7dc2d7767a02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f2a02) [0x7dc2d7767a02]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 23      0x7dc2d6de1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7dc2d6de1253]                                                                                                                                                                                                                                                                 |
triton-models-1  | |                |         | 24      0x7dc2d6b70ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7dc2d6b70ac3]                                                                                                                                                                                                                                                                      |
triton-models-1  | |                |         | 25      0x7dc2d6c01a04 clone + 68                                                                                                                                                                                                                                                                                                                          |
triton-models-1  | +----------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0625 13:46:36.589613 1 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090 Laptop GPU
triton-models-1  | I0625 13:46:36.591065 1 metrics.cc:770] Collecting CPU metrics
triton-models-1  | I0625 13:46:36.591159 1 tritonserver.cc:2538] 
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | Option                           | Value                                                                                                                                                                                                           |
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | server_id                        | triton                                                                                                                                                                                                          |
triton-models-1  | | server_version                   | 2.45.0                                                                                                                                                                                                          |
triton-models-1  | | server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
triton-models-1  | | model_repository_path[0]         | /models/inflight-batch-llm                                                                                                                                                                                      |
triton-models-1  | | model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
triton-models-1  | | strict_model_config              | 0                                                                                                                                                                                                               |
triton-models-1  | | rate_limit                       | OFF                                                                                                                                                                                                             |
triton-models-1  | | pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
triton-models-1  | | cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
triton-models-1  | | min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
triton-models-1  | | strict_readiness                 | 1                                                                                                                                                                                                               |
triton-models-1  | | exit_timeout                     | 30                                                                                                                                                                                                              |
triton-models-1  | | cache_enabled                    | 0                                                                                                                                                                                                               |
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0625 13:46:36.591164 1 server.cc:307] Waiting for in-flight requests to complete.
triton-models-1  | I0625 13:46:36.591170 1 server.cc:323] Timeout 30: Found 0 model versions that have in-flight inferences
triton-models-1  | I0625 13:46:36.591343 1 server.cc:338] All models are stopped, unloading models
triton-models-1  | I0625 13:46:36.591345 1 server.cc:347] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
triton-models-1  | I0625 13:46:37.591527 1 server.cc:347] Timeout 29: Found 2 live models and 0 in-flight non-inference requests
triton-models-1  | W0625 13:46:37.596849 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
triton-models-1  | Cleaning up...
triton-models-1  | I0625 13:46:37.767201 1 model_lifecycle.cc:620] successfully unloaded 'preprocessing' version 1
triton-models-1  | Cleaning up...
triton-models-1  | I0625 13:46:38.087645 1 model_lifecycle.cc:620] successfully unloaded 'postprocessing' version 1
triton-models-1  | I0625 13:46:38.591780 1 server.cc:347] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
triton-models-1  | W0625 13:46:38.603621 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
triton-models-1  | error: creating server: Internal - failed to load all models
triton-models-1  | W0625 13:46:39.604709 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
triton-models-1 exited with code 1

Additional notes

The Triton server and the TensorRT engine were both built with the same TensorRT-LLM version, at commit 2a115dae84f13daaa54727534daa837c534eceb4.

Model used: Llama3-ChatQA-1.5-8B
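
For anyone who wants to double-check the version match stated above, here is a minimal sketch. It assumes the engine output directory contains a config.json with a top-level "version" field, and that the runtime version string is obtained separately inside the serving container (for example from tensorrt_llm.__version__). Both the field name and the helper names are assumptions for illustration, not a confirmed part of the build output.

```python
import json
import tempfile
from pathlib import Path


def engine_build_version(engine_dir):
    """Return the 'version' string from <engine_dir>/config.json, or None.

    Assumption: the build step records its library version under a
    top-level "version" key in config.json.
    """
    config_path = Path(engine_dir) / "config.json"
    with config_path.open() as f:
        config = json.load(f)
    return config.get("version")


def versions_match(engine_dir, runtime_version):
    """True only if the engine records a version equal to runtime_version."""
    built = engine_build_version(engine_dir)
    return built is not None and built == runtime_version


if __name__ == "__main__":
    # Demo with placeholder version strings in a throwaway directory.
    with tempfile.TemporaryDirectory() as d:
        (Path(d) / "config.json").write_text(
            json.dumps({"version": "0.0.0-example"})
        )
        print(versions_match(d, "0.0.0-example"))
        print(versions_match(d, "9.9.9-other"))
```

In practice engine_dir would be the compiled engine directory (./Llama3-ChatQA-1.5-8B-compiled/ in the reproduction steps), and runtime_version would be read inside the Triton container that fails to load the engine.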

NVIDIA/TensorRT-LLM issue #1838: Assertion failed: Failed to deserialize cuda engine.