NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Assertion failed: Failed to deserialize cuda engine. #1838

Closed: jasonngap1 closed this issue 3 months ago

jasonngap1 commented 3 months ago

System Info

Who can help?

@kaiyux @byshiue

Reproduction

# 1. Build TensorRT engine
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/llama
pip install -r requirements.txt
python convert_checkpoint.py --model_dir ./Llama3-ChatQA-1.5-8B \
                             --output_dir ./Llama3-ChatQA-1.5-8B-TensorRT/ \
                             --dtype float16 \
                             --weight_only_precision int8

trtllm-build --checkpoint_dir ./Llama3-ChatQA-1.5-8B-TensorRT/ \
            --output_dir ./Llama3-ChatQA-1.5-8B-compiled/ \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16 \
            --max_input_len 128000

docker run --gpus=1 --rm --net=host -v .:/models nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 tritonserver --model-repository=/models/inflight-batch-llm

# 2. Build the Triton server image
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive

DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
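
Since the engine is built in one environment and loaded in another, it may help to note which TensorRT-LLM release produced it before starting the server. The following is a minimal sketch, not part of the original steps. It assumes the engine output directory contains a `config.json` with a top-level `"version"` key; the server log reports an engine version "found in the config file", but the exact layout is an assumption. `engine_version_from_config` is a hypothetical helper name.

```python
import json


def engine_version_from_config(config_text):
    """Return the builder version recorded in an engine config, or None.

    config_text is the raw JSON text of the engine's config.json.
    Assumption: the builder stores its release under a top-level "version" key.
    """
    config = json.loads(config_text)
    return config.get("version")


# Hypothetical usage, with the output directory from the build step above:
#   from pathlib import Path
#   text = Path("./Llama3-ChatQA-1.5-8B-compiled/config.json").read_text()
#   print(engine_version_from_config(text))
```

Comparing this value with the TensorRT-LLM release inside the serving container is one way to spot a build/runtime mismatch early.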

Expected behavior

I expect the TensorRT engine built above to load and run in Triton Inference Server.
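
The log under the next heading includes a TensorRT message of the form "expecting library version X got Y", which suggests that the engine and the runtime in the container come from different TensorRT releases. Below is a small sketch for pulling the two version strings out of such a line. `parse_trt_version_mismatch` is a hypothetical helper, and reading the first number as the runtime's version and the second as the engine's is an assumption.

```python
import re

# Matches the version pair in TensorRT's incompatibility message.
_MISMATCH = re.compile(
    r"expecting library version (?P<expected>\d+(?:\.\d+)*) "
    r"got (?P<found>\d+(?:\.\d+)*)"
)


def parse_trt_version_mismatch(log_text):
    """Return (expected, found) version strings, or None if no mismatch line."""
    match = _MISMATCH.search(log_text)
    if match is None:
        return None
    return match.group("expected"), match.group("found")
```

If the two versions differ, the message itself advises rebuilding the engine, presumably with the same TensorRT version that the serving container ships.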

Actual behavior

triton-models-1  | =============================
triton-models-1  | == Triton Inference Server ==
triton-models-1  | =============================
triton-models-1  | 
triton-models-1  | NVIDIA Release 24.04 (build 90085237)
triton-models-1  | Triton Server Version 2.45.0
triton-models-1  | 
triton-models-1  | Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
triton-models-1  | 
triton-models-1  | Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
triton-models-1  | 
triton-models-1  | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
triton-models-1  | By pulling and using the container, you accept the terms and conditions of this license:
triton-models-1  | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
triton-models-1  | 
triton-models-1  | I0625 13:46:27.239906 1 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7dc288000000' with size 268435456
triton-models-1  | I0625 13:46:27.240032 1 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
triton-models-1  | I0625 13:46:27.242694 1 model_lifecycle.cc:469] loading: preprocessing:1
triton-models-1  | I0625 13:46:27.242713 1 model_lifecycle.cc:469] loading: tensorrt_llm:1
triton-models-1  | I0625 13:46:27.242721 1 model_lifecycle.cc:469] loading: postprocessing:1
triton-models-1  | [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
triton-models-1  | [TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
triton-models-1  | [TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
triton-models-1  | [TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
triton-models-1  | [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
triton-models-1  | [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
triton-models-1  | [TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0
triton-models-1  | [TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
triton-models-1  | [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
triton-models-1  | [TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
triton-models-1  | [TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
triton-models-1  | [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
triton-models-1  | [TensorRT-LLM][WARNING] Decoupled mode with a batch scheduler policy other than guaranteed_no_evict requires building the model with use_paged_context_fmha and setting enable_chunked_context to true. The batch scheduler policy will be set to guaranteed_no_evict since enable_chunked_context is false.
triton-models-1  | [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
triton-models-1  | [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
triton-models-1  | [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
triton-models-1  | [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
triton-models-1  | [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search, medusa}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
triton-models-1  | [TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
triton-models-1  | [TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
triton-models-1  | [TensorRT-LLM][INFO] Engine version 0.11.0.dev2024061800 found in the config file, assuming engine(s) built by new builder API.
triton-models-1  | [TensorRT-LLM][INFO] Parameter layer_types cannot be read from json:
triton-models-1  | [TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'layer_types' not found
triton-models-1  | [TensorRT-LLM][INFO] Parameter has_position_embedding cannot be read from json:
triton-models-1  | [TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'has_position_embedding' not found
triton-models-1  | [TensorRT-LLM][INFO] Parameter has_token_type_embedding cannot be read from json:
triton-models-1  | [TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'has_token_type_embedding' not found
triton-models-1  | [TensorRT-LLM][INFO] [json.exception.type_error.302] type must be string, but is null
triton-models-1  | [TensorRT-LLM][INFO] Optional value for parameter quant_algo will not be set.
triton-models-1  | [TensorRT-LLM][INFO] [json.exception.type_error.302] type must be string, but is null
triton-models-1  | [TensorRT-LLM][INFO] Optional value for parameter kv_cache_quant_algo will not be set.
triton-models-1  | [TensorRT-LLM][INFO] Initializing MPI with thread mode 3
triton-models-1  | [TensorRT-LLM][INFO] Initialized MPI
triton-models-1  | [TensorRT-LLM][INFO] MPI size: 1, rank: 0
triton-models-1  | [TensorRT-LLM][INFO] Rank 0 is using GPU 0
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2048
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
triton-models-1  | [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
triton-models-1  | [TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
triton-models-1  | [TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
triton-models-1  | I0625 13:46:29.580926 1 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
triton-models-1  | I0625 13:46:29.956292 1 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
triton-models-1  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
triton-models-1  | I0625 13:46:31.095733 1 model_lifecycle.cc:835] successfully loaded 'preprocessing'
triton-models-1  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
triton-models-1  | I0625 13:46:31.215500 1 model_lifecycle.cc:835] successfully loaded 'postprocessing'
triton-models-1  | [TensorRT-LLM][INFO] Loaded engine size: 15324 MiB
triton-models-1  | [TensorRT-LLM][ERROR] 6: The engine plan file is not compatible with this version of TensorRT, expecting library version 10.0.1.6 got 10.1.0.27, please rebuild.
triton-models-1  | [TensorRT-LLM][ERROR] 2: [engine.cpp::deserializeEngine::1312] Error Code 2: Internal Error (Assertion engine->deserialize(start, size, allocator, runtime) failed. )
triton-models-1  | E0625 13:46:36.561651 1 backend_model.cc:691] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine. (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:129)
triton-models-1  | 1       0x7dc2c46d5110 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
triton-models-1  | 2       0x7dc1fb1c9fa2 /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x73cfa2) [0x7dc1fb1c9fa2]
triton-models-1  | 3       0x7dc1fd18cce2 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 962
triton-models-1  | 4       0x7dc1fd1afe84 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 420
triton-models-1  | 5       0x7dc1fd1b0718 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1304
triton-models-1  | 6       0x7dc1fd1b5e94 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1764
triton-models-1  | 7       0x7dc1fd1aae60 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64
triton-models-1  | 8       0x7dc2c46e0182 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 1538
triton-models-1  | 9       0x7dc2c46e0782 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66
triton-models-1  | 10      0x7dc2d8fd38f5 TRITONBACKEND_ModelInstanceInitialize + 101
triton-models-1  | 11      0x7dc2d7524086 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af086) [0x7dc2d7524086]
triton-models-1  | 12      0x7dc2d75252c6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02c6) [0x7dc2d75252c6]
triton-models-1  | 13      0x7dc2d75078d5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928d5) [0x7dc2d75078d5]
triton-models-1  | 14      0x7dc2d7507f16 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f16) [0x7dc2d7507f16]
triton-models-1  | 15      0x7dc2d751480d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f80d) [0x7dc2d751480d]
triton-models-1  | 16      0x7dc2d6b75ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7dc2d6b75ee8]
triton-models-1  | 17      0x7dc2d74fe64b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18964b) [0x7dc2d74fe64b]
triton-models-1  | 18      0x7dc2d750f4f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a4f5) [0x7dc2d750f4f5]
triton-models-1  | 19      0x7dc2d7513c2e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec2e) [0x7dc2d7513c2e]
triton-models-1  | 20      0x7dc2d7608318 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293318) [0x7dc2d7608318]
triton-models-1  | 21      0x7dc2d760bbfc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296bfc) [0x7dc2d760bbfc]
triton-models-1  | 22      0x7dc2d7767a02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f2a02) [0x7dc2d7767a02]
triton-models-1  | 23      0x7dc2d6de1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7dc2d6de1253]
triton-models-1  | 24      0x7dc2d6b70ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7dc2d6b70ac3]
triton-models-1  | 25      0x7dc2d6c01a04 clone + 68
triton-models-1  | E0625 13:46:36.561729 1 model_lifecycle.cc:638] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine. (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:129)
triton-models-1  | 1       0x7dc2c46d5110 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
triton-models-1  | 2       0x7dc1fb1c9fa2 /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x73cfa2) [0x7dc1fb1c9fa2]
triton-models-1  | 3       0x7dc1fd18cce2 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 962
triton-models-1  | 4       0x7dc1fd1afe84 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 420
triton-models-1  | 5       0x7dc1fd1b0718 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1304
triton-models-1  | 6       0x7dc1fd1b5e94 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1764
triton-models-1  | 7       0x7dc1fd1aae60 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64
triton-models-1  | 8       0x7dc2c46e0182 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 1538
triton-models-1  | 9       0x7dc2c46e0782 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66
triton-models-1  | 10      0x7dc2d8fd38f5 TRITONBACKEND_ModelInstanceInitialize + 101
triton-models-1  | 11      0x7dc2d7524086 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af086) [0x7dc2d7524086]
triton-models-1  | 12      0x7dc2d75252c6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02c6) [0x7dc2d75252c6]
triton-models-1  | 13      0x7dc2d75078d5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928d5) [0x7dc2d75078d5]
triton-models-1  | 14      0x7dc2d7507f16 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f16) [0x7dc2d7507f16]
triton-models-1  | 15      0x7dc2d751480d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f80d) [0x7dc2d751480d]
triton-models-1  | 16      0x7dc2d6b75ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7dc2d6b75ee8]
triton-models-1  | 17      0x7dc2d74fe64b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18964b) [0x7dc2d74fe64b]
triton-models-1  | 18      0x7dc2d750f4f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a4f5) [0x7dc2d750f4f5]
triton-models-1  | 19      0x7dc2d7513c2e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec2e) [0x7dc2d7513c2e]
triton-models-1  | 20      0x7dc2d7608318 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293318) [0x7dc2d7608318]
triton-models-1  | 21      0x7dc2d760bbfc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296bfc) [0x7dc2d760bbfc]
triton-models-1  | 22      0x7dc2d7767a02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f2a02) [0x7dc2d7767a02]
triton-models-1  | 23      0x7dc2d6de1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7dc2d6de1253]
triton-models-1  | 24      0x7dc2d6b70ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7dc2d6b70ac3]
triton-models-1  | 25      0x7dc2d6c01a04 clone + 68
triton-models-1  | I0625 13:46:36.561751 1 model_lifecycle.cc:773] failed to load 'tensorrt_llm'
triton-models-1  | E0625 13:46:36.561897 1 model_repository_manager.cc:579] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine. (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:129)
triton-models-1  | 1       0x7dc2c46d5110 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
triton-models-1  | 2       0x7dc1fb1c9fa2 /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x73cfa2) [0x7dc1fb1c9fa2]
triton-models-1  | 3       0x7dc1fd18cce2 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 962
triton-models-1  | 4       0x7dc1fd1afe84 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 420
triton-models-1  | 5       0x7dc1fd1b0718 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1304
triton-models-1  | 6       0x7dc1fd1b5e94 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1764
triton-models-1  | 7       0x7dc1fd1aae60 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64
triton-models-1  | 8       0x7dc2c46e0182 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 1538
triton-models-1  | 9       0x7dc2c46e0782 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66
triton-models-1  | 10      0x7dc2d8fd38f5 TRITONBACKEND_ModelInstanceInitialize + 101
triton-models-1  | 11      0x7dc2d7524086 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af086) [0x7dc2d7524086]
triton-models-1  | 12      0x7dc2d75252c6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02c6) [0x7dc2d75252c6]
triton-models-1  | 13      0x7dc2d75078d5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928d5) [0x7dc2d75078d5]
triton-models-1  | 14      0x7dc2d7507f16 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f16) [0x7dc2d7507f16]
triton-models-1  | 15      0x7dc2d751480d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f80d) [0x7dc2d751480d]
triton-models-1  | 16      0x7dc2d6b75ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7dc2d6b75ee8]
triton-models-1  | 17      0x7dc2d74fe64b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18964b) [0x7dc2d74fe64b]
triton-models-1  | 18      0x7dc2d750f4f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a4f5) [0x7dc2d750f4f5]
triton-models-1  | 19      0x7dc2d7513c2e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec2e) [0x7dc2d7513c2e]
triton-models-1  | 20      0x7dc2d7608318 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293318) [0x7dc2d7608318]
triton-models-1  | 21      0x7dc2d760bbfc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296bfc) [0x7dc2d760bbfc]
triton-models-1  | 22      0x7dc2d7767a02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f2a02) [0x7dc2d7767a02]
triton-models-1  | 23      0x7dc2d6de1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7dc2d6de1253]
triton-models-1  | 24      0x7dc2d6b70ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7dc2d6b70ac3]
triton-models-1  | 25      0x7dc2d6c01a04 clone + 68;
triton-models-1  | I0625 13:46:36.561930 1 server.cc:607] 
triton-models-1  | +------------------+------+
triton-models-1  | | Repository Agent | Path |
triton-models-1  | +------------------+------+
triton-models-1  | +------------------+------+
triton-models-1  | 
triton-models-1  | I0625 13:46:36.561945 1 server.cc:634] 
triton-models-1  | +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | Backend     | Path                                                            | Config                                                                                                                                                        |
triton-models-1  | +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
triton-models-1  | | tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
triton-models-1  | +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0625 13:46:36.561994 1 server.cc:677] 
triton-models-1  | +----------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | Model          | Version | Status                                                                                                                                                                                                                                                                                                                                                     |
triton-models-1  | +----------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | postprocessing | 1       | READY                                                                                                                                                                                                                                                                                                                                                      |
triton-models-1  | | preprocessing  | 1       | READY                                                                                                                                                                                                                                                                                                                                                      |
triton-models-1  | | tensorrt_llm   | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine. (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:129)                                                                                                                                      |
triton-models-1  | |                |         | 1       0x7dc2c46d5110 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102                                                                                                                                                                                                                                                 |
triton-models-1  | |                |         | 2       0x7dc1fb1c9fa2 /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x73cfa2) [0x7dc1fb1c9fa2]                                                                                                                                                                                                                                             |
triton-models-1  | |                |         | 3       0x7dc1fd18cce2 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 962 |
triton-models-1  | |                |         | 4       0x7dc1fd1afe84 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 420                                                                                             |
triton-models-1  | |                |         | 5       0x7dc1fd1b0718 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1304                                          |
triton-models-1  | |                |         | 6       0x7dc1fd1b5e94 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1764                                                                                                                    |
triton-models-1  | |                |         | 7       0x7dc1fd1aae60 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64                                                                                                                                                                     |
triton-models-1  | |                |         | 8       0x7dc2c46e0182 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 1538                                                                                                                                                              |
triton-models-1  | |                |         | 9       0x7dc2c46e0782 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66                                                                                                               |
triton-models-1  | |                |         | 10      0x7dc2d8fd38f5 TRITONBACKEND_ModelInstanceInitialize + 101                                                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 11      0x7dc2d7524086 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af086) [0x7dc2d7524086]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 12      0x7dc2d75252c6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02c6) [0x7dc2d75252c6]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 13      0x7dc2d75078d5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928d5) [0x7dc2d75078d5]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 14      0x7dc2d7507f16 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f16) [0x7dc2d7507f16]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 15      0x7dc2d751480d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f80d) [0x7dc2d751480d]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 16      0x7dc2d6b75ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7dc2d6b75ee8]                                                                                                                                                                                                                                                                      |
triton-models-1  | |                |         | 17      0x7dc2d74fe64b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18964b) [0x7dc2d74fe64b]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 18      0x7dc2d750f4f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a4f5) [0x7dc2d750f4f5]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 19      0x7dc2d7513c2e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec2e) [0x7dc2d7513c2e]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 20      0x7dc2d7608318 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293318) [0x7dc2d7608318]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 21      0x7dc2d760bbfc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296bfc) [0x7dc2d760bbfc]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 22      0x7dc2d7767a02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f2a02) [0x7dc2d7767a02]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 23      0x7dc2d6de1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7dc2d6de1253]                                                                                                                                                                                                                                                                 |
triton-models-1  | |                |         | 24      0x7dc2d6b70ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7dc2d6b70ac3]                                                                                                                                                                                                                                                                      |
triton-models-1  | |                |         | 25      0x7dc2d6c01a04 clone + 68                                                                                                                                                                                                                                                                                                                          |
triton-models-1  | +----------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0625 13:46:36.589613 1 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090 Laptop GPU
triton-models-1  | I0625 13:46:36.591065 1 metrics.cc:770] Collecting CPU metrics
triton-models-1  | I0625 13:46:36.591159 1 tritonserver.cc:2538] 
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | Option                           | Value                                                                                                                                                                                                           |
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | server_id                        | triton                                                                                                                                                                                                          |
triton-models-1  | | server_version                   | 2.45.0                                                                                                                                                                                                          |
triton-models-1  | | server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
triton-models-1  | | model_repository_path[0]         | /models/inflight-batch-llm                                                                                                                                                                                      |
triton-models-1  | | model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
triton-models-1  | | strict_model_config              | 0                                                                                                                                                                                                               |
triton-models-1  | | rate_limit                       | OFF                                                                                                                                                                                                             |
triton-models-1  | | pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
triton-models-1  | | cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
triton-models-1  | | min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
triton-models-1  | | strict_readiness                 | 1                                                                                                                                                                                                               |
triton-models-1  | | exit_timeout                     | 30                                                                                                                                                                                                              |
triton-models-1  | | cache_enabled                    | 0                                                                                                                                                                                                               |
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0625 13:46:36.591164 1 server.cc:307] Waiting for in-flight requests to complete.
triton-models-1  | I0625 13:46:36.591170 1 server.cc:323] Timeout 30: Found 0 model versions that have in-flight inferences
triton-models-1  | I0625 13:46:36.591343 1 server.cc:338] All models are stopped, unloading models
triton-models-1  | I0625 13:46:36.591345 1 server.cc:347] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
triton-models-1  | I0625 13:46:37.591527 1 server.cc:347] Timeout 29: Found 2 live models and 0 in-flight non-inference requests
triton-models-1  | W0625 13:46:37.596849 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
triton-models-1  | Cleaning up...
triton-models-1  | I0625 13:46:37.767201 1 model_lifecycle.cc:620] successfully unloaded 'preprocessing' version 1
triton-models-1  | Cleaning up...
triton-models-1  | I0625 13:46:38.087645 1 model_lifecycle.cc:620] successfully unloaded 'postprocessing' version 1
triton-models-1  | I0625 13:46:38.591780 1 server.cc:347] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
triton-models-1  | W0625 13:46:38.603621 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
triton-models-1  | error: creating server: Internal - failed to load all models
triton-models-1  | W0625 13:46:39.604709 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
triton-models-1 exited with code 1

additional notes

Both the Triton server and the TensorRT engine were built with the same TensorRT-LLM version, at commit 2a115dae84f13daaa54727534daa837c534eceb4.

Model used: Llama3-ChatQA-1.5-8B

nv-guomingz commented 3 months ago

The error message in the log already states the cause:

triton-models-1 | [TensorRT-LLM][ERROR] 6: The engine plan file is not compatible with this version of TensorRT, expecting library version 10.0.1.6 got 10.1.0.27

Please rebuild the engine with commit 9691e12bce7ae1c126c435a049eb516eb119486c.
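To pull the two version numbers out of a long server log, a small parser can help. The following is a minimal sketch, not part of TensorRT-LLM or Triton: the function name `parse_version_mismatch` and the regular expression are assumptions written for this illustration, and the pattern only targets the wording of the message quoted above.

```python
import re

# Assumed pattern: matches the "expecting library version X got Y" wording
# from the error message quoted in this thread. Other TensorRT versions
# may phrase the message differently.
_MISMATCH = re.compile(
    r"expecting library version (\d+(?:\.\d+)*) got (\d+(?:\.\d+)*)"
)


def parse_version_mismatch(log_line):
    """Return (first_version, second_version) from the message, or None."""
    match = _MISMATCH.search(log_line)
    if match is None:
        return None
    return match.group(1), match.group(2)


line = (
    "triton-models-1 | [TensorRT-LLM][ERROR] 6: The engine plan file is not "
    "compatible with this version of TensorRT, expecting library version "
    "10.0.1.6 got 10.1.0.27"
)
print(parse_version_mismatch(line))
```

Whenever the two values differ, the engine and the serving runtime were produced against different TensorRT releases, and one side has to be rebuilt to match the other.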

jasonngap1 commented 3 months ago

Thank you for pointing this out. The commit you mention appears to be the latest one, which uses TensorRT 10.1.0, but the Triton server expects the older version 10.0.1, even though I built the server with the latest TensorRT-LLM. Could you tell me which commit I should build the engine from to get TensorRT 10.0.1?

nv-guomingz commented 3 months ago

As the description of https://github.com/NVIDIA/TensorRT-LLM/pull/1835 mentions, the latest commit upgrades TRT to 10.1. Please try 2a115dae84f13daaa54727534daa837c534eceb4, which is the parent commit of HEAD.
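Before rebuilding, it may be worth confirming which TensorRT release is installed in the build environment. The sketch below uses only the Python standard library. It assumes the pip distribution is named `tensorrt` and that comparing the first three version components is a sufficient check; the helper names `installed_version` and `same_release` are hypothetical, and the expected value is copied from the error message in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a distribution, or None."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def same_release(a, b, parts=3):
    """Compare the first `parts` dot-separated components (assumed heuristic)."""
    return a.split(".")[:parts] == b.split(".")[:parts]


# Version taken from the error message in this thread.
expected = "10.0.1.6"
# "tensorrt" is the assumed distribution name; adjust it if yours differs.
found = installed_version("tensorrt")

if found is None:
    print("tensorrt is not installed in this environment")
elif same_release(found, expected):
    print(f"tensorrt {found} matches the expected release {expected}")
else:
    print(f"tensorrt {found} differs from the expected release {expected}")
```

Running the same check inside the serving container and in the engine build environment shows whether the two sides agree before any time is spent on a rebuild.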

jasonngap1 commented 3 months ago

Thank you, @nv-guomingz, this resolved my issue. Closing the issue on my end.