NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Model 'tensorrt_llm' loading failed with error: key 'use_context_fmha_for_generation' not found #1803

Status: Closed (jasonngap1 closed this issue 6 days ago)

jasonngap1 commented 1 week ago

System Info

Who can help?

@kaiyux @byshiue

Reproduction

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/llama
pip install -r requirements.txt
python convert_checkpoint.py --model_dir ./Llama3-ChatQA-1.5-8B \
                             --output_dir ./Llama3-ChatQA-1.5-8B-TensorRT/ \
                             --dtype float16 \
                             --weight_only_precision int8

trtllm-build --checkpoint_dir ./Llama3-ChatQA-1.5-8B-TensorRT/ \
            --output_dir ./Llama3-ChatQA-1.5-8B-compiled/ \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16 \
            --max_input_len 128000

docker run --gpus=1 --rm --net=host -v .:/models nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 tritonserver --model-repository=/models/inflight-batch-llm

Expected behavior

I would expect the TensorRT-LLM engine to load and run in Triton Inference Server.

Actual behavior

triton-models-1  | =============================
triton-models-1  | == Triton Inference Server ==
triton-models-1  | =============================
triton-models-1  | 
triton-models-1  | NVIDIA Release 24.05 (build 95110614)
triton-models-1  | Triton Server Version 2.46.0
triton-models-1  | 
triton-models-1  | Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
triton-models-1  | Copyright (c) 2014-2024 Facebook Inc.
triton-models-1  | Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
triton-models-1  | Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
triton-models-1  | Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
triton-models-1  | Copyright (c) 2011-2013 NYU                      (Clement Farabet)
triton-models-1  | Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
triton-models-1  | Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
triton-models-1  | Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
triton-models-1  | Copyright (c) 2015      Google Inc.
triton-models-1  | Copyright (c) 2015      Yangqing Jia
triton-models-1  | Copyright (c) 2013-2016 The Caffe contributors
triton-models-1  | All rights reserved.
triton-models-1  | 
triton-models-1  | Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
triton-models-1  | 
triton-models-1  | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
triton-models-1  | By pulling and using the container, you accept the terms and conditions of this license:
triton-models-1  | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
triton-models-1  | 
triton-models-1  | I0618 08:11:13.059145 1 pinned_memory_manager.cc:275] "Pinned memory pool is created at '0x7f5008000000' with size 268435456"
triton-models-1  | I0618 08:11:13.059275 1 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
triton-models-1  | I0618 08:11:13.060598 1 model_lifecycle.cc:472] "loading: preprocessing:1"
triton-models-1  | I0618 08:11:13.060621 1 model_lifecycle.cc:472] "loading: tensorrt_llm:1"
triton-models-1  | I0618 08:11:13.060629 1 model_lifecycle.cc:472] "loading: postprocessing:1"
triton-models-1  | [TensorRT-LLM][INFO] Initializing MPI with thread mode 3
triton-models-1  | [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
triton-models-1  | [TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
triton-models-1  | [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
triton-models-1  | [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
triton-models-1  | [TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
triton-models-1  | [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
triton-models-1  | [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
triton-models-1  | [TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
triton-models-1  | [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
triton-models-1  | [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
triton-models-1  | [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
triton-models-1  | [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
triton-models-1  | [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
triton-models-1  | [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
triton-models-1  | [TensorRT-LLM][INFO] Engine version 0.11.0.dev2024061100 found in the config file, assuming engine(s) built by new builder API.
triton-models-1  | E0618 08:11:13.251485 1 backend_model.cc:692] "ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'use_context_fmha_for_generation' not found"
triton-models-1  | E0618 08:11:13.251518 1 model_lifecycle.cc:641] "failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'use_context_fmha_for_generation' not found"
triton-models-1  | I0618 08:11:13.251528 1 model_lifecycle.cc:776] "failed to load 'tensorrt_llm'"
triton-models-1  | I0618 08:11:15.508688 1 python_be.cc:2404] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
triton-models-1  | I0618 08:11:15.764787 1 python_be.cc:2404] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
triton-models-1  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
triton-models-1  | I0618 08:11:16.810643 1 model_lifecycle.cc:838] "successfully loaded 'preprocessing'"
triton-models-1  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
triton-models-1  | I0618 08:11:17.058231 1 model_lifecycle.cc:838] "successfully loaded 'postprocessing'"
triton-models-1  | E0618 08:11:17.058465 1 model_repository_manager.cc:614] "Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'use_context_fmha_for_generation' not found;"
triton-models-1  | I0618 08:11:17.058509 1 server.cc:606] 
triton-models-1  | +------------------+------+
triton-models-1  | | Repository Agent | Path |
triton-models-1  | +------------------+------+
triton-models-1  | +------------------+------+
triton-models-1  | 
triton-models-1  | I0618 08:11:17.058557 1 server.cc:633] 
triton-models-1  | +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | Backend     | Path                                                            | Config                                                                                                                                                        |
triton-models-1  | +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
triton-models-1  | | tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
triton-models-1  | +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0618 08:11:17.058594 1 server.cc:676] 
triton-models-1  | +----------------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | Model          | Version | Status                                                                                                                                                      |
triton-models-1  | +----------------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | postprocessing | 1       | READY                                                                                                                                                       |
triton-models-1  | | preprocessing  | 1       | READY                                                                                                                                                       |
triton-models-1  | | tensorrt_llm   | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'use_context_fmha_for_generation' not found |
triton-models-1  | +----------------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0618 08:11:17.099527 1 metrics.cc:877] "Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090 Laptop GPU"
triton-models-1  | I0618 08:11:17.101077 1 metrics.cc:770] "Collecting CPU metrics"
triton-models-1  | I0618 08:11:17.101202 1 tritonserver.cc:2557] 
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | Option                           | Value                                                                                                                                                                                                           |
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | server_id                        | triton                                                                                                                                                                                                          |
triton-models-1  | | server_version                   | 2.46.0                                                                                                                                                                                                          |
triton-models-1  | | server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
triton-models-1  | | model_repository_path[0]         | /models/inflight-batch-llm                                                                                                                                                                                      |
triton-models-1  | | model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
triton-models-1  | | strict_model_config              | 0                                                                                                                                                                                                               |
triton-models-1  | | model_config_name                |                                                                                                                                                                                                                 |
triton-models-1  | | rate_limit                       | OFF                                                                                                                                                                                                             |
triton-models-1  | | pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
triton-models-1  | | cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
triton-models-1  | | min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
triton-models-1  | | strict_readiness                 | 1                                                                                                                                                                                                               |
triton-models-1  | | exit_timeout                     | 30                                                                                                                                                                                                              |
triton-models-1  | | cache_enabled                    | 0                                                                                                                                                                                                               |
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0618 08:11:17.101241 1 server.cc:307] "Waiting for in-flight requests to complete."
triton-models-1  | I0618 08:11:17.101247 1 server.cc:323] "Timeout 30: Found 0 model versions that have in-flight inferences"
triton-models-1  | I0618 08:11:17.101570 1 server.cc:338] "All models are stopped, unloading models"
triton-models-1  | I0618 08:11:17.101579 1 server.cc:347] "Timeout 30: Found 2 live models and 0 in-flight non-inference requests"
triton-models-1  | I0618 08:11:18.101795 1 server.cc:347] "Timeout 29: Found 2 live models and 0 in-flight non-inference requests"
triton-models-1  | W0618 08:11:18.107242 1 metrics.cc:631] "Unable to get power limit for GPU 0. Status:Success, value:0.000000"
triton-models-1  | Cleaning up...
triton-models-1  | Cleaning up...
triton-models-1  | I0618 08:11:18.353797 1 model_lifecycle.cc:623] "successfully unloaded 'preprocessing' version 1"
triton-models-1  | I0618 08:11:18.479704 1 model_lifecycle.cc:623] "successfully unloaded 'postprocessing' version 1"
triton-models-1  | I0618 08:11:19.102114 1 server.cc:347] "Timeout 28: Found 0 live models and 0 in-flight non-inference requests"
triton-models-1  | W0618 08:11:19.119314 1 metrics.cc:631] "Unable to get power limit for GPU 0. Status:Success, value:0.000000"
triton-models-1  | error: creating server: Internal - failed to load all models
triton-models-1  | W0618 08:11:20.120275 1 metrics.cc:631] "Unable to get power limit for GPU 0. Status:Success, value:0.000000"

Additional notes

Triton Inference Server used: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
Model used: Llama3-ChatQA-1.5-8B

jasonngap1 commented 1 week ago

Update: I used tensorrt_llm version 0.10.0 to convert the checkpoint and build the engine, and the key 'use_context_fmha_for_generation' error no longer appears in Triton Inference Server. However, with the same Docker image for the inference server, a different error now occurs: Assertion failed: Failed to deserialize cuda engine.

hijkzzz commented 1 week ago

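I recommend using the commands in the Triton TensorRT-LLM backend repo to build the container: https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#build-the-docker-container

The nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 image does not match the main branch of TRT-LLM, so an engine built from main (0.11.0.dev2024061100 in your log) is not expected to load in the backend shipped with that image.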
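To confirm this kind of mismatch before starting the server, you can inspect the config.json that trtllm-build writes next to the engine. The following is a minimal sketch, not part of TensorRT-LLM: the helper names are hypothetical, and it assumes the engine version is stored under a top-level "version" field (the log only says a version was "found in the config file"). It reports that version and whether a given key, such as 'use_context_fmha_for_generation', appears anywhere in the file.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_keys(node: Any) -> Iterator[str]:
    """Yield every key found at any depth of a parsed JSON document."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield key
            yield from iter_keys(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_keys(item)


def describe_engine_config(config: dict, required_key: str) -> dict:
    """Summarize the parts of an engine config relevant to a version mismatch.

    Assumption: the engine version lives under a top-level "version" field.
    """
    return {
        "version": config.get("version"),
        "has_required_key": required_key in set(iter_keys(config)),
    }


if __name__ == "__main__":
    # Path taken from the --output_dir used in the reproduction steps.
    config_path = Path("./Llama3-ChatQA-1.5-8B-compiled/config.json")
    if config_path.exists():
        with config_path.open() as handle:
            engine_config = json.load(handle)
        print(describe_engine_config(engine_config, "use_context_fmha_for_generation"))
    else:
        print(f"{config_path} not found; adjust the path to your engine directory")
```

If the reported version differs from the TensorRT-LLM version inside the Triton container, or the key the backend asks for is absent, rebuilding the engine or the container so that both use the same version is the likely fix.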

jasonngap1 commented 1 week ago

Thanks. I built the Triton Docker container from the main branch and compiled the model with the main branch of TensorRT-LLM, but I now hit this issue:

triton-models-1  | | tensorrt_llm   | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Attention window size (mMaxAttentionWindow) must be > 0 (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/trtGptModel.h:78)                                                                       |
triton-models-1  | |                |         | 1       0x78c9a8384110 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102                                                                                                                                                                                                                                                |
triton-models-1  | |                |         | 2       0x78c8cd18869d tensorrt_llm::batch_manager::TrtGptModel::TrtGptModel(tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1981                                                                                                                  |
triton-models-1  | |                |         | 3       0x78c8cd18c95d tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 61 |
triton-models-1  | |                |         | 4       0x78c8cd1afe84 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 420                                                                                            |
triton-models-1  | |                |         | 5       0x78c8cd1b0718 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1304                                         |
triton-models-1  | |                |         | 6       0x78c8cd1b5e94 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1764                                                                                                                   |
triton-models-1  | |                |         | 7       0x78c8cd1aae60 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64                                                                                                                                                                    |
triton-models-1  | |                |         | 8       0x78c9a838f182 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 1538                                                                                                                                                             |
triton-models-1  | |                |         | 9       0x78c9a838f782 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66                                                                                                              |
triton-models-1  | |                |         | 10      0x78c9b1a2c8f5 TRITONBACKEND_ModelInstanceInitialize + 101                                                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 11      0x78c9aff24086 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af086) [0x78c9aff24086]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 12      0x78c9aff252c6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02c6) [0x78c9aff252c6]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 13      0x78c9aff078d5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928d5) [0x78c9aff078d5]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 14      0x78c9aff07f16 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f16) [0x78c9aff07f16]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 15      0x78c9aff1480d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f80d) [0x78c9aff1480d]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 16      0x78c9af575ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x78c9af575ee8]                                                                                                                                                                                                                                                                     |
triton-models-1  | |                |         | 17      0x78c9afefe64b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18964b) [0x78c9afefe64b]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 18      0x78c9aff0f4f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a4f5) [0x78c9aff0f4f5]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 19      0x78c9aff13c2e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec2e) [0x78c9aff13c2e]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 20      0x78c9b0008318 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293318) [0x78c9b0008318]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 21      0x78c9b000bbfc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296bfc) [0x78c9b000bbfc]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 22      0x78c9b0167a02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f2a02) [0x78c9b0167a02]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 23      0x78c9af7e1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x78c9af7e1253]                                                                                                                                                                                                                                                                |
triton-models-1  | |                |         | 24      0x78c9af570ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x78c9af570ac3]                                                                                                                                                                                                                                                                     |
triton-models-1  | |                |         | 25      0x78c9af601a04 clone + 68                                                                                                                                                                                                                                                                                                                         |
triton-models-1  | +----------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0620 06:16:48.723059 1 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090 Laptop GPU
triton-models-1  | I0620 06:16:48.724456 1 metrics.cc:770] Collecting CPU metrics
triton-models-1  | I0620 06:16:48.724538 1 tritonserver.cc:2538] 
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | Option                           | Value                                                                                                                                                                                                           |
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | server_id                        | triton                                                                                                                                                                                                          |
triton-models-1  | | server_version                   | 2.45.0                                                                                                                                                                                                          |
triton-models-1  | | server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
triton-models-1  | | model_repository_path[0]         | /models/inflight-batch-llm                                                                                                                                                                                      |
triton-models-1  | | model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
triton-models-1  | | strict_model_config              | 0                                                                                                                                                                                                               |
triton-models-1  | | rate_limit                       | OFF                                                                                                                                                                                                             |
triton-models-1  | | pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
triton-models-1  | | cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
triton-models-1  | | min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
triton-models-1  | | strict_readiness                 | 1                                                                                                                                                                                                               |
triton-models-1  | | exit_timeout                     | 30                                                                                                                                                                                                              |
triton-models-1  | | cache_enabled                    | 0                                                                                                                                                                                                               |
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0620 06:16:48.724541 1 server.cc:307] Waiting for in-flight requests to complete.
triton-models-1  | I0620 06:16:48.724545 1 server.cc:323] Timeout 30: Found 0 model versions that have in-flight inferences
triton-models-1  | I0620 06:16:48.724715 1 server.cc:338] All models are stopped, unloading models
triton-models-1  | I0620 06:16:48.724718 1 server.cc:347] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
triton-models-1  | I0620 06:16:49.724821 1 server.cc:347] Timeout 29: Found 2 live models and 0 in-flight non-inference requests
triton-models-1  | W0620 06:16:49.730014 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
triton-models-1  | Cleaning up...
triton-models-1  | I0620 06:16:49.899575 1 model_lifecycle.cc:620] successfully unloaded 'postprocessing' version 1
triton-models-1  | Cleaning up...
triton-models-1  | I0620 06:16:50.147048 1 model_lifecycle.cc:620] successfully unloaded 'preprocessing' version 1
triton-models-1  | I0620 06:16:50.724971 1 server.cc:347] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
triton-models-1  | W0620 06:16:50.737358 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
triton-models-1  | error: creating server: Internal - failed to load all models
triton-models-1  | W0620 06:16:51.738314 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000

yh-yao commented 1 week ago

Thanks. I built the Triton Docker container from the main branch and compiled the model with the main branch of TensorRT-LLM, but I ran into this issue:

triton-models-1  | | tensorrt_llm   | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Attention window size (mMaxAttentionWindow) must be > 0 (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/trtGptModel.h:78)                                                                       |
triton-models-1  | |                |         | 1       0x78c9a8384110 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102                                                                                                                                                                                                                                                |
triton-models-1  | |                |         | 2       0x78c8cd18869d tensorrt_llm::batch_manager::TrtGptModel::TrtGptModel(tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1981                                                                                                                  |
triton-models-1  | |                |         | 3       0x78c8cd18c95d tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 61 |
triton-models-1  | |                |         | 4       0x78c8cd1afe84 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 420                                                                                            |
triton-models-1  | |                |         | 5       0x78c8cd1b0718 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1304                                         |
triton-models-1  | |                |         | 6       0x78c8cd1b5e94 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1764                                                                                                                   |
triton-models-1  | |                |         | 7       0x78c8cd1aae60 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64                                                                                                                                                                    |
triton-models-1  | |                |         | 8       0x78c9a838f182 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 1538                                                                                                                                                             |
triton-models-1  | |                |         | 9       0x78c9a838f782 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66                                                                                                              |
triton-models-1  | |                |         | 10      0x78c9b1a2c8f5 TRITONBACKEND_ModelInstanceInitialize + 101                                                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 11      0x78c9aff24086 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af086) [0x78c9aff24086]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 12      0x78c9aff252c6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02c6) [0x78c9aff252c6]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 13      0x78c9aff078d5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928d5) [0x78c9aff078d5]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 14      0x78c9aff07f16 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f16) [0x78c9aff07f16]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 15      0x78c9aff1480d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f80d) [0x78c9aff1480d]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 16      0x78c9af575ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x78c9af575ee8]                                                                                                                                                                                                                                                                     |
triton-models-1  | |                |         | 17      0x78c9afefe64b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18964b) [0x78c9afefe64b]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 18      0x78c9aff0f4f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a4f5) [0x78c9aff0f4f5]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 19      0x78c9aff13c2e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec2e) [0x78c9aff13c2e]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 20      0x78c9b0008318 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293318) [0x78c9b0008318]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 21      0x78c9b000bbfc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296bfc) [0x78c9b000bbfc]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 22      0x78c9b0167a02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f2a02) [0x78c9b0167a02]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 23      0x78c9af7e1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x78c9af7e1253]                                                                                                                                                                                                                                                                |
triton-models-1  | |                |         | 24      0x78c9af570ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x78c9af570ac3]                                                                                                                                                                                                                                                                     |
triton-models-1  | |                |         | 25      0x78c9af601a04 clone + 68                                                                                                                                                                                                                                                                                                                         |
triton-models-1  | +----------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0620 06:16:48.723059 1 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090 Laptop GPU
triton-models-1  | I0620 06:16:48.724456 1 metrics.cc:770] Collecting CPU metrics
triton-models-1  | I0620 06:16:48.724538 1 tritonserver.cc:2538] 
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | Option                           | Value                                                                                                                                                                                                           |
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | server_id                        | triton                                                                                                                                                                                                          |
triton-models-1  | | server_version                   | 2.45.0                                                                                                                                                                                                          |
triton-models-1  | | server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
triton-models-1  | | model_repository_path[0]         | /models/inflight-batch-llm                                                                                                                                                                                      |
triton-models-1  | | model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
triton-models-1  | | strict_model_config              | 0                                                                                                                                                                                                               |
triton-models-1  | | rate_limit                       | OFF                                                                                                                                                                                                             |
triton-models-1  | | pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
triton-models-1  | | cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
triton-models-1  | | min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
triton-models-1  | | strict_readiness                 | 1                                                                                                                                                                                                               |
triton-models-1  | | exit_timeout                     | 30                                                                                                                                                                                                              |
triton-models-1  | | cache_enabled                    | 0                                                                                                                                                                                                               |
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0620 06:16:48.724541 1 server.cc:307] Waiting for in-flight requests to complete.
triton-models-1  | I0620 06:16:48.724545 1 server.cc:323] Timeout 30: Found 0 model versions that have in-flight inferences
triton-models-1  | I0620 06:16:48.724715 1 server.cc:338] All models are stopped, unloading models
triton-models-1  | I0620 06:16:48.724718 1 server.cc:347] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
triton-models-1  | I0620 06:16:49.724821 1 server.cc:347] Timeout 29: Found 2 live models and 0 in-flight non-inference requests
triton-models-1  | W0620 06:16:49.730014 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
triton-models-1  | Cleaning up...
triton-models-1  | I0620 06:16:49.899575 1 model_lifecycle.cc:620] successfully unloaded 'postprocessing' version 1
triton-models-1  | Cleaning up...
triton-models-1  | I0620 06:16:50.147048 1 model_lifecycle.cc:620] successfully unloaded 'preprocessing' version 1
triton-models-1  | I0620 06:16:50.724971 1 server.cc:347] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
triton-models-1  | W0620 06:16:50.737358 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
triton-models-1  | error: creating server: Internal - failed to load all models
triton-models-1  | W0620 06:16:51.738314 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000

I also face this issue.
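
When an engine fails to load like this, it may help to check what the built engine's `config.json` actually records before comparing it with the TensorRT-LLM version inside the Triton container. The following is a minimal sketch, not an official tool. It assumes the engine directory written by `trtllm-build` contains a `config.json` file; the key names searched for (`version`, `max_input_len`, `use_context_fmha_for_generation`) and the helper names `find_keys` and `inspect_engine_config` are illustrative assumptions, and the search is recursive precisely because the exact layout of the file is not confirmed in this thread.

```python
import json
from pathlib import Path


def find_keys(node, wanted, path=""):
    """Yield (json_path, value) for every key in `wanted` found anywhere in `node`."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key in wanted:
                yield child, value
            # Keep descending so nested occurrences are reported too.
            yield from find_keys(value, wanted, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_keys(item, wanted, f"{path}[{i}]")


def inspect_engine_config(config_path, wanted):
    """Load a JSON config file and return {json_path: value} for the wanted keys."""
    config = json.loads(Path(config_path).read_text())
    return dict(find_keys(config, set(wanted)))


if __name__ == "__main__":
    # Path taken from the --output_dir used in the reproduction steps;
    # the presence of config.json in that directory is an assumption.
    config_file = Path("./Llama3-ChatQA-1.5-8B-compiled/config.json")
    wanted = ["version", "max_input_len", "use_context_fmha_for_generation"]
    if config_file.exists():
        found = inspect_engine_config(config_file, wanted)
        for json_path, value in found.items():
            print(f"{json_path} = {value!r}")
        missing = [k for k in wanted if not any(p.split(".")[-1] == k for p in found)]
        print("not present:", missing)
    else:
        print(f"{config_file} not found")
```

If a key the backend complains about is listed under "not present", the engine and the backend were probably built from different TensorRT-LLM versions, and rebuilding the engine with the version that matches the container is the first thing to try.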

yh-yao commented 1 week ago

Thanks. I built the Triton Docker container from the main branch and compiled the model with the main branch of TensorRT-LLM, but I ran into this issue:

triton-models-1  | | tensorrt_llm   | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Attention window size (mMaxAttentionWindow) must be > 0 (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/trtGptModel.h:78)                                                                       |
triton-models-1  | |                |         | 1       0x78c9a8384110 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102                                                                                                                                                                                                                                                |
triton-models-1  | |                |         | 2       0x78c8cd18869d tensorrt_llm::batch_manager::TrtGptModel::TrtGptModel(tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1981                                                                                                                  |
triton-models-1  | |                |         | 3       0x78c8cd18c95d tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 61 |
triton-models-1  | |                |         | 4       0x78c8cd1afe84 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 420                                                                                            |
triton-models-1  | |                |         | 5       0x78c8cd1b0718 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1304                                         |
triton-models-1  | |                |         | 6       0x78c8cd1b5e94 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1764                                                                                                                   |
triton-models-1  | |                |         | 7       0x78c8cd1aae60 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64                                                                                                                                                                    |
triton-models-1  | |                |         | 8       0x78c9a838f182 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 1538                                                                                                                                                             |
triton-models-1  | |                |         | 9       0x78c9a838f782 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66                                                                                                              |
triton-models-1  | |                |         | 10      0x78c9b1a2c8f5 TRITONBACKEND_ModelInstanceInitialize + 101                                                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 11      0x78c9aff24086 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af086) [0x78c9aff24086]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 12      0x78c9aff252c6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02c6) [0x78c9aff252c6]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 13      0x78c9aff078d5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928d5) [0x78c9aff078d5]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 14      0x78c9aff07f16 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f16) [0x78c9aff07f16]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 15      0x78c9aff1480d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f80d) [0x78c9aff1480d]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 16      0x78c9af575ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x78c9af575ee8]                                                                                                                                                                                                                                                                     |
triton-models-1  | |                |         | 17      0x78c9afefe64b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18964b) [0x78c9afefe64b]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 18      0x78c9aff0f4f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a4f5) [0x78c9aff0f4f5]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 19      0x78c9aff13c2e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec2e) [0x78c9aff13c2e]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 20      0x78c9b0008318 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293318) [0x78c9b0008318]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 21      0x78c9b000bbfc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296bfc) [0x78c9b000bbfc]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 22      0x78c9b0167a02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f2a02) [0x78c9b0167a02]                                                                                                                                                                                                                                                        |
triton-models-1  | |                |         | 23      0x78c9af7e1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x78c9af7e1253]                                                                                                                                                                                                                                                                |
triton-models-1  | |                |         | 24      0x78c9af570ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x78c9af570ac3]                                                                                                                                                                                                                                                                     |
triton-models-1  | |                |         | 25      0x78c9af601a04 clone + 68                                                                                                                                                                                                                                                                                                                         |
triton-models-1  | +----------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0620 06:16:48.723059 1 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090 Laptop GPU
triton-models-1  | I0620 06:16:48.724456 1 metrics.cc:770] Collecting CPU metrics
triton-models-1  | I0620 06:16:48.724538 1 tritonserver.cc:2538] 
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | Option                           | Value                                                                                                                                                                                                           |
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | server_id                        | triton                                                                                                                                                                                                          |
triton-models-1  | | server_version                   | 2.45.0                                                                                                                                                                                                          |
triton-models-1  | | server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
triton-models-1  | | model_repository_path[0]         | /models/inflight-batch-llm                                                                                                                                                                                      |
triton-models-1  | | model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
triton-models-1  | | strict_model_config              | 0                                                                                                                                                                                                               |
triton-models-1  | | rate_limit                       | OFF                                                                                                                                                                                                             |
triton-models-1  | | pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
triton-models-1  | | cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
triton-models-1  | | min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
triton-models-1  | | strict_readiness                 | 1                                                                                                                                                                                                               |
triton-models-1  | | exit_timeout                     | 30                                                                                                                                                                                                              |
triton-models-1  | | cache_enabled                    | 0                                                                                                                                                                                                               |
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0620 06:16:48.724541 1 server.cc:307] Waiting for in-flight requests to complete.
triton-models-1  | I0620 06:16:48.724545 1 server.cc:323] Timeout 30: Found 0 model versions that have in-flight inferences
triton-models-1  | I0620 06:16:48.724715 1 server.cc:338] All models are stopped, unloading models
triton-models-1  | I0620 06:16:48.724718 1 server.cc:347] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
triton-models-1  | I0620 06:16:49.724821 1 server.cc:347] Timeout 29: Found 2 live models and 0 in-flight non-inference requests
triton-models-1  | W0620 06:16:49.730014 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
triton-models-1  | Cleaning up...
triton-models-1  | I0620 06:16:49.899575 1 model_lifecycle.cc:620] successfully unloaded 'postprocessing' version 1
triton-models-1  | Cleaning up...
triton-models-1  | I0620 06:16:50.147048 1 model_lifecycle.cc:620] successfully unloaded 'preprocessing' version 1
triton-models-1  | I0620 06:16:50.724971 1 server.cc:347] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
triton-models-1  | W0620 06:16:50.737358 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
triton-models-1  | error: creating server: Internal - failed to load all models
triton-models-1  | W0620 06:16:51.738314 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000

I also face this issue.

For anyone who hits this issue: it is caused by a mismatch between the TensorRT version in TensorRT-LLM and the one in tensorrtllm_backend. Please git pull first, then recompile the model with the most recent version of TensorRT-LLM.
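
A quick way to confirm such a mismatch is to compare the version that built the engine with the version installed in the serving container. The sketch below assumes the engine's config.json carries a top-level "version" string, which is an assumption about the engine layout and is not confirmed in this thread. The helper names engine_version and versions_match are made up for illustration.

```shell
# Sketch: print the TensorRT-LLM version recorded in an engine's config.json.
# Assumes a top-level "version" string field exists (not confirmed in this thread).
engine_version() {
    sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Returns 0 when the two version strings are identical, 1 otherwise.
versions_match() {
    [ "$1" = "$2" ]
}
```

For example, run `engine_version ./Llama3-ChatQA-1.5-8B-compiled/config.json` on the build machine, run `python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"` inside the Triton container, and pass both strings to versions_match. If they differ, rebuilding the engine with the container's version is the first thing to try.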

jasonngap1 commented 5 days ago

@yh-yao Hi, could you please share in detail how you resolved this issue? I have updated both TensorRT-LLM and tensorrtllm_backend and compiled the model again, but I now hit the error below. The tensorrt version I have installed is 10.0.1 and the tensorrt-llm version is 0.11.0.dev2024061800.

triton-models-1  | | tensorrt_llm   | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine. (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:129)                                                                                                                                      |
triton-models-1  | |                |         | 1       0x72269c65d110 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102                                                                                                                                                                                                                                                 |
triton-models-1  | |                |         | 2       0x7225c71c9fa2 /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x73cfa2) [0x7225c71c9fa2]                                                                                                                                                                                                                                             |
triton-models-1  | |                |         | 3       0x7225c918cce2 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 962 |
triton-models-1  | |                |         | 4       0x7225c91afe84 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 420                                                                                             |
triton-models-1  | |                |         | 5       0x7225c91b0718 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1304                                          |
triton-models-1  | |                |         | 6       0x7225c91b5e94 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1764                                                                                                                    |
triton-models-1  | |                |         | 7       0x7225c91aae60 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64                                                                                                                                                                     |
triton-models-1  | |                |         | 8       0x72269c668182 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 1538                                                                                                                                                              |
triton-models-1  | |                |         | 9       0x72269c668782 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66                                                                                                               |
triton-models-1  | |                |         | 10      0x7226afd338f5 TRITONBACKEND_ModelInstanceInitialize + 101                                                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 11      0x7226ae324086 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af086) [0x7226ae324086]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 12      0x7226ae3252c6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02c6) [0x7226ae3252c6]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 13      0x7226ae3078d5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928d5) [0x7226ae3078d5]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 14      0x7226ae307f16 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f16) [0x7226ae307f16]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 15      0x7226ae31480d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f80d) [0x7226ae31480d]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 16      0x7226ad975ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7226ad975ee8]                                                                                                                                                                                                                                                                      |
triton-models-1  | |                |         | 17      0x7226ae2fe64b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18964b) [0x7226ae2fe64b]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 18      0x7226ae30f4f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a4f5) [0x7226ae30f4f5]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 19      0x7226ae313c2e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec2e) [0x7226ae313c2e]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 20      0x7226ae408318 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293318) [0x7226ae408318]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 21      0x7226ae40bbfc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296bfc) [0x7226ae40bbfc]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 22      0x7226ae567a02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f2a02) [0x7226ae567a02]                                                                                                                                                                                                                                                         |
triton-models-1  | |                |         | 23      0x7226adbe1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7226adbe1253]                                                                                                                                                                                                                                                                 |
triton-models-1  | |                |         | 24      0x7226ad970ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7226ad970ac3]                                                                                                                                                                                                                                                                      |
triton-models-1  | |                |         | 25      0x7226ada01a04 clone + 68                                                                                                                                                                                                                                                                                                                          |
triton-models-1  | +----------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0624 00:59:58.653384 1 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090 Laptop GPU
triton-models-1  | I0624 00:59:58.654783 1 metrics.cc:770] Collecting CPU metrics
triton-models-1  | I0624 00:59:58.654885 1 tritonserver.cc:2538] 
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | Option                           | Value                                                                                                                                                                                                           |
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | server_id                        | triton                                                                                                                                                                                                          |
triton-models-1  | | server_version                   | 2.45.0                                                                                                                                                                                                          |
triton-models-1  | | server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
triton-models-1  | | model_repository_path[0]         | /models/inflight-batch-llm                                                                                                                                                                                      |
triton-models-1  | | model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
triton-models-1  | | strict_model_config              | 0                                                                                                                                                                                                               |
triton-models-1  | | rate_limit                       | OFF                                                                                                                                                                                                             |
triton-models-1  | | pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
triton-models-1  | | cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
triton-models-1  | | min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
triton-models-1  | | strict_readiness                 | 1                                                                                                                                                                                                               |
triton-models-1  | | exit_timeout                     | 30                                                                                                                                                                                                              |
triton-models-1  | | cache_enabled                    | 0                                                                                                                                                                                                               |
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0624 00:59:58.654890 1 server.cc:307] Waiting for in-flight requests to complete.
triton-models-1  | I0624 00:59:58.654894 1 server.cc:323] Timeout 30: Found 0 model versions that have in-flight inferences
triton-models-1  | I0624 00:59:58.655094 1 server.cc:338] All models are stopped, unloading models
triton-models-1  | I0624 00:59:58.655098 1 server.cc:347] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
triton-models-1  | I0624 00:59:59.655307 1 server.cc:347] Timeout 29: Found 2 live models and 0 in-flight non-inference requests
triton-models-1  | W0624 00:59:59.660270 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
triton-models-1  | Cleaning up...
triton-models-1  | I0624 00:59:59.885307 1 model_lifecycle.cc:620] successfully unloaded 'preprocessing' version 1
triton-models-1  | Cleaning up...
triton-models-1  | I0624 01:00:00.173149 1 model_lifecycle.cc:620] successfully unloaded 'postprocessing' version 1
triton-models-1  | I0624 01:00:00.655472 1 server.cc:347] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
triton-models-1  | W0624 01:00:00.671229 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
triton-models-1  | error: creating server: Internal - failed to load all models
triton-models-1  | W0624 01:00:01.672596 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000

jasonngap1 commented 5 days ago

@yh-yao did you use the following steps (including base image version) to build the triton server?

# Prepare the TRT-LLM base image using the dockerfile from tensorrtllm_backend.
cd tensorrtllm_backend
# Specify the build args for the dockerfile.
BASE_IMAGE=nvcr.io/nvidia/pytorch:24.03-py3
TRT_VERSION=10.0.1.6
TRT_URL_x86=https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.1/tars/TensorRT-10.0.1.6.Linux.x86_64-gnu.cuda-12.4.tar.gz
TRT_URL_ARM=https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.1/tars/TensorRT-10.0.1.6.ubuntu-22.04.aarch64-gnu.cuda-12.4.tar.gz

docker build -t trtllm_base \
             --build-arg BASE_IMAGE="${BASE_IMAGE}" \
             --build-arg TRT_VER="${TRT_VERSION}" \
             --build-arg RELEASE_URL_TRT_x86="${TRT_URL_x86}" \
             --build-arg RELEASE_URL_TRT_ARM="${TRT_URL_ARM}" \
             -f dockerfile/Dockerfile.triton.trt_llm_backend .

# Run the build script from Triton Server repo. The flags for some features or
# endpoints can be removed if not needed. Please refer to the support matrix to
# see the aligned versions: https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html
TRTLLM_BASE_IMAGE=trtllm_base
TENSORRTLLM_BACKEND_REPO_TAG=rel
PYTHON_BACKEND_REPO_TAG=r24.04

cd server
./build.py -v --no-container-interactive --enable-logging --enable-stats --enable-tracing \
              --enable-metrics --enable-gpu-metrics --enable-cpu-metrics \
              --filesystem=gcs --filesystem=s3 --filesystem=azure_storage \
              --endpoint=http --endpoint=grpc --endpoint=sagemaker --endpoint=vertex-ai \
              --backend=ensemble --enable-gpu --endpoint=http --endpoint=grpc \
              --no-container-pull \
              --image=base,${TRTLLM_BASE_IMAGE} \
              --backend=tensorrtllm:${TENSORRTLLM_BACKEND_REPO_TAG} \
              --backend=python:${PYTHON_BACKEND_REPO_TAG}

yh-yao commented 4 days ago

For triton server, I compiled tensorrtllm_backend with the most recent TensorRT-LLM:

git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git submodule update --init --recursive
rm -rf tensorrt_llm
git clone https://github.com/NVIDIA/TensorRT-LLM.git
mv TensorRT-LLM tensorrt_llm

DOCKER_BUILDKIT=1 docker build -t trt-llm -f dockerfile/Dockerfile.trt_llm_backend .

Sala8888 commented 3 days ago

Same problem. I followed the steps provided by @yh-yao and @jasonngap1 and successfully created a version-consistent backend, but another problem occurred:

E0625 08:07:38.858565 1073 model_repository_manager.cc:579] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Invalid argument: unable to find backend library for backend '${triton_backend}', try specifying runtime on the model configuration.;

My tensorrt-llm container was created from the image nvidia/cuda:12.2.0-devel-ubuntu22.04, with tensorrt==10.0.1 and tensorrt-llm==0.11.0.dev2024061800, and the backend image was built with:

git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git submodule update --init --recursive
rm -rf tensorrt_llm
git clone https://github.com/NVIDIA/TensorRT-LLM.git
mv TensorRT-LLM tensorrt_llm
DOCKER_BUILDKIT=1 docker build -t trt-llm -f dockerfile/Dockerfile.trt_llm_backend .

with tensorrt==10.0.1, tensorrt-llm==0.11.0.dev2024061800, triton==2.2.0+e28a256

Did I do something wrong?
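
One thing worth checking: the error above prints the literal string ${triton_backend}, which suggests a template placeholder in a config.pbtxt file was never substituted. That reading is inferred from the error text, not confirmed in this thread. The sketch below scans a model repository for leftover ${...} placeholders. The function name find_unfilled_placeholders is made up for illustration.

```shell
# Sketch: list config.pbtxt lines under a model repository that still contain
# unsubstituted ${...} template placeholders. Prints file:line:content for each
# hit and returns non-zero when nothing is found (grep's usual behaviour).
find_unfilled_placeholders() {
    grep -rn --include=config.pbtxt '\${[A-Za-z_][A-Za-z0-9_]*}' "$1"
}
```

For example, `find_unfilled_placeholders ./inflight-batch-llm` would be run against the directory mounted as /models/inflight-batch-llm. Any line it prints still needs a concrete value before Triton can resolve the backend.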

jasonngap1 commented 3 days ago

I followed @yh-yao's steps but got this error instead:

UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine. (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:129)

Did you manage to resolve the issue?