jasonngap1 closed this issue 4 months ago.
Update: I used tensorrt_llm version 0.10.0 to convert the checkpoints and compile the model, and the original error no longer appears in Triton Inference Server. However, another error occurred when serving with the same version of the Docker image: Assertion failed: Failed to deserialize cuda engine.
I recommend using the commands in the triton-trtllm backend repo to build the container: https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#build-the-docker-container
The image nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 does not match the main branch of TRT-LLM.
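A quick way to confirm that mismatch is to compare the TensorRT-LLM version bundled in the NGC image with the version used to build the engine. This is only a sketch and assumes the image exposes tensorrt_llm in its default Python environment:
# Version inside the serving image
docker run --rm nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 \
  python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
# Version in the environment where the engine was built
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
If the two differ, the engine generally needs to be rebuilt with the version that matches the server.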
Thanks. I have built the Triton Docker container using the main branch and compiled the model using the main branch of TensorRT-LLM, but have run into this issue:
triton-models-1 | | tensorrt_llm | 1 | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Attention window size (mMaxAttentionWindow) must be > 0 (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/trtGptModel.h:78) |
triton-models-1 | | | | 1 0x78c9a8384110 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102 |
triton-models-1 | | | | 2 0x78c8cd18869d tensorrt_llm::batch_manager::TrtGptModel::TrtGptModel(tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1981 |
triton-models-1 | | | | 3 0x78c8cd18c95d tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 61 |
triton-models-1 | | | | 4 0x78c8cd1afe84 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 420 |
triton-models-1 | | | | 5 0x78c8cd1b0718 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1304 |
triton-models-1 | | | | 6 0x78c8cd1b5e94 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1764 |
triton-models-1 | | | | 7 0x78c8cd1aae60 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64 |
triton-models-1 | | | | 8 0x78c9a838f182 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 1538 |
triton-models-1 | | | | 9 0x78c9a838f782 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66 |
triton-models-1 | | | | 10 0x78c9b1a2c8f5 TRITONBACKEND_ModelInstanceInitialize + 101 |
triton-models-1 | | | | 11 0x78c9aff24086 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af086) [0x78c9aff24086] |
triton-models-1 | | | | 12 0x78c9aff252c6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02c6) [0x78c9aff252c6] |
triton-models-1 | | | | 13 0x78c9aff078d5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928d5) [0x78c9aff078d5] |
triton-models-1 | | | | 14 0x78c9aff07f16 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f16) [0x78c9aff07f16] |
triton-models-1 | | | | 15 0x78c9aff1480d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f80d) [0x78c9aff1480d] |
triton-models-1 | | | | 16 0x78c9af575ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x78c9af575ee8] |
triton-models-1 | | | | 17 0x78c9afefe64b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18964b) [0x78c9afefe64b] |
triton-models-1 | | | | 18 0x78c9aff0f4f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a4f5) [0x78c9aff0f4f5] |
triton-models-1 | | | | 19 0x78c9aff13c2e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec2e) [0x78c9aff13c2e] |
triton-models-1 | | | | 20 0x78c9b0008318 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293318) [0x78c9b0008318] |
triton-models-1 | | | | 21 0x78c9b000bbfc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296bfc) [0x78c9b000bbfc] |
triton-models-1 | | | | 22 0x78c9b0167a02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f2a02) [0x78c9b0167a02] |
triton-models-1 | | | | 23 0x78c9af7e1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x78c9af7e1253] |
triton-models-1 | | | | 24 0x78c9af570ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x78c9af570ac3] |
triton-models-1 | | | | 25 0x78c9af601a04 clone + 68 |
triton-models-1 | +----------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1 |
triton-models-1 | I0620 06:16:48.723059 1 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090 Laptop GPU
triton-models-1 | I0620 06:16:48.724456 1 metrics.cc:770] Collecting CPU metrics
triton-models-1 | I0620 06:16:48.724538 1 tritonserver.cc:2538]
triton-models-1 | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1 | | Option | Value |
triton-models-1 | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1 | | server_id | triton |
triton-models-1 | | server_version | 2.45.0 |
triton-models-1 | | server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
triton-models-1 | | model_repository_path[0] | /models/inflight-batch-llm |
triton-models-1 | | model_control_mode | MODE_NONE |
triton-models-1 | | strict_model_config | 0 |
triton-models-1 | | rate_limit | OFF |
triton-models-1 | | pinned_memory_pool_byte_size | 268435456 |
triton-models-1 | | cuda_memory_pool_byte_size{0} | 67108864 |
triton-models-1 | | min_supported_compute_capability | 6.0 |
triton-models-1 | | strict_readiness | 1 |
triton-models-1 | | exit_timeout | 30 |
triton-models-1 | | cache_enabled | 0 |
triton-models-1 | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1 |
triton-models-1 | I0620 06:16:48.724541 1 server.cc:307] Waiting for in-flight requests to complete.
triton-models-1 | I0620 06:16:48.724545 1 server.cc:323] Timeout 30: Found 0 model versions that have in-flight inferences
triton-models-1 | I0620 06:16:48.724715 1 server.cc:338] All models are stopped, unloading models
triton-models-1 | I0620 06:16:48.724718 1 server.cc:347] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
triton-models-1 | I0620 06:16:49.724821 1 server.cc:347] Timeout 29: Found 2 live models and 0 in-flight non-inference requests
triton-models-1 | W0620 06:16:49.730014 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
triton-models-1 | Cleaning up...
triton-models-1 | I0620 06:16:49.899575 1 model_lifecycle.cc:620] successfully unloaded 'postprocessing' version 1
triton-models-1 | Cleaning up...
triton-models-1 | I0620 06:16:50.147048 1 model_lifecycle.cc:620] successfully unloaded 'preprocessing' version 1
triton-models-1 | I0620 06:16:50.724971 1 server.cc:347] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
triton-models-1 | W0620 06:16:50.737358 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
triton-models-1 | error: creating server: Internal - failed to load all models
triton-models-1 | W0620 06:16:51.738314 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
I also face this issue.
To those who see this issue: it is due to a mismatch between the TensorRT version in TensorRT-LLM and the TensorRT version in tensorrtllm_backend. Please git pull first, then compile the model with the most recent version of TensorRT-LLM.
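A minimal sketch of that pull-and-rebuild flow, assuming the default clone directory and placeholder checkpoint/engine paths:
cd TensorRT-LLM
git pull
# Rebuild or reinstall the tensorrt_llm wheel so the runtime matches this checkout
# (see the repo's build instructions), then rebuild the engine from the converted checkpoint:
trtllm-build --checkpoint_dir /path/to/converted_checkpoint --output_dir /path/to/engines
The engine must then be served with the same TensorRT-LLM version it was built with.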
@yh-yao Hi, could you please share in detail how you resolved this issue? I have updated both TensorRT_LLM and tensorrtllm_backend and compiled the model again, but have run into the issue below. The tensorrt version I have installed is 10.0.1 and the tensorrt-llm version is 0.11.0.dev2024061800.
triton-models-1 | | tensorrt_llm | 1 | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine. (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:129) |
triton-models-1 | | | | 1 0x72269c65d110 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102 |
triton-models-1 | | | | 2 0x7225c71c9fa2 /app/tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x73cfa2) [0x7225c71c9fa2] |
triton-models-1 | | | | 3 0x7225c918cce2 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 962 |
triton-models-1 | | | | 4 0x7225c91afe84 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 420 |
triton-models-1 | | | | 5 0x7225c91b0718 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1304 |
triton-models-1 | | | | 6 0x7225c91b5e94 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional<std::filesystem::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1764 |
triton-models-1 | | | | 7 0x7225c91aae60 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64 |
triton-models-1 | | | | 8 0x72269c668182 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 1538 |
triton-models-1 | | | | 9 0x72269c668782 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66 |
triton-models-1 | | | | 10 0x7226afd338f5 TRITONBACKEND_ModelInstanceInitialize + 101 |
triton-models-1 | | | | 11 0x7226ae324086 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af086) [0x7226ae324086] |
triton-models-1 | | | | 12 0x7226ae3252c6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02c6) [0x7226ae3252c6] |
triton-models-1 | | | | 13 0x7226ae3078d5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928d5) [0x7226ae3078d5] |
triton-models-1 | | | | 14 0x7226ae307f16 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f16) [0x7226ae307f16] |
triton-models-1 | | | | 15 0x7226ae31480d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f80d) [0x7226ae31480d] |
triton-models-1 | | | | 16 0x7226ad975ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7226ad975ee8] |
triton-models-1 | | | | 17 0x7226ae2fe64b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18964b) [0x7226ae2fe64b] |
triton-models-1 | | | | 18 0x7226ae30f4f5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19a4f5) [0x7226ae30f4f5] |
triton-models-1 | | | | 19 0x7226ae313c2e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19ec2e) [0x7226ae313c2e] |
triton-models-1 | | | | 20 0x7226ae408318 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x293318) [0x7226ae408318] |
triton-models-1 | | | | 21 0x7226ae40bbfc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x296bfc) [0x7226ae40bbfc] |
triton-models-1 | | | | 22 0x7226ae567a02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f2a02) [0x7226ae567a02] |
triton-models-1 | | | | 23 0x7226adbe1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7226adbe1253] |
triton-models-1 | | | | 24 0x7226ad970ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7226ad970ac3] |
triton-models-1 | | | | 25 0x7226ada01a04 clone + 68 |
triton-models-1 | +----------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1 |
triton-models-1 | I0624 00:59:58.653384 1 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090 Laptop GPU
triton-models-1 | I0624 00:59:58.654783 1 metrics.cc:770] Collecting CPU metrics
triton-models-1 | I0624 00:59:58.654885 1 tritonserver.cc:2538]
triton-models-1 | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1 | | Option | Value |
triton-models-1 | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1 | | server_id | triton |
triton-models-1 | | server_version | 2.45.0 |
triton-models-1 | | server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
triton-models-1 | | model_repository_path[0] | /models/inflight-batch-llm |
triton-models-1 | | model_control_mode | MODE_NONE |
triton-models-1 | | strict_model_config | 0 |
triton-models-1 | | rate_limit | OFF |
triton-models-1 | | pinned_memory_pool_byte_size | 268435456 |
triton-models-1 | | cuda_memory_pool_byte_size{0} | 67108864 |
triton-models-1 | | min_supported_compute_capability | 6.0 |
triton-models-1 | | strict_readiness | 1 |
triton-models-1 | | exit_timeout | 30 |
triton-models-1 | | cache_enabled | 0 |
triton-models-1 | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1 |
triton-models-1 | I0624 00:59:58.654890 1 server.cc:307] Waiting for in-flight requests to complete.
triton-models-1 | I0624 00:59:58.654894 1 server.cc:323] Timeout 30: Found 0 model versions that have in-flight inferences
triton-models-1 | I0624 00:59:58.655094 1 server.cc:338] All models are stopped, unloading models
triton-models-1 | I0624 00:59:58.655098 1 server.cc:347] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
triton-models-1 | I0624 00:59:59.655307 1 server.cc:347] Timeout 29: Found 2 live models and 0 in-flight non-inference requests
triton-models-1 | W0624 00:59:59.660270 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
triton-models-1 | Cleaning up...
triton-models-1 | I0624 00:59:59.885307 1 model_lifecycle.cc:620] successfully unloaded 'preprocessing' version 1
triton-models-1 | Cleaning up...
triton-models-1 | I0624 01:00:00.173149 1 model_lifecycle.cc:620] successfully unloaded 'postprocessing' version 1
triton-models-1 | I0624 01:00:00.655472 1 server.cc:347] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
triton-models-1 | W0624 01:00:00.671229 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
triton-models-1 | error: creating server: Internal - failed to load all models
triton-models-1 | W0624 01:00:01.672596 1 metrics.cc:631] Unable to get power limit for GPU 0. Status:Success, value:0.000000
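The "Failed to deserialize cuda engine" assertion usually indicates that the engine was serialized with a different TensorRT / TensorRT-LLM version than the one the server loads. A sketch of how to compare them inside the serving container; the engine path is a placeholder, and the top-level "version" field in the engine's config.json is assumed to be present (as in recent TensorRT-LLM releases):
# Run inside the Triton container; adjust the engine directory to your model repository
python3 -c "import json, tensorrt_llm; \
cfg = json.load(open('/models/inflight-batch-llm/tensorrt_llm/1/config.json')); \
print('engine built with:', cfg.get('version')); \
print('runtime version  :', tensorrt_llm.__version__)"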
@yh-yao did you use the following steps (including base image version) to build the triton server?
# Prepare the TRT-LLM base image using the dockerfile from tensorrtllm_backend.
cd tensorrtllm_backend
# Specify the build args for the dockerfile.
BASE_IMAGE=nvcr.io/nvidia/pytorch:24.03-py3
TRT_VERSION=10.0.1.6
TRT_URL_x86=https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.1/tars/TensorRT-10.0.1.6.Linux.x86_64-gnu.cuda-12.4.tar.gz
TRT_URL_ARM=https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.1/tars/TensorRT-10.0.1.6.ubuntu-22.04.aarch64-gnu.cuda-12.4.tar.gz
docker build -t trtllm_base \
--build-arg BASE_IMAGE="${BASE_IMAGE}" \
--build-arg TRT_VER="${TRT_VERSION}" \
--build-arg RELEASE_URL_TRT_x86="${TRT_URL_x86}" \
--build-arg RELEASE_URL_TRT_ARM="${TRT_URL_ARM}" \
-f dockerfile/Dockerfile.triton.trt_llm_backend .
# Run the build script from Triton Server repo. The flags for some features or
# endpoints can be removed if not needed. Please refer to the support matrix to
# see the aligned versions: https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html
TRTLLM_BASE_IMAGE=trtllm_base
TENSORRTLLM_BACKEND_REPO_TAG=rel
PYTHON_BACKEND_REPO_TAG=r24.04
cd server
./build.py -v --no-container-interactive --enable-logging --enable-stats --enable-tracing \
--enable-metrics --enable-gpu-metrics --enable-cpu-metrics \
--filesystem=gcs --filesystem=s3 --filesystem=azure_storage \
--endpoint=http --endpoint=grpc --endpoint=sagemaker --endpoint=vertex-ai \
--backend=ensemble --enable-gpu --endpoint=http --endpoint=grpc \
--no-container-pull \
--image=base,${TRTLLM_BASE_IMAGE} \
--backend=tensorrtllm:${TENSORRTLLM_BACKEND_REPO_TAG} \
--backend=python:${PYTHON_BACKEND_REPO_TAG}
For the Triton server, I compiled tensorrtllm_backend with the most recent TensorRT-LLM:
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git submodule update --init --recursive
rm -rf tensorrt_llm
git clone https://github.com/NVIDIA/TensorRT-LLM.git
mv TensorRT-LLM tensorrt_llm
DOCKER_BUILDKIT=1 docker build -t trt-llm -f dockerfile/Dockerfile.trt_llm_backend .
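Assuming the resulting trt-llm image includes tritonserver (as the stock Dockerfile.trt_llm_backend should), a typical launch looks like the following; the host model-repository path is a placeholder:
docker run --rm --gpus all \
  -v /path/to/model_repo:/models/inflight-batch-llm \
  trt-llm \
  tritonserver --model-repository=/models/inflight-batch-llm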
Same problem. I followed the steps provided by @yh-yao and @jasonngap1 and successfully created a version-consistent backend, but another problem occurred:
E0625 08:07:38.858565 1073 model_repository_manager.cc:579] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Invalid argument: unable to find backend library for backend '${triton_backend}', try specifying runtime on the model configuration.;
My tensorrt-llm container is created from the image nvidia/cuda:12.2.0-devel-ubuntu22.04, with tensorrt==10.0.1 and tensorrt-llm==0.11.0.dev2024061800, and the backend image is built by:
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git submodule update --init --recursive
rm -rf tensorrt_llm
git clone https://github.com/NVIDIA/TensorRT-LLM.git
mv TensorRT-LLM tensorrt_llm
DOCKER_BUILDKIT=1 docker build -t trt-llm -f dockerfile/Dockerfile.trt_llm_backend .
with tensorrt==10.0.1, tensorrt-llm==0.11.0.dev2024061800, and triton==2.2.0+e28a256.
Did I do something wrong?
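One possible cause of the '${triton_backend}' error (not confirmed in this thread, so treat it as an assumption) is that the template variables in the tensorrt_llm model's config.pbtxt were never substituted. Recent versions of tensorrtllm_backend ship a tools/fill_template.py helper for this; a sketch, with the batch size and engine directory as placeholders and any remaining variables for your checkout left to fill in:
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
  "triton_backend:tensorrtllm,triton_max_batch_size:8,engine_dir:/path/to/engines"
Alternatively, the placeholder can be replaced by hand in config.pbtxt, or a runtime can be specified in the model configuration as the error message suggests.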
I followed @yh-yao's steps but received this issue instead:
UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine. (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:129)
Did you manage to resolve the issue?
Hello, can anyone explain how you solved the error below?
E0625 08:07:38.858565 1073 model_repository_manager.cc:579] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Invalid argument: unable to find backend library for backend '${triton_backend}', try specifying runtime on the model configuration.;
Using tensorrt_backend=0.10.0 I have generated the engine file and am trying to deploy the model using the command below:
tritonserver --model-repository=path-to-tensorrt-engine --model-control-mode=explicit --load-model=preprocessing --load-model=postprocessing --load-model=tensorrt_llm --load-model=tensorrt_llm_bls --load-model=ensemble --log-verbose=2 --log-info=1 --log-warning=1 --log-error=1
System Info
Who can help?
@kaiyux @byshiue
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
I would expect the tensorrt engine to work with the triton inference server
actual behavior
additional notes
Triton Inference Server used: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
Model used: Llama3-ChatQA-1.5-8B