PeterWang1986 opened this issue 9 months ago (status: Open)
The error typically indicates a mismatch between the TRT-LLM versions used when generating the TRT engine and the version used by the runtime. Could you verify that they are the same? Or maybe the generation of the engine for TP=4 case failed for some reason, preventing proper deserialization of the engine?
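For example, one quick way to compare the two versions — a sketch, assuming the engine directory contains the config.json that trtllm-build writes (field names vary somewhat between releases):

```bash
# TRT-LLM version installed in the serving/runtime container
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

# Version recorded alongside the engine at build time
# (assumes the engine dir holds the config.json that trtllm-build writes)
grep '"version"' /path/to/engine_dir/config.json
```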
~~I just ran into the same error with TensorRT-LLM 0.8.0, Qwen-72B-Chat, TP=4. However, I haven't tried other TP values.~~

~~I used nvidia/cuda:12.1.0-devel-ubuntu22.04 to build the model engine and nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 to serve the model. Their CUDA versions differ in minor version (12.1 vs. 12.3). I'm not sure whether that matters.~~

Never mind, it might have been caused by something else. After I restarted the container, the service spun up successfully.
> The error typically indicates a mismatch between the TRT-LLM versions used when generating the TRT engine and the version used by the runtime. Could you verify that they are the same? Or maybe the generation of the engine for TP=4 case failed for some reason, preventing proper deserialization of the engine?
I also encountered a similar problem. Could you provide some guidance on how to check the TRT-LLM version used when generating the TRT engine and the version used by the runtime? I built my environment from the NGC image nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3, then cloned the TensorRT-LLM repo and built it from source:

```bash
# Update the TensorRT-LLM submodule
git submodule update --init --recursive
git lfs install
git lfs pull
(cd tensorrt_llm && bash docker/common/install_cmake.sh && export PATH=/usr/local/cmake/bin:$PATH && python3 ./scripts/build_wheel.py --trt_root="/usr/local/tensorrt" && pip3 install ./build/tensorrt_llm*.whl)
```

After that, I used convert_checkpoint.py to convert the ChatGLM3 model checkpoint, used trtllm-build to build the engine, and then started triton-server, all in the same container. So I am wondering where I went wrong. Should I check out release 0.8.0 after cloning the TensorRT-LLM repo? Or, for now, how can I check the versions used by the built engine and the runtime?
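If the goal is to match the TRT-LLM version shipped in the 24.02 container (reportedly 0.8.0), a minimal sketch of pinning the source tree to the matching release tag before building the wheel, assuming the repo's usual vX.Y.Z tag convention:

```bash
# What the pre-installed runtime in the container reports
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

# Pin the source tree to the same release before building
cd tensorrt_llm
git checkout v0.8.0
git submodule update --init --recursive
```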
hi all, I upgraded TRT-LLM to v0.9.0 and everything works fine now, but it failed on L40S for mistral-7b (TP=4) with the following error:

```
error: creating server: Invalid argument - load failed for model 'tensorrt_llm': version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:68)
```
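One thing worth ruling out in this case: serialized TensorRT engines are tied to the GPU architecture they were built for, so an engine built on an A10 (compute capability 8.6) will not deserialize on an L40S (8.9). A quick way to compare the two machines, assuming a driver recent enough to support the compute_cap query:

```bash
# Run on both the build machine and the deployment machine; the values must match
nvidia-smi --query-gpu=name,compute_cap --format=csv
```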
System Info
Who can help?
Hi all, we use "triton + tensorrtllm_backend + TensorRT-LLM" to deploy the mistral-7b model. We build the model with "tp_size=4" and deploy the engine on A10 GPUs, but it always fails with: "UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)"
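For reference, a TP=4 engine must be served with a matching MPI world size; a minimal launch sketch along the lines of the tensorrtllm_backend README (the model repo path is a placeholder):

```bash
# world_size must equal tp_size * pp_size of the built engine (here 4 x 1 = 4)
python3 scripts/launch_triton_server.py --world_size 4 --model_repo=/path/to/triton_model_repo
```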
Here is my build config:

```json
"build_config": {
    "max_input_len": 16384,
    "max_output_len": 1024,
    "max_batch_size": 8,
    "max_beam_width": 1,
    "max_num_tokens": 8192,
    "max_prompt_embedding_table_size": 0,
    "gather_context_logits": false,
    "gather_generation_logits": false,
    "strongly_typed": false,
    "builder_opt": null,
    "profiling_verbosity": "layer_names_only",
    "plugin_config": {
        "bert_attention_plugin": "float16",
        "gpt_attention_plugin": "float16",
        "gemm_plugin": "float16",
        "smooth_quant_gemm_plugin": null,
        "identity_plugin": null,
        "layernorm_quantization_plugin": null,
        "rmsnorm_quantization_plugin": null,
        "nccl_plugin": "float16",
        "lookup_plugin": null,
        "lora_plugin": null,
        "weight_only_groupwise_quant_matmul_plugin": null,
        "weight_only_quant_matmul_plugin": null,
        "quantize_per_token_plugin": false,
        "quantize_tensor_plugin": false,
        "context_fmha": true,
        "context_fmha_fp32_acc": false,
        "paged_kv_cache": true,
        "remove_input_padding": true,
        "use_custom_all_reduce": false,
        "multi_block_mode": false,
        "enable_xqa": true,
        "attention_qk_half_accumulation": false,
        "tokens_per_block": 128,
        "use_paged_context_fmha": false,
        "use_context_fmha_for_generation": false
    }
}
```

It also failed on release 0.7.1.
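For context, a sketch of a build that would produce a config along these lines, using the llama example scripts (which also cover Mistral); paths are placeholders and exact flag names shift slightly between releases:

```bash
# Convert the HF checkpoint into a TP=4 TRT-LLM checkpoint (hypothetical paths)
python3 examples/llama/convert_checkpoint.py --model_dir /path/to/Mistral-7B-v0.1 \
    --output_dir /tmp/mistral_ckpt_tp4 --dtype float16 --tp_size 4

# Build the serialized engines (one per rank) with limits matching the config above
trtllm-build --checkpoint_dir /tmp/mistral_ckpt_tp4 --output_dir /tmp/mistral_engine_tp4 \
    --gemm_plugin float16 --max_input_len 16384 --max_output_len 1024 \
    --max_batch_size 8 --max_num_tokens 8192
```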
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Expected behavior
The model loads successfully.
Actual behavior
```
E0202 05:43:08.193459 60 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
[2024-02-02 13:43:08] 1 0x7f7e8c2617da tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
[2024-02-02 13:43:08] 2 0x7f7e8c28522e /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x79e22e) [0x7f7e8c28522e]
[2024-02-02 13:43:08] 3 0x7f7e8e150ea1 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1025
[2024-02-02 13:43:08] 4 0x7f7e8e1275a9 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1449
[2024-02-02 13:43:08] 5 0x7f7e8e11d3a0 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::list<std::shared_ptr, std::allocator<std::shared_ptr > > (int)>, std::function<void (unsigned long, std::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator > const&, bool, std::string const&)>, std::function<std::unordered_set<unsigned long, std::hash, std::equal_to, std::allocator > ()>, std::function<void (std::string const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional, std::optional, bool) + 320
[2024-02-02 13:43:08] 6 0x7f7f90028a11 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x19a11) [0x7f7f90028a11]
[2024-02-02 13:43:08] 7 0x7f7f90029c52 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1ac52) [0x7f7f90029c52]
[2024-02-02 13:43:08] 8 0x7f7f9001afc5 TRITONBACKEND_ModelInstanceInitialize + 101
[2024-02-02 13:43:08] 9 0x7f7fa9b89226 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a7226) [0x7f7fa9b89226]
[2024-02-02 13:43:08] 10 0x7f7fa9b8a466 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a8466) [0x7f7fa9b8a466]
[2024-02-02 13:43:08] 11 0x7f7fa9b6d165 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b165) [0x7f7fa9b6d165]
[2024-02-02 13:43:08] 12 0x7f7fa9b6d7a6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b7a6) [0x7f7fa9b6d7a6]
[2024-02-02 13:43:08] 13 0x7f7fa9b79a1d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x197a1d) [0x7f7fa9b79a1d]
[2024-02-02 13:43:08] 14 0x7f7fa91e4ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7f7fa91e4ee8]
[2024-02-02 13:43:08] 15 0x7f7fa9b63feb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x181feb) [0x7f7fa9b63feb]
[2024-02-02 13:43:08] 16 0x7f7fa9b73dc5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191dc5) [0x7f7fa9b73dc5]
[2024-02-02 13:43:08] 17 0x7f7fa9b78d36 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x196d36) [0x7f7fa9b78d36]
[2024-02-02 13:43:08] 18 0x7f7fa9c69330 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x287330) [0x7f7fa9c69330]
[2024-02-02 13:43:08] 19 0x7f7fa9c6ca23 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28aa23) [0x7f7fa9c6ca23]
[2024-02-02 13:43:08] 20 0x7f7fa9dc0d82 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3ded82) [0x7f7fa9dc0d82]
[2024-02-02 13:43:08] 21 0x7f7fa944f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f7fa944f253]
[2024-02-02 13:43:08] 22 0x7f7fa91dfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f7fa91dfac3]
[2024-02-02 13:43:08] 23 0x7f7fa9270814 clone + 68
[2024-02-02 13:43:08] I0202 05:43:08.193560 60 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
```
Additional notes
Everything works fine for tp_size=1 and tp_size=2. I build the model engine on a single A10 GPU and deploy the engine on other A10 GPU nodes.
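One more thing that may be worth checking, given the note above that the TP=4 engine generation itself might have failed: a TP=4 build should produce one serialized engine per rank, and a missing or truncated rank file will also surface as a deserialization failure. A quick sanity check (hypothetical paths; file naming follows the v0.8-era trtllm-build output):

```bash
# All four rank engines should exist and be roughly the same size
ls -lh /path/to/engine_dir/rank*.engine

# Optional: checksum before and after copying to the deployment nodes
md5sum /path/to/engine_dir/rank*.engine
```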