NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Failed to deserialize cuda engine when using "tp_size=4" #1038

Open PeterWang1986 opened 7 months ago

PeterWang1986 commented 7 months ago

System Info

Who can help?

Hi all, we use "Triton + tensorrtllm_backend + TensorRT-LLM" to deploy the mistral-7b model. We build the model with "tp_size=4" and deploy the engine on A10 GPUs, but it always fails with: "UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)"

Here is my build config:

    "build_config": {
        "max_input_len": 16384,
        "max_output_len": 1024,
        "max_batch_size": 8,
        "max_beam_width": 1,
        "max_num_tokens": 8192,
        "max_prompt_embedding_table_size": 0,
        "gather_context_logits": false,
        "gather_generation_logits": false,
        "strongly_typed": false,
        "builder_opt": null,
        "profiling_verbosity": "layer_names_only",
        "plugin_config": {
            "bert_attention_plugin": "float16",
            "gpt_attention_plugin": "float16",
            "gemm_plugin": "float16",
            "smooth_quant_gemm_plugin": null,
            "identity_plugin": null,
            "layernorm_quantization_plugin": null,
            "rmsnorm_quantization_plugin": null,
            "nccl_plugin": "float16",
            "lookup_plugin": null,
            "lora_plugin": null,
            "weight_only_groupwise_quant_matmul_plugin": null,
            "weight_only_quant_matmul_plugin": null,
            "quantize_per_token_plugin": false,
            "quantize_tensor_plugin": false,
            "context_fmha": true,
            "context_fmha_fp32_acc": false,
            "paged_kv_cache": true,
            "remove_input_padding": true,
            "use_custom_all_reduce": false,
            "multi_block_mode": false,
            "enable_xqa": true,
            "attention_qk_half_accumulation": false,
            "tokens_per_block": 128,
            "use_paged_context_fmha": false,
            "use_context_fmha_for_generation": false
        }
    }

It also failed on release 0.7.1.

Information

Tasks

Reproduction

  1. Start the container (triton_trt_llm:main-3608b0), built from https://github.com/triton-inference-server/tensorrtllm_backend/tree/main, Option 2.
  2. Convert the checkpoint: `python /app/tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir xxxx --output_dir xxx --dtype float16 --tp_size 4`
  3. Build the engine: `trtllm-build --checkpoint_dir xxxx --output_dir xxxx --gpt_attention_plugin float16 --gemm_plugin float16 --remove_input_padding enable --context_fmha enable --paged_kv_cache enable --use_custom_all_reduce disable --max_input_len=16384 --max_output_len=1024 --max_num_tokens=8192 --max_batch_size=8`
  4. Start the Triton server: `python3 /app/scripts/launch_triton_server.py --world_size 4 --model_repo=xxxx` (steps 2-4 are collected into a runnable sketch right after this list)
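
For reference, the conversion, build, and launch commands above can be collected into a single script. This is only a sketch: the `xxxx` paths in the report are elided, so the `*_DIR` variables below are placeholders that must be replaced with the real checkpoint, engine, and model-repository paths.

```bash
#!/usr/bin/env bash
# Sketch of the reproduction steps above; the *_DIR variables are
# placeholders (the original report elides the real paths as "xxxx").
set -euo pipefail

MODEL_DIR=/path/to/mistral-7b          # HF checkpoint (placeholder)
CKPT_DIR=/path/to/tllm_checkpoint_tp4  # converted checkpoint (placeholder)
ENGINE_DIR=/path/to/engines/tp4        # built engines (placeholder)
MODEL_REPO=/path/to/triton_model_repo  # Triton model repository (placeholder)

# Step 2: convert the HF checkpoint for tensor parallelism 4
python /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
    --model_dir "${MODEL_DIR}" \
    --output_dir "${CKPT_DIR}" \
    --dtype float16 \
    --tp_size 4

# Step 3: build the TensorRT engines (one per rank)
trtllm-build \
    --checkpoint_dir "${CKPT_DIR}" \
    --output_dir "${ENGINE_DIR}" \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --remove_input_padding enable \
    --context_fmha enable \
    --paged_kv_cache enable \
    --use_custom_all_reduce disable \
    --max_input_len=16384 \
    --max_output_len=1024 \
    --max_num_tokens=8192 \
    --max_batch_size=8

# Step 4: launch Triton with one rank per GPU (world_size = tp_size here)
python3 /app/scripts/launch_triton_server.py \
    --world_size 4 \
    --model_repo="${MODEL_REPO}"
```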

Expected behavior

The model loads successfully.

actual behavior

    E0202 05:43:08.193459 60 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
    [2024-02-02 13:43:08] 1 0x7f7e8c2617da tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
    [2024-02-02 13:43:08] 2 0x7f7e8c28522e /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x79e22e) [0x7f7e8c28522e]
    [2024-02-02 13:43:08] 3 0x7f7e8e150ea1 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1025
    [2024-02-02 13:43:08] 4 0x7f7e8e1275a9 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1449
    [2024-02-02 13:43:08] 5 0x7f7e8e11d3a0 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::list<std::shared_ptr, std::allocator<std::shared_ptr > > (int)>, std::function<void (unsigned long, std::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator > const&, bool, std::string const&)>, std::function<std::unordered_set<unsigned long, std::hash, std::equal_to, std::allocator > ()>, std::function<void (std::string const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional, std::optional, bool) + 320
    [2024-02-02 13:43:08] 6 0x7f7f90028a11 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x19a11) [0x7f7f90028a11]
    [2024-02-02 13:43:08] 7 0x7f7f90029c52 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1ac52) [0x7f7f90029c52]
    [2024-02-02 13:43:08] 8 0x7f7f9001afc5 TRITONBACKEND_ModelInstanceInitialize + 101
    [2024-02-02 13:43:08] 9 0x7f7fa9b89226 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a7226) [0x7f7fa9b89226]
    [2024-02-02 13:43:08] 10 0x7f7fa9b8a466 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a8466) [0x7f7fa9b8a466]
    [2024-02-02 13:43:08] 11 0x7f7fa9b6d165 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b165) [0x7f7fa9b6d165]
    [2024-02-02 13:43:08] 12 0x7f7fa9b6d7a6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18b7a6) [0x7f7fa9b6d7a6]
    [2024-02-02 13:43:08] 13 0x7f7fa9b79a1d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x197a1d) [0x7f7fa9b79a1d]
    [2024-02-02 13:43:08] 14 0x7f7fa91e4ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7f7fa91e4ee8]
    [2024-02-02 13:43:08] 15 0x7f7fa9b63feb /opt/tritonserver/bin/../lib/libtritonserver.so(+0x181feb) [0x7f7fa9b63feb]
    [2024-02-02 13:43:08] 16 0x7f7fa9b73dc5 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191dc5) [0x7f7fa9b73dc5]
    [2024-02-02 13:43:08] 17 0x7f7fa9b78d36 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x196d36) [0x7f7fa9b78d36]
    [2024-02-02 13:43:08] 18 0x7f7fa9c69330 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x287330) [0x7f7fa9c69330]
    [2024-02-02 13:43:08] 19 0x7f7fa9c6ca23 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28aa23) [0x7f7fa9c6ca23]
    [2024-02-02 13:43:08] 20 0x7f7fa9dc0d82 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3ded82) [0x7f7fa9dc0d82]
    [2024-02-02 13:43:08] 21 0x7f7fa944f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f7fa944f253]
    [2024-02-02 13:43:08] 22 0x7f7fa91dfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f7fa91dfac3]
    [2024-02-02 13:43:08] 23 0x7f7fa9270814 clone + 68
    I0202 05:43:08.193560 60 model_lifecycle.cc:756] failed to load 'tensorrt_llm'

additional notes

Everything works fine for tp_size=1 and tp_size=2. Note that I build the model engine on a single A10 GPU and deploy the engine on other A10 GPU nodes.
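
One thing worth ruling out when the engine is built on one node and deployed on others is an environment difference between those nodes: serialized TensorRT engines generally assume the same GPU model and a compatible driver/library stack. A minimal sketch for comparing nodes (the `compute_cap` query field needs a reasonably recent driver):

```bash
# Run on both the build node and each deployment node, then compare the output.
nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv,noheader

# The CUDA toolkit and the TensorRT / TensorRT-LLM wheels inside the container
# also matter for deserialization; these are common ways to inspect them.
nvcc --version 2>/dev/null | grep release || true
pip3 show tensorrt tensorrt_llm 2>/dev/null | grep -E '^(Name|Version):'
```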

pcastonguay commented 6 months ago

The error typically indicates a mismatch between the TRT-LLM version used to generate the TRT engine and the version used by the runtime. Could you verify that they are the same? Or maybe the generation of the engine for the TP=4 case failed for some reason, preventing proper deserialization of the engine?
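
One way to act on both suggestions is to print the installed TensorRT-LLM version in the build environment and in the Triton serving container, and to check that the TP=4 build actually left one engine per rank in the output directory. A rough sketch (the engine path is a placeholder):

```bash
# In the container/environment where the engine was built:
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

# In the Triton serving container (the runtime side); the two numbers
# should match.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

# A tp_size=4 build is expected to leave one engine file per rank in the
# output directory (plus config.json); a missing or truncated rank file
# would also make deserialization fail.
ls -lh /path/to/engines/tp4/   # placeholder path
```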

itechbear commented 6 months ago

~~I just ran into the same error with TensorRT-LLM 0.8.0, Qwen-72B-Chat, TP=4. However, I haven't tried other TP values.~~

~~I used nvidia/cuda:12.1.0-devel-ubuntu22.04 to build the model engine and nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 to serve the model. Their CUDA versions differ in the minor version (12.1 vs. 12.3). I'm not sure whether that matters.~~

Never mind, it was probably caused by something else. After I restarted the container, the service spun up successfully.

aoyifei commented 6 months ago

> The error typically indicates a mismatch between the TRT-LLM version used to generate the TRT engine and the version used by the runtime. Could you verify that they are the same? Or maybe the generation of the engine for the TP=4 case failed for some reason, preventing proper deserialization of the engine?

I also encountered a similar problem. Could you give some guidance on how to check the TRT-LLM version used when generating the TRT engine and the version used by the runtime? I built the environment from the NGC image nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 and then cloned the TensorRT-LLM repo:

    # Update the submodule TensorRT-LLM repository
    git submodule update --init --recursive
    git lfs install
    git lfs pull

    (cd tensorrt_llm && bash docker/common/install_cmake.sh && export PATH=/usr/local/cmake/bin:$PATH && python3 ./scripts/build_wheel.py --trt_root="/usr/local/tensorrt" && pip3 install ./build/tensorrt_llm*.whl)

After that, I used convert_checkpoint.py to convert the ChatGLM3 model checkpoint, then used trtllm-build to build the engine, then started triton-server, all in the same container. So I am wondering where I went wrong: should I check out release 0.8.0 after cloning the TensorRT-LLM repo? And for now, how can I check the TRT-LLM version that built the engine and the version of the runtime?
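
Not an official procedure, but two checks that usually answer this: recent TensorRT-LLM releases record the builder version in the config.json that trtllm-build writes into the engine output directory, and the runtime side exposes `tensorrt_llm.__version__`. A sketch, with the engine directory path as a placeholder:

```bash
# Version recorded at build time (if the builder wrote one): the engine
# output directory produced by trtllm-build contains a config.json; recent
# releases store the TRT-LLM version there under a top-level "version" key.
python3 -c "import json; cfg = json.load(open('/path/to/engine_dir/config.json')); print(cfg.get('version', 'no version field recorded'))"

# Version installed in the container that serves the engine (the runtime).
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```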

PeterWang1986 commented 4 months ago

Hi all, I upgraded the TRT-LLM version to v0.9.0, and now everything works fine for:

  1. mistral-7b with TP = 4 on A30
  2. mixtral-8x7b with TP = 4 on A100-SXM-80g

However, it still fails on L40S for mistral-7b (TP=4) with the following error:

    error: creating server: Invalid argument - load failed for model 'tensorrt_llm': version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:68)