NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
8.61k stars 978 forks

Baichuan model encounters an error when use_inflight_batching. #196

Open viningz opened 1 year ago

viningz commented 1 year ago

When converting the Baichuan model with in-flight batching enabled, I hit a build error. The error message is as follows:

```
[10/30/2023-09:28:34] [TRT] [W] Unused Input: position_ids
[10/30/2023-09:28:34] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[10/30/2023-09:28:34] [TRT] [I] Graph optimization time: 0.0630199 seconds.
[10/30/2023-09:28:34] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 15556, GPU 1786 (MiB)
[10/30/2023-09:28:34] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 15558, GPU 1796 (MiB)
[10/30/2023-09:28:34] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[10/30/2023-09:28:35] [TRT] [E] 9: Skipping tactic 0x0000000000000000 due to exception PLUGIN_V2 operation not supported within this graph.
[10/30/2023-09:28:36] [TRT] [E] 10: Could not find any implementation for node {ForeignNode[BaichuanForCausalLM/layers/0/attention/PLUGIN_V2_GPTAttention_0]}.
[10/30/2023-09:28:36] [TRT] [E] 10: [optimizer.cpp::computeCosts::4040] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[BaichuanForCausalLM/layers/0/attention/PLUGIN_V2_GPTAttention_0]}.)
[10/30/2023-09:28:36] [TRT-LLM] [E] Engine building failed, please check the error log.
[10/30/2023-09:28:36] [TRT-LLM] [I] Config saved to /home/kas/models/word_layout/baichuang/lora_v10_merge_2/trt_engines/fp16/1-gpu-page-kv-cache/config.json.
Traceback (most recent call last):
  File "/home/kas/kas_workspace/zhengweining/TensorRT-LLM/examples/baichuan/build.py", line 477, in <module>
    build(0, args)
  File "/home/kas/kas_workspace/zhengweining/TensorRT-LLM/examples/baichuan/build.py", line 449, in build
    assert engine is not None, f'Failed to build engine for rank {cur_rank}'
AssertionError: Failed to build engine for rank 0
```

Could you please let me know how to resolve this issue?

jdemouth-nvidia commented 1 year ago

Hi @viningz ,

Can you share the command-line you used to build the engine, please?

Thanks, Julien

viningz commented 1 year ago

> Hi @viningz ,
>
> Can you share the command-line you used to build the engine, please?
>
> Thanks, Julien

Thank you very much for your reply! The command line I used is:

```
python build.py --model_version v1_7b \
                --model_dir baichuan-inc/Baichuan-13B-Chat \
                --dtype float16 \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --output_dir ./tmp/baichuan_v1_13b/trt_engines/fp16/1-gpu/ \
                --use_inflight_batching
```
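For context, in the TensorRT-LLM example build scripts of that era, in-flight batching was generally expected to be combined with the GPT attention plugin, padding removal, and a paged KV cache. A hedged sketch of what such an invocation might look like (flag names and defaults vary across TensorRT-LLM versions, so check `python build.py --help` for your checkout; this is not a confirmed fix for the error above):

```shell
# Sketch only: assumes a build.py that accepts these flags, as in the
# examples/baichuan directory of TensorRT-LLM around late 2023.
python build.py --model_version v1_7b \
                --model_dir baichuan-inc/Baichuan-13B-Chat \
                --dtype float16 \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --remove_input_padding \
                --paged_kv_cache \
                --use_inflight_batching \
                --output_dir ./tmp/baichuan_v1_13b/trt_engines/fp16/1-gpu/
```

Some versions of the example scripts enable the companion flags automatically when `--use_inflight_batching` is passed, so whether they must be given explicitly depends on the branch being built.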

viningz commented 1 year ago

> Hi @viningz ,
>
> Can you share the command-line you used to build the engine, please?
>
> Thanks, Julien

When I build without the `--use_inflight_batching` option, the converted model works fine.

renwuli commented 11 months ago

Same issue for llama_7b.

byshiue commented 11 months ago

Could you follow this document and try again on the latest main branch?