NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

QwenVL build failed. #2483

Open Wonder-donbury opened 9 hours ago

Wonder-donbury commented 9 hours ago

System Info

Who can help?

@kaiyux (I'm tagging you since you were the editor of the related commit)

Information

Tasks

Reproduction

Steps to reproduce the behavior

1. Followed steps 1 through 3 (build the TensorRT-LLM engine) of examples/qwenvl with the exact same commands (the checkpoint-conversion part of that step is sketched below for reference).
2. Ran into the error message shown under "Actual behavior".
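For context, the checkpoint-conversion half of the engine-build step produces the ./tllm_checkpoint_1gpu directory that the trtllm-build command further down consumes. A rough sketch of that command, from memory of examples/qwenvl (the paths and the --dtype value are assumptions; the build log below reports the checkpoint dtype as bfloat16, so that value is used here):

python3 ../qwen/convert_checkpoint.py --model_dir=./Qwen-VL-Chat \
                                      --output_dir=./tllm_checkpoint_1gpu \
                                      --dtype=bfloat16

Whatever --dtype is passed here is recorded in the checkpoint's config.json and is what trtllm-build later reports as "Set dtype to ...".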

Expected behavior

The build completes successfully.

Actual behavior

./build_qwenVL.sh contains the command below (the exact same command as in examples/qwenvl):

trtllm-build --checkpoint_dir=./tllm_checkpoint_1gpu \
             --gemm_plugin=float16 --gpt_attention_plugin=float16 \
             --max_input_len=2048 --max_seq_len=3072 \
             --max_batch_size=8 --max_prompt_embedding_table_size=2048 \
             --remove_input_padding=enable \
             --output_dir=./trt_engines/Qwen-VL-7B-Chat

(venv) admin@inference_l4x4_admin:~/tensorrt$ ./build_qwenVL.sh --verbose
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified

[TensorRT-LLM] TensorRT-LLM version: 0.16.0.dev2024111900
[11/22/2024-06:24:54] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set gpt_attention_plugin to float16.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set gemm_plugin to float16.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set nccl_plugin to auto.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set lora_plugin to None.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set moe_plugin to auto.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set context_fmha to True.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set remove_input_padding to True.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set reduce_fusion to False.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set enable_xqa to True.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set tokens_per_block to 64.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set multiple_profiles to False.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set paged_state to True.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set streamingllm to False.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set use_fused_mlp to True.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[11/22/2024-06:24:54] [TRT-LLM] [W] Implicitly setting QWenConfig.qwen_type = qwen
[11/22/2024-06:24:54] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_intermediate_size = 0
[11/22/2024-06:24:54] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_shared_expert_intermediate_size = 0
[11/22/2024-06:24:54] [TRT-LLM] [W] Implicitly setting QWenConfig.tie_word_embeddings = False
[11/22/2024-06:24:54] [TRT-LLM] [I] Set dtype to bfloat16.
[11/22/2024-06:24:54] [TRT-LLM] [I] Set paged_kv_cache to True.
[11/22/2024-06:24:54] [TRT-LLM] [W] Overriding paged_state to False
[11/22/2024-06:24:54] [TRT-LLM] [I] Set paged_state to False.
[11/22/2024-06:24:54] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

[11/22/2024-06:24:54] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[11/22/2024-06:24:57] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 5665, GPU 322 (MiB)
[11/22/2024-06:25:00] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +2276, GPU +440, now: CPU 8097, GPU 762 (MiB)
[11/22/2024-06:25:00] [TRT-LLM] [I] Set nccl_plugin to None.
[11/22/2024-06:25:00] [TRT] [W] IElementWiseLayer with inputs QWenForCausalLM/transformer/vocab_embedding/where_L2963/SELECT_2_output_0 and QWenForCausalLM/transformer/layers/0/attention/dense/multiply_collect_L272/multiply_and_lora_L238/_gemm_plugin_L129/PLUGIN_V2_Gemm_0_output_0: first input has type BFloat16 but second input has type Half.
[11/22/2024-06:25:01] [TRT] [E] ITensor::getDimensions: Error Code 4: API Usage Error (QWenForCausalLM/transformer/layers/0/add___L322/elementwise_binary_L2877/ELEMENTWISE_SUM_0: ElementWiseOperation SUM must have same input types. But they are of types BFloat16 and Half.)
Traceback (most recent call last):
  File "/home/admin/tensorrt/venv/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/home/admin/tensorrt/venv/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 627, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/home/admin/tensorrt/venv/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 425, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/home/admin/tensorrt/venv/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 390, in build_and_save
    engine = build_model(build_config,
  File "/home/admin/tensorrt/venv/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 383, in build_model
    return build(model, build_config)
  File "/home/admin/tensorrt/venv/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 1216, in build
    model(**inputs)
  File "/home/admin/tensorrt/venv/lib/python3.10/site-packages/tensorrt_llm/module.py", line 52, in __call__
    output = self.forward(*args, **kwargs)
  File "/home/admin/tensorrt/venv/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 962, in forward
    hidden_states = self.transformer.forward(**kwargs)
  File "/home/admin/tensorrt/venv/lib/python3.10/site-packages/tensorrt_llm/models/qwen/model.py", line 193, in forward
    hidden_states = self.layers.forward(
  File "/home/admin/tensorrt/venv/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 550, in forward
    hidden_states = layer(
  File "/home/admin/tensorrt/venv/lib/python3.10/site-packages/tensorrt_llm/module.py", line 52, in __call__
    output = self.forward(*args, **kwargs)
  File "/home/admin/tensorrt/venv/lib/python3.10/site-packages/tensorrt_llm/models/qwen/model.py", line 137, in forward
    hidden_states = residual + attention_output
  File "/home/admin/tensorrt/venv/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 322, in __add__
    return add(self, b)
  File "/home/admin/tensorrt/venv/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 2877, in elementwise_binary
    return _create_tensor(layer.get_output(0), layer)
  File "/home/admin/tensorrt/venv/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 608, in _create_tensor
    assert trt_tensor.shape.__len__(
AssertionError: tensor QWenForCausalLM/transformer/layers/0/add___L322/elementwise_binary_L2877/ELEMENTWISE_SUM_0_output_0 has an invalid shape
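The first [TRT] warning above already points at the source of the failure: the vocab_embedding output is BFloat16 (the checkpoint dtype), while the Gemm plugin output is Half (from --gemm_plugin=float16), and the residual add then refuses to sum the two types. One way to confirm which dtype the builder picked up from the converted checkpoint is to read its config directly (a sketch; the top-level "dtype" key is an assumption and may sit elsewhere depending on the TensorRT-LLM version):

python3 -c "import json; print(json.load(open('./tllm_checkpoint_1gpu/config.json'))['dtype'])"
# prints bfloat16 here, matching the "Set dtype to bfloat16" line in the log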

Additional notes

No clue.

Wonder-donbury commented 9 hours ago

Setting both --gemm_plugin and --gpt_attention_plugin to auto resolved the problem, but I still want to know why the data types weren't matching. I've also checked that the compiled model works fine afterwards in TensorRT.
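As far as I can tell, auto makes both plugins follow the checkpoint dtype, which would explain why it fixes the mismatch: the checkpoint was converted to bfloat16, while the original script forced the plugins to float16. A dtype-matched variant of the build command, keeping every other flag from build_qwenVL.sh unchanged, would look roughly like this (a sketch; bfloat16 is an accepted value for both plugin flags, but I have only verified the auto variant above):

trtllm-build --checkpoint_dir=./tllm_checkpoint_1gpu \
             --gemm_plugin=bfloat16 --gpt_attention_plugin=bfloat16 \
             --max_input_len=2048 --max_seq_len=3072 \
             --max_batch_size=8 --max_prompt_embedding_table_size=2048 \
             --remove_input_padding=enable \
             --output_dir=./trt_engines/Qwen-VL-7B-Chat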