NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to add gemm_plugin int8 #2126

Closed. xiangxinhello closed this issue 2 months ago.

xiangxinhello commented 2 months ago

System Info

GPU: A100-PCIe-40GB
TensorRT-LLM version: 0.11.0

Who can help?

@Tracin

Reproduction

```bash
python convert_checkpoint.py --model_dir ./tmp/Qwen/7B/ \
    --output_dir ./tllm_checkpoint_1gpu_fp16_wq \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8
```

```bash
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16_wq \
    --output_dir ./tmp/qwen/7B/trt_engines/weight_only/1-gpu/ \
    --gemm_plugin int8
```

I also set smooth_quant_gemm_plugin to "int8" in set_smooth_quant_plugins:

```python
def set_smooth_quant_plugins(self, dtype: str = "auto"):
    self.smooth_quant_gemm_plugin = "int8"
    self.rmsnorm_quantization_plugin = dtype
    self.layernorm_quantization_plugin = dtype
    self.quantize_per_token_plugin = True
    self.quantize_tensor_plugin = True
    return self
```

Expected behavior

The engine is created successfully.

Actual behavior

Instead, the build fails: the SmoothQuant GEMM plugin emits an Int8 tensor, which is then added to a Half (fp16) bias, so TensorRT rejects the elementwise SUM:

```
[08/16/2024-08:54:52] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to int8.
[08/16/2024-08:54:52] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[08/16/2024-08:54:52] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[08/16/2024-08:54:52] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[08/16/2024-08:54:52] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[08/16/2024-08:54:52] [TRT-LLM] [I] Set nccl_plugin to None.
[08/16/2024-08:54:52] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[08/16/2024-08:54:52] [TRT] [W] IElementWiseLayer with inputs QWenForCausalLM/transformer/layers/0/attention/qkv/smooth_quant_gemm/PLUGIN_V2_SmoothQuantGemm_0_output_0 and QWenForCausalLM/transformer/layers/0/attention/qkv/add/elementwise_binary/broadcast_helper/expand_dims_like/expand_dims/view/SHUFFLE_0_output_0: first input has type Int8 but second input has type Half.
[08/16/2024-08:54:52] [TRT] [E] ITensor::getDimensions: Error Code 4: Internal Error (QWenForCausalLM/transformer/layers/0/attention/qkv/add/elementwise_binary/ELEMENTWISE_SUM_0: ElementWiseOperation SUM must have same input types. But they are of types Int8 and Half.)
Traceback (most recent call last):
  File "/root/anaconda3/envs/trt_llm/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 551, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 373, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 340, in build_and_save
    engine = build_model(build_config,
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 333, in build_model
    return build(model, build_config)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 890, in build
    model(**inputs)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 713, in forward
    hidden_states = self.transformer.forward(**kwargs)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/models/qwen/model.py", line 196, in forward
    hidden_states = self.layers.forward(hidden_states,
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 327, in forward
    hidden_states = layer(
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/models/qwen/model.py", line 121, in forward
    attention_output = self.attention(
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/quantization/layers.py", line 1222, in forward
    qkv = self.qkv(hidden_states)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/quantization/layers.py", line 147, in forward
    x = x + self.bias.value
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 321, in __add__
    return add(self, b)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 2825, in elementwise_binary
    return _create_tensor(layer.get_output(0), layer)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 607, in _create_tensor
    assert trt_tensor.shape.__len__(
AssertionError: tensor QWenForCausalLM/transformer/layers/0/attention/qkv/add/elementwise_binary/ELEMENTWISE_SUM_0_output_0 has an invalid shape
```

Additional notes

I want trtllm-build to support --gemm_plugin int8.

Kefeng-Duan commented 2 months ago

Hi @xiangxinhello, could you provide your /tmp/Qwen/7B/config.json file?

Kefeng-Duan commented 2 months ago

@nv-guomingz for vis

xiangxinhello commented 2 months ago

> Hi @xiangxinhello, could you provide your /tmp/Qwen/7B/config.json file?

Hi @Kefeng-Duan, here is the config.json:

```json
{
  "architectures": ["Qwen2ForCausalLM"],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.37.0",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}
```

xiangxinhello commented 2 months ago

> @nv-guomingz for vis

Hi @nv-guomingz, the config.json is the same as the one posted above.

nv-guomingz commented 2 months ago

I think the modification you applied to def set_smooth_quant_plugins doesn't make sense here, because you converted the model with the weights-only (W8A16) quantization mode, while set_smooth_quant_plugins is a different thing that is used for the SmoothQuant (W8A8) algorithm. If you want to use SmoothQuant, use the command below instead:

```bash
python3 convert_checkpoint.py --model_dir ./Qwen-7B-Chat \
    --output_dir ./qwen_ckpt \
    --dtype float16 \
    --tp_size 1 \
    --smoothquant 0.5 \
    --per_channel \
    --per_token
```
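
For intuition, here is a minimal NumPy sketch of what a SmoothQuant-style W8A8 GEMM does conceptually. This is illustrative only, not TRT-LLM's plugin code, and every name in it is made up:

```python
import numpy as np

def w8a8_gemm(x_fp16, w_int8, w_scale):
    """Illustrative SmoothQuant-style W8A8 matmul.

    x_fp16: (tokens, k) fp16 activations
    w_int8: (n, k) int8 weights
    w_scale: (n,) fp32 per-output-channel weight scales
    """
    # Dynamically quantize the activations to int8, one scale per token
    # (this corresponds to the --per_token knob above).
    a_scale = np.abs(x_fp16).max(axis=-1, keepdims=True).astype(np.float32) / 127.0
    x_int8 = np.clip(np.round(x_fp16 / a_scale), -127, 127).astype(np.int8)
    # The multiply-accumulate itself runs entirely in int8/int32.
    acc_int32 = x_int8.astype(np.int32) @ w_int8.astype(np.int32).T
    # Dequantize the int32 accumulator back to fp16 using both scales
    # (per-channel weight scales correspond to the --per_channel knob).
    return (acc_int32 * a_scale * w_scale).astype(np.float16)
```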

xiangxinhello commented 2 months ago

> I think the modification you applied to def set_smooth_quant_plugins doesn't make sense here ...

How do I use the trtllm-build --gemm_plugin option? It does not support int8. If I use float16, the final matrix multiplication will still be performed in float16.

nv-guomingz commented 2 months ago

It depends, case by case. For int8 weight-only (W8A16), the matrix multiplication running in fp16 is the expected behavior, right?
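
For context, a minimal sketch of why W8A16 keeps the math in fp16 (illustrative NumPy under the same naming assumptions as the sketch above, not the actual fused kernel):

```python
import numpy as np

def w8a16_gemm(x_fp16, w_int8, w_scale):
    """Illustrative weight-only int8 (W8A16) matmul.

    x_fp16: (tokens, k) fp16 activations; w_int8: (n, k) int8 weights;
    w_scale: (n,) fp16 per-output-channel scales.
    """
    # Weights are stored as int8, roughly halving weight memory traffic,
    # and are dequantized on the fly with a per-output-channel scale.
    w_fp16 = w_int8.astype(np.float16) * w_scale[:, None]
    # The multiply-accumulate itself still runs on fp16 tensor cores.
    return x_fp16 @ w_fp16.T
```

So the W8A16 win is memory bandwidth, not int8 arithmetic; only the SmoothQuant (W8A8) path sketched earlier issues actual int8 GEMMs.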

xiangxinhello commented 2 months ago

> It depends, case by case. For int8 weight-only (W8A16), the matrix multiplication running in fp16 is the expected behavior, right?

I want the matrix multiplication to use int8, but trtllm-build --gemm_plugin doesn't support int8.

I just tried these SmoothQuant commands, and trtllm-build --gemm_plugin float16 results in an error. However, when I don't use these quantization options, I'm able to create the engine successfully.

Hugging Face model: Qwen1.5-7B-Chat

```bash
python convert_checkpoint.py --model_dir /workspace/mnt/storage/trt/Qwen1.5-7B-Chat/ \
    --output_dir ./tllm_checkpoint_1gpu_sq_test \
    --dtype float16 \
    --smoothquant 0.5
```

```bash
trtllm-build --checkpoint_dir ./tllm_sq_1gpu/ \
    --output_dir ./qwen1.5-7B-chat-sq/trt_engines/int8/1-gpu \
    --max_output_len 1024 \
    --gemm_plugin float16
```

```
[08/20/2024-02:33:41] [TRT] [W] IElementWiseLayer with inputs QWenForCausalLM/transformer/layers/0/attention/qkv/smooth_quant_gemm/PLUGIN_V2_SmoothQuantGemm_0_output_0 and QWenForCausalLM/transformer/layers/0/attention/qkv/add/elementwise_binary/broadcast_helper/expand_dims_like/expand_dims/view/SHUFFLE_0_output_0: first input has type Int8 but second input has type Half.
[08/20/2024-02:33:41] [TRT] [E] ITensor::getDimensions: Error Code 4: Internal Error (QWenForCausalLM/transformer/layers/0/attention/qkv/add/elementwise_binary/ELEMENTWISE_SUM_0: ElementWiseOperation SUM must have same input types. But they are of types Int8 and Half.)
```

This is followed by the same traceback as in the issue description above, ending in:

```
AssertionError: tensor QWenForCausalLM/transformer/layers/0/attention/qkv/add/elementwise_binary/ELEMENTWISE_SUM_0_output_0 has an invalid shape
```

nv-guomingz commented 2 months ago

Try the commands below with the latest TRT-LLM again; my local testing passes.

```bash
python convert_checkpoint.py --model_dir /workspace/mnt/storage/trt/Qwen1.5-7B-Chat/ \
    --output_dir ./tllm_checkpoint_1gpu_sq_test \
    --dtype float16 \
    --smoothquant 0.5

trtllm-build --checkpoint_dir ./tllm_sq_1gpu/ \
    --output_dir ./qwen1.5-7B-chat-sq/trt_engines/int8/1-gpu \
    --max_output_len 1024 \
    --gemm_plugin float16
```

xiangxinhello commented 2 months ago

> Try the commands below with the latest TRT-LLM again; my local testing passes. ...

Sorry, I modified the TRT-LLM source code, and now I've successfully created an engine with --smoothquant 0.5 --gemm_plugin float16. However, trtllm-build --gemm_plugin does not support int8; it only supports "gemm_plugin": ["auto", "float16", "float32", "bfloat16", "int32", "fp8", None].

nv-guomingz commented 2 months ago

IIRC, TRT-LLM will call the SmoothQuant GEMM plugin as long as you didn't set --gemm_plugin to None and you generated the checkpoint with the --smoothquant knob. Check this for details. So my personal habit is to set the --gemm_plugin field to auto in most cases.
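
Put differently, for a SmoothQuant checkpoint the int8 plugin is chosen from the checkpoint's quantization metadata, and the --gemm_plugin dtype mainly acts as an on/off switch. A paraphrased sketch of that decision (my reading of the behavior described above, not the actual build.py source):

```python
def pick_gemm_path(ckpt_has_smoothquant: bool, gemm_plugin):
    """Paraphrased sketch of the build-time GEMM selection."""
    if ckpt_has_smoothquant and gemm_plugin is not None:
        # SmoothQuant checkpoints get the int8 SmoothQuantGemm plugin
        # regardless of the dtype string passed to --gemm_plugin.
        return "smooth_quant_gemm plugin (int8)"
    if gemm_plugin is not None:   # e.g. "auto", "float16", "bfloat16", "fp8"
        return f"gemm plugin ({gemm_plugin})"
    return "TensorRT native GEMM"

print(pick_gemm_path(True, "auto"))  # -> smooth_quant_gemm plugin (int8)
```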

xiangxinhello commented 2 months ago

> IIRC, TRT-LLM will call the SmoothQuant GEMM plugin as long as you didn't set --gemm_plugin to None and you generated the checkpoint with the --smoothquant knob. ...

Hi @nv-guomingz, I believe that if we could use --gemm_plugin int8, it would definitely be faster than --gemm_plugin float16.

Could you add support for --gemm_plugin int8? Thanks.
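
For what it's worth, a back-of-envelope bound on that claim, assuming a fully compute-bound GEMM and NVIDIA's public A100 datasheet peaks (an assumption, not a measurement from this thread; token-by-token decoding is usually memory-bound, where W8A16 already captures most of the bandwidth win):

```python
# Ideal ceiling for int8 vs fp16 GEMM throughput on A100 (dense, no sparsity).
a100_fp16_tflops = 312   # FP16 tensor-core peak (datasheet)
a100_int8_tops = 624     # INT8 tensor-core peak (datasheet)
print(f"ideal compute-bound speedup: {a100_int8_tops / a100_fp16_tflops:.1f}x")  # 2.0x
```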

xinliu9451 commented 2 months ago

I got an error when I tried https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/README.md#distil-whisper. My command was:

```bash
trtllm-build --checkpoint_dir distil_whisper_medium_en_weights_int8/encoder \
    --output_dir distil_whisper_medium_en_int8/encoder \
    --paged_kv_cache disable \
    --moe_plugin disable \
    --enable_xqa disable \
    --max_batch_size 8 \
    --gemm_plugin disable \
    --bert_attention_plugin float16 \
    --remove_input_padding disable \
    --max_input_len 1500
```

The error reads:

```
2024-08-20 07:45:07.071785: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-20 07:45:07.092394: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-20 07:45:07.098718: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-20 07:45:07.113933: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-20 07:45:08.198287: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[TensorRT-LLM] TensorRT-LLM version: 0.13.0.dev2024081300
[08/20/2024-07:45:09] [TRT-LLM] [W] Option --paged_kv_cache is deprecated, use --kv_cache_type=paged/disabled instead.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set gemm_plugin to None.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set nccl_plugin to auto.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set lookup_plugin to None.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set lora_plugin to None.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set moe_plugin to None.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set context_fmha to True.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set remove_input_padding to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set reduce_fusion to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set enable_xqa to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set tokens_per_block to 64.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set multiple_profiles to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set paged_state to True.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set streamingllm to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set paged_kv_cache to False.
[08/20/2024-07:45:09] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_mels = 80
[08/20/2024-07:45:09] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_audio_ctx = 1500
[08/20/2024-07:45:09] [TRT-LLM] [W] Implicitly setting PretrainedConfig.num_languages = 99
[08/20/2024-07:45:09] [TRT-LLM] [I] Compute capability: (7, 5)
[08/20/2024-07:45:09] [TRT-LLM] [I] SM count: 40
[08/20/2024-07:45:09] [TRT-LLM] [I] SM clock: 1590 MHz
[08/20/2024-07:45:09] [TRT-LLM] [I] int4 TFLOPS: 260
[08/20/2024-07:45:09] [TRT-LLM] [I] int8 TFLOPS: 130
[08/20/2024-07:45:09] [TRT-LLM] [I] fp8 TFLOPS: 0
[08/20/2024-07:45:09] [TRT-LLM] [I] float16 TFLOPS: 65
[08/20/2024-07:45:09] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[08/20/2024-07:45:09] [TRT-LLM] [I] float32 TFLOPS: 8
[08/20/2024-07:45:09] [TRT-LLM] [I] Total Memory: 15 GiB
[08/20/2024-07:45:09] [TRT-LLM] [I] Memory clock: 5001 MHz
[08/20/2024-07:45:09] [TRT-LLM] [I] Memory bus width: 256
[08/20/2024-07:45:09] [TRT-LLM] [I] Memory bandwidth: 320 GB/s
[08/20/2024-07:45:09] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[08/20/2024-07:45:09] [TRT-LLM] [I] PCIe link width: 16
[08/20/2024-07:45:09] [TRT-LLM] [I] PCIe bandwidth: 5 GB/s
[08/20/2024-07:45:09] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[08/20/2024-07:45:09] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[08/20/2024-07:45:09] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[08/20/2024-07:45:09] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[08/20/2024-07:45:09] [TRT-LLM] [I] Set dtype to float16.
[08/20/2024-07:45:09] [TRT-LLM] [W] Overriding paged_state to False
[08/20/2024-07:45:09] [TRT-LLM] [I] Set paged_state to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 2048
[08/20/2024-07:45:09] [TRT-LLM] [W] remove_input_padding is not enabled, the specified max_num_tokens/opt_num_tokens will be ignored.
[08/20/2024-07:45:09] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 213, GPU 103 (MiB)
[08/20/2024-07:45:11] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +904, GPU +180, now: CPU 1272, GPU 283 (MiB)
[08/20/2024-07:45:11] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[08/20/2024-07:45:11] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to float16.
[08/20/2024-07:45:11] [TRT-LLM] [I] Set nccl_plugin to None.
[08/20/2024-07:45:11] [TRT] [E] ITensor::getDimensions: Error Code 4: API Usage Error (WhisperEncoder/encoder_layers/0/attention_layernorm/layer_norm_L5155/NORMALIZATION_0: INormalizationLayer input and scale must have identical types. input type is Half and scale type is Float.)
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 528, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 394, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 361, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 354, in build_model
    return build(model, build_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 1101, in build
    model(**inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 1915, in forward
    hidden_states = encoder_layer(hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 243, in forward
    hidden_states = self.attention_layernorm(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/normalization.py", line 49, in forward
    return layer_norm(x, self.normalized_shape, weight, bias, self.eps)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 5155, in layer_norm
    return _create_tensor(layer.get_output(0), layer)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 607, in _create_tensor
    assert trt_tensor.shape.__len__(
AssertionError: tensor WhisperEncoder/encoder_layers/0/attention_layernorm/layer_norm_L5155/NORMALIZATION_0_output_0 has an invalid shape
```

nv-guomingz commented 2 months ago

> I got an error when I tried https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/README.md#distil-whisper ... (command and log quoted above)

Please file a dedicated ticket for tracking.


nv-guomingz commented 2 months ago

"float16", "float32", "bfloat16", "int32", "fp8", None] You can try my suggestion and use nsys to capture the actual kernel to see if the gemm plugin runs on int8 or fp16.