NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to add gemm_plugin int8 #2126

Closed. xiangxinhello closed this issue 2 months ago.

xiangxinhello commented 2 months ago

System Info

GPU: A100-PCIe-40GB
TensorRT-LLM version: 0.11.0

Who can help?

@Tracin

Reproduction

```bash
python convert_checkpoint.py --model_dir ./tmp/Qwen/7B/ \
    --output_dir ./tllm_checkpoint_1gpu_fp16_wq \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8
```

```bash
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16_wq \
    --output_dir ./tmp/qwen/7B/trt_engines/weight_only/1-gpu/ \
    --gemm_plugin int8
```

I also set smooth_quant_gemm_plugin to "int8" in set_smooth_quant_plugins:

```python
def set_smooth_quant_plugins(self, dtype: str = "auto"):
    self.smooth_quant_gemm_plugin = "int8"
    self.rmsnorm_quantization_plugin = dtype
    self.layernorm_quantization_plugin = dtype
    self.quantize_per_token_plugin = True
    self.quantize_tensor_plugin = True
    return self
```

Expected behavior

The engine is created successfully.

Actual behavior

Instead, the build fails: the SmoothQuant GEMM plugin emits an Int8 tensor, which is then added to a Half (fp16) bias, so TensorRT rejects the elementwise SUM:

```
[08/16/2024-08:54:52] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to int8.
[08/16/2024-08:54:52] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to float16.
[08/16/2024-08:54:52] [TRT-LLM] [I] Set layernorm_quantization_plugin to float16.
[08/16/2024-08:54:52] [TRT-LLM] [I] Set quantize_per_token_plugin to True.
[08/16/2024-08:54:52] [TRT-LLM] [I] Set quantize_tensor_plugin to True.
[08/16/2024-08:54:52] [TRT-LLM] [I] Set nccl_plugin to None.
[08/16/2024-08:54:52] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[08/16/2024-08:54:52] [TRT] [W] IElementWiseLayer with inputs QWenForCausalLM/transformer/layers/0/attention/qkv/smooth_quant_gemm/PLUGIN_V2_SmoothQuantGemm_0_output_0 and QWenForCausalLM/transformer/layers/0/attention/qkv/add/elementwise_binary/broadcast_helper/expand_dims_like/expand_dims/view/SHUFFLE_0_output_0: first input has type Int8 but second input has type Half.
[08/16/2024-08:54:52] [TRT] [E] ITensor::getDimensions: Error Code 4: Internal Error (QWenForCausalLM/transformer/layers/0/attention/qkv/add/elementwise_binary/ELEMENTWISE_SUM_0: ElementWiseOperation SUM must have same input types. But they are of types Int8 and Half.)
Traceback (most recent call last):
  File "/root/anaconda3/envs/trt_llm/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 551, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 373, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 340, in build_and_save
    engine = build_model(build_config,
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 333, in build_model
    return build(model, build_config)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 890, in build
    model(**inputs)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 713, in forward
    hidden_states = self.transformer.forward(**kwargs)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/models/qwen/model.py", line 196, in forward
    hidden_states = self.layers.forward(hidden_states,
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 327, in forward
    hidden_states = layer(
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/models/qwen/model.py", line 121, in forward
    attention_output = self.attention(
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/quantization/layers.py", line 1222, in forward
    qkv = self.qkv(hidden_states)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/quantization/layers.py", line 147, in forward
    x = x + self.bias.value
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 321, in __add__
    return add(self, b)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 2825, in elementwise_binary
    return _create_tensor(layer.get_output(0), layer)
  File "/root/anaconda3/envs/trt_llm/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 607, in _create_tensor
    assert trt_tensor.shape.__len__(
AssertionError: tensor QWenForCausalLM/transformer/layers/0/attention/qkv/add/elementwise_binary/ELEMENTWISE_SUM_0_output_0 has an invalid shape
```

Additional notes

I want trtllm-build to support --gemm_plugin int8.

Kefeng-Duan commented 2 months ago

Hi @xiangxinhello, could you provide your /tmp/Qwen/7B/config.json file?

Kefeng-Duan commented 2 months ago

@nv-guomingz for vis

xiangxinhello commented 2 months ago

> Hi @xiangxinhello, could you provide your /tmp/Qwen/7B/config.json file?

Hi @Kefeng-Duan, here is the config.json:

```json
{
  "architectures": ["Qwen2ForCausalLM"],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.37.0",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}
```

xiangxinhello commented 2 months ago

> @nv-guomingz for vis

Hi @nv-guomingz, the config.json is the same as the one posted above.

nv-guomingz commented 2 months ago

I think the modification you applied to def set_smooth_quant_plugins doesn't make sense here, because you converted the model with the weights-only (W8A16) quantization mode, while set_smooth_quant_plugins is a different thing that is used for the SmoothQuant (W8A8) algorithm. If you want to use SmoothQuant, use the command below instead:

```bash
python3 convert_checkpoint.py --model_dir ./Qwen-7B-Chat \
    --output_dir ./qwen_ckpt \
    --dtype float16 \
    --tp_size 1 \
    --smoothquant 0.5 \
    --per_channel \
    --per_token
```
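
For intuition, here is a minimal NumPy sketch of what a SmoothQuant-style W8A8 GEMM does conceptually. This is illustrative only, not TRT-LLM's plugin code, and every name in it is made up:

```python
import numpy as np

def w8a8_gemm(x_fp16, w_int8, w_scale):
    """Illustrative SmoothQuant-style W8A8 matmul.

    x_fp16: (tokens, k) fp16 activations
    w_int8: (n, k) int8 weights
    w_scale: (n,) fp32 per-output-channel weight scales
    """
    # Dynamically quantize the activations to int8, one scale per token
    # (this corresponds to the --per_token knob above).
    a_scale = np.abs(x_fp16).max(axis=-1, keepdims=True).astype(np.float32) / 127.0
    x_int8 = np.clip(np.round(x_fp16 / a_scale), -127, 127).astype(np.int8)
    # The multiply-accumulate itself runs entirely in int8/int32.
    acc_int32 = x_int8.astype(np.int32) @ w_int8.astype(np.int32).T
    # Dequantize the int32 accumulator back to fp16 using both scales
    # (per-channel weight scales correspond to the --per_channel knob).
    return (acc_int32 * a_scale * w_scale).astype(np.float16)
```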

xiangxinhello commented 2 months ago

> I think the modification you applied to def set_smooth_quant_plugins doesn't make sense here ...

How do I use the trtllm-build --gemm_plugin option? It does not support int8. If I use float16, the final matrix multiplication will still be performed in float16.

nv-guomingz commented 2 months ago

It depends, case by case. For int8 weight-only (W8A16), the matrix multiplication running in fp16 is the expected behavior, right?
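
For context, a minimal sketch of why W8A16 keeps the math in fp16 (illustrative NumPy under the same naming assumptions as the sketch above, not the actual fused kernel):

```python
import numpy as np

def w8a16_gemm(x_fp16, w_int8, w_scale):
    """Illustrative weight-only int8 (W8A16) matmul.

    x_fp16: (tokens, k) fp16 activations; w_int8: (n, k) int8 weights;
    w_scale: (n,) fp16 per-output-channel scales.
    """
    # Weights are stored as int8, roughly halving weight memory traffic,
    # and are dequantized on the fly with a per-output-channel scale.
    w_fp16 = w_int8.astype(np.float16) * w_scale[:, None]
    # The multiply-accumulate itself still runs on fp16 tensor cores.
    return x_fp16 @ w_fp16.T
```

So the W8A16 win is memory bandwidth, not int8 arithmetic; only the SmoothQuant (W8A8) path sketched earlier issues actual int8 GEMMs.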

xiangxinhello commented 2 months ago

> It depends, case by case. For int8 weight-only (W8A16), the matrix multiplication running in fp16 is the expected behavior, right?

I want the matrix multiplication to use int8, but trtllm-build --gemm_plugin doesn't support int8.

I just tried these SmoothQuant commands, and trtllm-build --gemm_plugin float16 results in an error. However, when I don't use these quantization options, I'm able to create the engine successfully.

Hugging Face model: Qwen1.5-7B-Chat

```bash
python convert_checkpoint.py --model_dir /workspace/mnt/storage/trt/Qwen1.5-7B-Chat/ \
    --output_dir ./tllm_checkpoint_1gpu_sq_test \
    --dtype float16 \
    --smoothquant 0.5
```

```bash
trtllm-build --checkpoint_dir ./tllm_sq_1gpu/ \
    --output_dir ./qwen1.5-7B-chat-sq/trt_engines/int8/1-gpu \
    --max_output_len 1024 \
    --gemm_plugin float16
```

```
[08/20/2024-02:33:41] [TRT] [W] IElementWiseLayer with inputs QWenForCausalLM/transformer/layers/0/attention/qkv/smooth_quant_gemm/PLUGIN_V2_SmoothQuantGemm_0_output_0 and QWenForCausalLM/transformer/layers/0/attention/qkv/add/elementwise_binary/broadcast_helper/expand_dims_like/expand_dims/view/SHUFFLE_0_output_0: first input has type Int8 but second input has type Half.
[08/20/2024-02:33:41] [TRT] [E] ITensor::getDimensions: Error Code 4: Internal Error (QWenForCausalLM/transformer/layers/0/attention/qkv/add/elementwise_binary/ELEMENTWISE_SUM_0: ElementWiseOperation SUM must have same input types. But they are of types Int8 and Half.)
```

This is followed by the same traceback as in the issue description above, ending in:

```
AssertionError: tensor QWenForCausalLM/transformer/layers/0/attention/qkv/add/elementwise_binary/ELEMENTWISE_SUM_0_output_0 has an invalid shape
```

nv-guomingz commented 2 months ago

Try the commands below with the latest TRT-LLM again; my local testing passes.

```bash
python convert_checkpoint.py --model_dir /workspace/mnt/storage/trt/Qwen1.5-7B-Chat/ \
    --output_dir ./tllm_checkpoint_1gpu_sq_test \
    --dtype float16 \
    --smoothquant 0.5

trtllm-build --checkpoint_dir ./tllm_sq_1gpu/ \
    --output_dir ./qwen1.5-7B-chat-sq/trt_engines/int8/1-gpu \
    --max_output_len 1024 \
    --gemm_plugin float16
```

xiangxinhello commented 2 months ago

> Try the commands below with the latest TRT-LLM again; my local testing passes. ...

Sorry, I modified the TRT-LLM source code, and now I've successfully created an engine with --smoothquant 0.5 --gemm_plugin float16. However, trtllm-build --gemm_plugin does not support int8; it only supports "gemm_plugin": ["auto", "float16", "float32", "bfloat16", "int32", "fp8", None].

nv-guomingz commented 2 months ago

IIRC, TRT-LLM will call the SmoothQuant GEMM plugin as long as you didn't set --gemm_plugin to None and you generated the checkpoint with the --smoothquant knob. Check this for details. So my personal habit is to set the --gemm_plugin field to auto in most cases.
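
Put differently, for a SmoothQuant checkpoint the int8 plugin is chosen from the checkpoint's quantization metadata, and the --gemm_plugin dtype mainly acts as an on/off switch. A paraphrased sketch of that decision (my reading of the behavior described above, not the actual build.py source):

```python
def pick_gemm_path(ckpt_has_smoothquant: bool, gemm_plugin):
    """Paraphrased sketch of the build-time GEMM selection."""
    if ckpt_has_smoothquant and gemm_plugin is not None:
        # SmoothQuant checkpoints get the int8 SmoothQuantGemm plugin
        # regardless of the dtype string passed to --gemm_plugin.
        return "smooth_quant_gemm plugin (int8)"
    if gemm_plugin is not None:   # e.g. "auto", "float16", "bfloat16", "fp8"
        return f"gemm plugin ({gemm_plugin})"
    return "TensorRT native GEMM"

print(pick_gemm_path(True, "auto"))  # -> smooth_quant_gemm plugin (int8)
```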

xiangxinhello commented 2 months ago

> IIRC, TRT-LLM will call the SmoothQuant GEMM plugin as long as you didn't set --gemm_plugin to None and you generated the checkpoint with the --smoothquant knob. ...

Hi @nv-guomingz, I believe that if we could use --gemm_plugin int8, it would definitely be faster than --gemm_plugin float16.

Could you add support for --gemm_plugin int8? Thanks.
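
For what it's worth, a back-of-envelope bound on that claim, assuming a fully compute-bound GEMM and NVIDIA's public A100 datasheet peaks (an assumption, not a measurement from this thread; token-by-token decoding is usually memory-bound, where W8A16 already captures most of the bandwidth win):

```python
# Ideal ceiling for int8 vs fp16 GEMM throughput on A100 (dense, no sparsity).
a100_fp16_tflops = 312   # FP16 tensor-core peak (datasheet)
a100_int8_tops = 624     # INT8 tensor-core peak (datasheet)
print(f"ideal compute-bound speedup: {a100_int8_tops / a100_fp16_tflops:.1f}x")  # 2.0x
```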

xinliu9451 commented 2 months ago

I got an error when I tried https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/README.md#distil-whisper. My command was:

```bash
trtllm-build --checkpoint_dir distil_whisper_medium_en_weights_int8/encoder \
    --output_dir distil_whisper_medium_en_int8/encoder \
    --paged_kv_cache disable \
    --moe_plugin disable \
    --enable_xqa disable \
    --max_batch_size 8 \
    --gemm_plugin disable \
    --bert_attention_plugin float16 \
    --remove_input_padding disable \
    --max_input_len 1500
```

The error reads:

```
2024-08-20 07:45:07.071785: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-20 07:45:07.092394: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-20 07:45:07.098718: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-20 07:45:07.113933: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-20 07:45:08.198287: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[TensorRT-LLM] TensorRT-LLM version: 0.13.0.dev2024081300
[08/20/2024-07:45:09] [TRT-LLM] [W] Option --paged_kv_cache is deprecated, use --kv_cache_type=paged/disabled instead.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set gemm_plugin to None.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set nccl_plugin to auto.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set lookup_plugin to None.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set lora_plugin to None.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set moe_plugin to None.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set context_fmha to True.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set remove_input_padding to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set reduce_fusion to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set enable_xqa to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set tokens_per_block to 64.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set multiple_profiles to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set paged_state to True.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set streamingllm to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] Set paged_kv_cache to False.
[08/20/2024-07:45:09] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_mels = 80
[08/20/2024-07:45:09] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_audio_ctx = 1500
[08/20/2024-07:45:09] [TRT-LLM] [W] Implicitly setting PretrainedConfig.num_languages = 99
[08/20/2024-07:45:09] [TRT-LLM] [I] Compute capability: (7, 5)
[08/20/2024-07:45:09] [TRT-LLM] [I] SM count: 40
[08/20/2024-07:45:09] [TRT-LLM] [I] SM clock: 1590 MHz
[08/20/2024-07:45:09] [TRT-LLM] [I] int4 TFLOPS: 260
[08/20/2024-07:45:09] [TRT-LLM] [I] int8 TFLOPS: 130
[08/20/2024-07:45:09] [TRT-LLM] [I] fp8 TFLOPS: 0
[08/20/2024-07:45:09] [TRT-LLM] [I] float16 TFLOPS: 65
[08/20/2024-07:45:09] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[08/20/2024-07:45:09] [TRT-LLM] [I] float32 TFLOPS: 8
[08/20/2024-07:45:09] [TRT-LLM] [I] Total Memory: 15 GiB
[08/20/2024-07:45:09] [TRT-LLM] [I] Memory clock: 5001 MHz
[08/20/2024-07:45:09] [TRT-LLM] [I] Memory bus width: 256
[08/20/2024-07:45:09] [TRT-LLM] [I] Memory bandwidth: 320 GB/s
[08/20/2024-07:45:09] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[08/20/2024-07:45:09] [TRT-LLM] [I] PCIe link width: 16
[08/20/2024-07:45:09] [TRT-LLM] [I] PCIe bandwidth: 5 GB/s
[08/20/2024-07:45:09] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[08/20/2024-07:45:09] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[08/20/2024-07:45:09] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[08/20/2024-07:45:09] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[08/20/2024-07:45:09] [TRT-LLM] [I] Set dtype to float16.
[08/20/2024-07:45:09] [TRT-LLM] [W] Overriding paged_state to False
[08/20/2024-07:45:09] [TRT-LLM] [I] Set paged_state to False.
[08/20/2024-07:45:09] [TRT-LLM] [I] max_seq_len is not specified, using deduced value 2048
[08/20/2024-07:45:09] [TRT-LLM] [W] remove_input_padding is not enabled, the specified max_num_tokens/opt_num_tokens will be ignored.
[08/20/2024-07:45:09] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 213, GPU 103 (MiB)
[08/20/2024-07:45:11] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +904, GPU +180, now: CPU 1272, GPU 283 (MiB)
[08/20/2024-07:45:11] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[08/20/2024-07:45:11] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to float16.
[08/20/2024-07:45:11] [TRT-LLM] [I] Set nccl_plugin to None.
[08/20/2024-07:45:11] [TRT] [E] ITensor::getDimensions: Error Code 4: API Usage Error (WhisperEncoder/encoder_layers/0/attention_layernorm/layer_norm_L5155/NORMALIZATION_0: INormalizationLayer input and scale must have identical types. input type is Half and scale type is Float.)
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 528, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 394, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 361, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 354, in build_model
    return build(model, build_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 1101, in build
    model(**inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 1915, in forward
    hidden_states = encoder_layer(hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 243, in forward
    hidden_states = self.attention_layernorm(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/normalization.py", line 49, in forward
    return layer_norm(x, self.normalized_shape, weight, bias, self.eps)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 5155, in layer_norm
    return _create_tensor(layer.get_output(0), layer)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 607, in _create_tensor
    assert trt_tensor.shape.__len__(
AssertionError: tensor WhisperEncoder/encoder_layers/0/attention_layernorm/layer_norm_L5155/NORMALIZATION_0_output_0 has an invalid shape
```

nv-guomingz commented 2 months ago

> I got an error when I tried https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/README.md#distil-whisper ... (command and log quoted above)

Please file a dedicated ticket for tracking.


nv-guomingz commented 2 months ago

"float16", "float32", "bfloat16", "int32", "fp8", None] You can try my suggestion and use nsys to capture the actual kernel to see if the gemm plugin runs on int8 or fp16.