NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
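For context, a minimal sketch of that Python API, adapted from the project's quickstart; the high-level `LLM` entry point and the model name below are illustrative and may differ across releases:

```python
# Minimal sketch of the high-level Python API, adapted from the project's
# quickstart; import paths and defaults may vary between releases.
from tensorrt_llm import LLM, SamplingParams

# The TensorRT engine build happens implicitly when the LLM object is constructed.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # illustrative model
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(["Hello, my name is"], sampling_params):
    print(output.outputs[0].text)
```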

Cannot run Whisper on T4 #1865

Closed: ZJU-lishuang closed this issue 5 days ago

ZJU-lishuang commented 4 months ago

System Info

x86_64, 755G, NVIDIA T4, Ubuntu 22.04

TensorRT-LLM version: https://github.com/NVIDIA/TensorRT-LLM/archive/9691e12bce7ae1c126c435a049eb516eb119486c.zip

pip install tensorrt-llm==0.11.0.dev2024062500 --extra-index-url https://pypi.nvidia.com
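A quick sanity check that the pinned dev wheel is the one actually loaded (its version string also appears on the first line of the build log below):

```python
# Verify the installed wheel; should match the pinned dev build.
import tensorrt_llm
print(tensorrt_llm.__version__)  # expected: 0.11.0.dev2024062500
```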

Who can help?

@Tracin

Reproduction

Just run the build steps from https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/whisper#build-tensorrt-engines (sketched below).
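For reference, the build flow in that README is roughly the following (paraphrased sketch; the authoritative scripts and flags are in the linked README, and the plugin settings mirror the ones echoed in the log below):

```bash
# Rough sketch of the Whisper example's build flow; paths and model names
# are illustrative, see the linked README for the exact commands.
python3 convert_checkpoint.py --output_dir whisper_large_v3_weights

# Encoder and decoder are built as separate engines.
trtllm-build --checkpoint_dir whisper_large_v3_weights/encoder \
             --output_dir whisper_large_v3/encoder \
             --bert_attention_plugin float16 \
             --paged_kv_cache disable \
             --remove_input_padding disable

trtllm-build --checkpoint_dir whisper_large_v3_weights/decoder \
             --output_dir whisper_large_v3/decoder \
             --paged_kv_cache disable \
             --remove_input_padding disable
```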

Expected behavior

The build commands complete successfully and produce the TensorRT engines.

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024062500
[06/28/2024-11:18:30] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set gemm_plugin to None.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set nccl_plugin to auto.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set lookup_plugin to None.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set lora_plugin to None.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set moe_plugin to None.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set context_fmha to True.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set paged_kv_cache to False.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set remove_input_padding to False.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set use_custom_all_reduce to False.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set reduce_fusion to False.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set multi_block_mode to False.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set enable_xqa to False.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set tokens_per_block to 64.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set multiple_profiles to False.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set paged_state to True.
[06/28/2024-11:18:30] [TRT-LLM] [I] Set streamingllm to False.
[06/28/2024-11:18:30] [TRT-LLM] [I] Compute capability: (7, 5)
[06/28/2024-11:18:30] [TRT-LLM] [I] SM count: 40
[06/28/2024-11:18:30] [TRT-LLM] [I] SM clock: 1590 MHz
[06/28/2024-11:18:30] [TRT-LLM] [I] int4 TFLOPS: 260
[06/28/2024-11:18:30] [TRT-LLM] [I] int8 TFLOPS: 130
[06/28/2024-11:18:30] [TRT-LLM] [I] fp8 TFLOPS: 0
[06/28/2024-11:18:30] [TRT-LLM] [I] float16 TFLOPS: 65
[06/28/2024-11:18:30] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[06/28/2024-11:18:30] [TRT-LLM] [I] float32 TFLOPS: 8
[06/28/2024-11:18:30] [TRT-LLM] [I] Total Memory: 15 GiB
[06/28/2024-11:18:30] [TRT-LLM] [I] Memory clock: 5001 MHz
[06/28/2024-11:18:30] [TRT-LLM] [I] Memory bus width: 256
[06/28/2024-11:18:30] [TRT-LLM] [I] Memory bandwidth: 320 GB/s
[06/28/2024-11:18:30] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[06/28/2024-11:18:30] [TRT-LLM] [I] PCIe link width: 16
[06/28/2024-11:18:30] [TRT-LLM] [I] PCIe bandwidth: 5 GB/s
[06/28/2024-11:18:30] [TRT-LLM] [W] remove_input_padding is not enabled, the specified max_num_tokens/opt_num_tokens will be ignored.
[06/28/2024-11:18:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_mels = 128
[06/28/2024-11:18:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.n_audio_ctx = 1500
[06/28/2024-11:18:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.num_languages = 100
[06/28/2024-11:18:30] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[06/28/2024-11:18:30] [TRT-LLM] [I] Set dtype to float16.
[06/28/2024-11:18:30] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 146, GPU 105 (MiB)
[06/28/2024-11:18:32] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +901, GPU +180, now: CPU 1195, GPU 285 (MiB)
[06/28/2024-11:18:32] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[06/28/2024-11:18:32] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to float16.
[06/28/2024-11:18:32] [TRT-LLM] [W] allreduce algorithm is selected automatically during execution now. use_custom_all_reduce will be deprecated in future releases. 
[06/28/2024-11:18:32] [TRT-LLM] [I] Set nccl_plugin to None.
[06/28/2024-11:18:32] [TRT-LLM] [I] Set use_custom_all_reduce to False.
[TensorRT-LLM][WARNING] Fall back to unfused MHA because of unsupported head size 64 in sm_{75}.
[... the warning above is repeated 32 times in total ...]
[06/28/2024-11:18:32] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[06/28/2024-11:18:32] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[06/28/2024-11:18:33] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[06/28/2024-11:18:33] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[06/28/2024-11:18:49] [TRT] [E] Error Code: 10: Could not find any implementation for node WhisperEncoder/conv1/conv1d/CONVOLUTION_0 + PWN(PWN(PWN(PWN(PWN(PWN(WhisperEncoder/gelu/elementwise_binary/broadcast_helper/constant_to_tensor_/constant/CONSTANT_0 + WhisperEncoder/gelu/elementwise_binary/broadcast_helper/expand_dims_like/expand_dims/view/SHUFFLE_0 + unsqueeze_node_after_WhisperEncoder/gelu/elementwise_binary/broadcast_helper/constant_to_tensor_/constant/CONSTANT_0 + WhisperEncoder/gelu/elementwise_binary/broadcast_hel
[06/28/2024-11:18:49] [TRT] [E] IBuilder::buildSerializedNetwork: Error Code 10: Internal Error (Could not find any implementation for node WhisperEncoder/conv1/conv1d/CONVOLUTION_0 + PWN(PWN(PWN(PWN(PWN(PWN(WhisperEncoder/gelu/elementwise_binary/broadcast_helper/constant_to_tensor_/constant/CONSTANT_0 + WhisperEncoder/gelu/elementwise_binary/broadcast_helper/expand_dims_like/expand_dims/view/SHUFFLE_0 + unsqueeze_node_after_WhisperEncoder/gelu/elementwise_binary/broadcast_helper/constant_to_tensor_/constant/CONSTANT_0 + Whi
[06/28/2024-11:18:49] [TRT-LLM] [E] Engine building failed, please check the error log.
[06/28/2024-11:18:49] [TRT] [I] Serialized 26 bytes of code generator cache.
[06/28/2024-11:18:49] [TRT] [I] Serialized 10 timing cache entries
[06/28/2024-11:18:49] [TRT-LLM] [I] Timing cache serialized to model.cache
[06/28/2024-11:18:49] [TRT-LLM] [I] Total time of building all engines: 00:00:19

Additional notes

I hope there is a way to solve this.

Thanks.

nv-guomingz commented 4 months ago

@ZJU-lishuang we confirmed it's a TensorRT bug, and it will be fixed in an upcoming release.
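Once a release containing the fix is out, upgrading the wheel the same way it was installed should pick it up (hedged sketch; the exact first fixed version is not stated in this thread):

```bash
# Hypothetical upgrade path: pull the newest pre-release wheel from NVIDIA's
# index once the fix has landed; the first fixed version isn't named here.
pip install --upgrade --pre tensorrt-llm --extra-index-url https://pypi.nvidia.com
```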

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

Skywalker-Harrison commented 3 months ago

Hi, has this bug been fixed now?

yuekaizhang commented 3 months ago

> Hi, has this bug been fixed now?

@Skywalker-Harrison Yes, it has been fixed on T4.
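For anyone re-checking on their own card, a quick hedged sketch (using PyTorch, which TensorRT-LLM already pulls in) to confirm the device before rebuilding:

```python
# Confirm the GPU and its compute capability; a T4 reports (7, 5), i.e. sm_75,
# matching the "Compute capability: (7, 5)" line in the build log above.
import torch

print(torch.cuda.get_device_name(0))        # e.g. "Tesla T4"
print(torch.cuda.get_device_capability(0))  # (7, 5) on T4
```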

nv-guomingz commented 5 days ago

Hi @ZJU-lishuang please feel free to reopen this ticket if needed.