NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Whisper build fails with `--remove_input_padding` option #1081

Open lightbooster opened 7 months ago

lightbooster commented 7 months ago

System Info

Who can help?

@byshiue

Information

Tasks

Reproduction

Running the build.py script from examples/whisper with the --remove_input_padding option:

python3 build.py --output_dir whisper_large_v3_no_pad --use_gpt_attention_plugin --use_gemm_plugin --use_bert_attention_plugin --model_dir /assets/ --model_name large-v3 --remove_input_padding

Expected behavior

A serialized engine is expected, without the need to pad each batch to 30-second samples of shape [batch_size, n_mels, 3000].
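For context, a minimal sketch of the padding this build was hoped to avoid (illustrative, plain numpy; not code from this issue): Whisper's front end pads or trims every clip to 30 s of 16 kHz audio, so the log-mel features always come out with 3000 frames (hop length 160 samples = 10 ms).

```python
import numpy as np

SAMPLE_RATE = 16_000          # Whisper's expected sampling rate
CHUNK_SECONDS = 30            # fixed window the encoder was trained on
HOP_LENGTH = 160              # 10 ms hop -> 3000 frames per 30 s chunk
N_FRAMES = CHUNK_SECONDS * SAMPLE_RATE // HOP_LENGTH  # 3000

def pad_or_trim(audio: np.ndarray,
                length: int = CHUNK_SECONDS * SAMPLE_RATE) -> np.ndarray:
    """Pad with zeros (or trim) a mono waveform to exactly 30 s, as Whisper's
    preprocessing does, so the mel features become [n_mels, 3000]."""
    if audio.shape[-1] >= length:
        return audio[..., :length]
    return np.pad(audio, (0, length - audio.shape[-1]))

# e.g. a 7 s utterance still produces a full 3000-frame mel input:
waveform = np.zeros(7 * SAMPLE_RATE, dtype=np.float32)
padded = pad_or_trim(waveform)
assert padded.shape[-1] // HOP_LENGTH == N_FRAMES
```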

actual behavior

The build.py script actually fails with an assertion in the encoder attention layer:

Traceback (most recent call last):
  File "/TensorRT-LLM/examples/whisper/build.py", line 365, in <module>
    run_build(args)
  File "/TensorRT-LLM/examples/whisper/build.py", line 359, in run_build
    build_encoder(model, args)
  File "/TensorRT-LLM/examples/whisper/build.py", line 228, in build_encoder
    tensorrt_llm_whisper_encoder(*inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 1406, in forward
    hidden_states = encoder_layer(hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 239, in forward
    attention_output = self.attention(hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/attention.py", line 1174, in forward
    assert qkv.ndim() == 2
AssertionError

additional notes

tensorrt_llm/layers/attention.py code snippet:

if default_net().plugin_config.remove_input_padding:
    assert qkv.ndim() == 2

actual shape of qkv:

BertAttention.forward.qkv.shape =  (-1, 1500, 3840)

Without the --remove_input_padding option, everything works as expected.
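For readers hitting the same assert, a rough illustration of the layout mismatch (plain numpy, not TensorRT-LLM internals): with remove_input_padding enabled, the attention plugins expect the whole batch packed into a rank-2 tensor of shape (total_tokens, hidden_size) plus per-sequence lengths, while the Whisper encoder still passes the padded rank-3 tensor (batch, 1500, 3840) reported above, so qkv.ndim() is 3 and the assertion fails.

```python
import numpy as np

# Hypothetical illustration (not TensorRT-LLM code): how "remove_input_padding"
# changes the tensor layout the attention plugin expects.
hidden = 3840
seq_lens = [1500, 1500]                       # Whisper encoder always sees 1500 frames

# Padded layout: rank-3, what the Whisper encoder currently produces.
padded = np.zeros((len(seq_lens), max(seq_lens), hidden), dtype=np.float16)
print(padded.ndim)   # 3  -> trips `assert qkv.ndim() == 2`

# Packed ("input padding removed") layout: rank-2, all valid tokens concatenated,
# accompanied by per-sequence lengths so the kernel can find batch boundaries.
packed = np.zeros((sum(seq_lens), hidden), dtype=np.float16)
print(packed.ndim)   # 2  -> what the plugin asserts when remove_input_padding is on
```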

yuekaizhang commented 6 months ago

@lightbooster Hi, Whisper does not support this option yet. I will update here when it works or when we can remove the 30-second restriction.

wangsang123 commented 1 month ago

@lightbooster Hi, Whisper does not support this option yet. I will update here when it works or when we can remove the 30-second restriction.

Is this supported now?

yuekaizhang commented 1 month ago

Is this supported now?

Currently, for Distil-Whisper or fine-tuned Whisper models, it is possible to configure audio lengths other than 30 seconds. The --remove_input_padding option is also supported, but it does not actually remove the padding internally; it only supports the corresponding input/output format. Support for arbitrary-length audio input has not yet been implemented.

wangsang123 commented 1 month ago

Support for arbitrary-length audio input has not yet been implemented.

We are using Whisper for streaming speech recognition. Will this padding increase the amount of computation at the beginning of the audio stream, and will it affect inference speed?

yuekaizhang commented 1 month ago

We are using Whisper for streaming speech recognition. Will this padding increase the amount of computation at the beginning of the audio stream, and will it affect inference speed?

It will increase computation, but it won't add too much because a large part of the model's time consumption is determined by the number of autoregressive steps of the decoder. Padding does not increase this number.
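To make that trade-off concrete, a rough back-of-envelope sketch (illustrative numbers only, not measurements):

```python
# Illustrative arithmetic only: compare one padded encoder pass against the
# decoder's autoregressive loop, which is noted above to dominate latency.
frames_padded = 1500        # 30 s of audio -> 1500 encoder frames (fixed)
frames_real = 500           # e.g. only 10 s of actual speech
tokens_out = 100            # hypothetical transcript length in decoder tokens

# Encoder self-attention scales roughly with frames^2, so padding inflates the
# single encoder pass by (1500/500)^2 = 9x ...
encoder_inflation = (frames_padded / frames_real) ** 2

# ... but that happens once per 30 s chunk, while the decoder repeats its
# forward pass `tokens_out` times, which is why the decoder usually dominates.
print(f"one-off encoder inflation: {encoder_inflation:.0f}x, "
      f"decoder forward passes: {tokens_out}")
```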

By the way, because training and inference must be consistent, the accuracy of the native Whisper model will be compromised if the input is audio other than 30 seconds.

wangsang123 commented 1 month ago

We are using Whisper for streaming speech recognition. Will this padding increase the amount of computation at the beginning of the audio stream, and will it affect inference speed?

It will increase computation, but it won't add too much because a large part of the model's time consumption is determined by the number of autoregressive steps of the decoder. Padding does not increase this number.

By the way, because training and inference must be consistent, the accuracy of the native Whisper model will be compromised if the input is audio other than 30 seconds.

Thanks for the answer. We removed padding during training.