NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Whisper build fails with `--remove_input_padding` option #1081

Open lightbooster opened 7 months ago

lightbooster commented 7 months ago

System Info

Who can help?

@byshiue

Information

Tasks

Reproduction

Running the build.py script from examples/whisper with the --remove_input_padding option:

python3 build.py --output_dir whisper_large_v3_no_pad --use_gpt_attention_plugin --use_gemm_plugin --use_bert_attention_plugin --model_dir /assets/ --model_name large-v3 --remove_input_padding

Expected behavior

A serialized engine is expected, without the need to pad each batch to 30-second samples of shape [batch_size, n_mels, 3000].
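For context, a minimal sketch of the padding this build was hoped to avoid (illustrative, plain numpy; not code from this issue): Whisper's front end pads or trims every clip to 30 s of 16 kHz audio, so the log-mel features always come out with 3000 frames (hop length 160 samples = 10 ms).

```python
import numpy as np

SAMPLE_RATE = 16_000          # Whisper's expected sampling rate
CHUNK_SECONDS = 30            # fixed window the encoder was trained on
HOP_LENGTH = 160              # 10 ms hop -> 3000 frames per 30 s chunk
N_FRAMES = CHUNK_SECONDS * SAMPLE_RATE // HOP_LENGTH  # 3000

def pad_or_trim(audio: np.ndarray,
                length: int = CHUNK_SECONDS * SAMPLE_RATE) -> np.ndarray:
    """Pad with zeros (or trim) a mono waveform to exactly 30 s, as Whisper's
    preprocessing does, so the mel features become [n_mels, 3000]."""
    if audio.shape[-1] >= length:
        return audio[..., :length]
    return np.pad(audio, (0, length - audio.shape[-1]))

# e.g. a 7 s utterance still produces a full 3000-frame mel input:
waveform = np.zeros(7 * SAMPLE_RATE, dtype=np.float32)
padded = pad_or_trim(waveform)
assert padded.shape[-1] // HOP_LENGTH == N_FRAMES
```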

actual behavior

The build.py script actually fails with an assertion in the encoder attention layer:

Traceback (most recent call last):
  File "/TensorRT-LLM/examples/whisper/build.py", line 365, in <module>
    run_build(args)
  File "/TensorRT-LLM/examples/whisper/build.py", line 359, in run_build
    build_encoder(model, args)
  File "/TensorRT-LLM/examples/whisper/build.py", line 228, in build_encoder
    tensorrt_llm_whisper_encoder(*inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 1406, in forward
    hidden_states = encoder_layer(hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 239, in forward
    attention_output = self.attention(hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/attention.py", line 1174, in forward
    assert qkv.ndim() == 2
AssertionError

additional notes

tensorrt_llm/layers/attention.py code snippet:

if default_net().plugin_config.remove_input_padding:
    assert qkv.ndim() == 2

actual shape of qkv:

BertAttention.forward.qkv.shape =  (-1, 1500, 3840)

Without the --remove_input_padding option, everything works as expected.
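For readers hitting the same assert, a rough illustration of the layout mismatch (plain numpy, not TensorRT-LLM internals): with remove_input_padding enabled, the attention plugins expect the whole batch packed into a rank-2 tensor of shape (total_tokens, hidden_size) plus per-sequence lengths, while the Whisper encoder still passes the padded rank-3 tensor (batch, 1500, 3840) reported above, so qkv.ndim() is 3 and the assertion fails.

```python
import numpy as np

# Hypothetical illustration (not TensorRT-LLM code): how "remove_input_padding"
# changes the tensor layout the attention plugin expects.
hidden = 3840
seq_lens = [1500, 1500]                       # Whisper encoder always sees 1500 frames

# Padded layout: rank-3, what the Whisper encoder currently produces.
padded = np.zeros((len(seq_lens), max(seq_lens), hidden), dtype=np.float16)
print(padded.ndim)   # 3  -> trips `assert qkv.ndim() == 2`

# Packed ("input padding removed") layout: rank-2, all valid tokens concatenated,
# accompanied by per-sequence lengths so the kernel can find batch boundaries.
packed = np.zeros((sum(seq_lens), hidden), dtype=np.float16)
print(packed.ndim)   # 2  -> what the plugin asserts when remove_input_padding is on
```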

yuekaizhang commented 6 months ago

@lightbooster Hi, Whisper does not support this option yet. I will update here when it works or when we can remove the 30-second restriction.

wangsang123 commented 1 month ago

@lightbooster Hi, Whisper does not support this option yet. I will update here when it works or when we can remove the 30-second restriction.

Is this supported now?

yuekaizhang commented 1 month ago

Is this supported now?

Currently, for Distil-Whisper or fine-tuned Whisper models, it is possible to configure audio lengths other than 30 seconds. The --remove_input_padding option is also supported, but it does not actually remove the padding internally; it only supports the corresponding input/output format. Support for arbitrary-length audio input has not yet been implemented.

wangsang123 commented 1 month ago

Support for arbitrary-length audio input has not yet been implemented.

We are using Whisper for streaming speech recognition. Will this padding increase the amount of computation at the beginning of the audio stream, and will it affect inference speed?

yuekaizhang commented 1 month ago

We are using Whisper for streaming speech recognition. Will this padding increase the amount of computation at the beginning of the audio stream, and will it affect inference speed?

It will increase computation, but it won't add too much because a large part of the model's time consumption is determined by the number of autoregressive steps of the decoder. Padding does not increase this number.
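To make that trade-off concrete, a rough back-of-envelope sketch (illustrative numbers only, not measurements):

```python
# Illustrative arithmetic only: compare one padded encoder pass against the
# decoder's autoregressive loop, which is noted above to dominate latency.
frames_padded = 1500        # 30 s of audio -> 1500 encoder frames (fixed)
frames_real = 500           # e.g. only 10 s of actual speech
tokens_out = 100            # hypothetical transcript length in decoder tokens

# Encoder self-attention scales roughly with frames^2, so padding inflates the
# single encoder pass by (1500/500)^2 = 9x ...
encoder_inflation = (frames_padded / frames_real) ** 2

# ... but that happens once per 30 s chunk, while the decoder repeats its
# forward pass `tokens_out` times, which is why the decoder usually dominates.
print(f"one-off encoder inflation: {encoder_inflation:.0f}x, "
      f"decoder forward passes: {tokens_out}")
```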

By the way, because training and inference must be consistent, the accuracy of the native Whisper model will be compromised if the input is audio other than 30 seconds.

wangsang123 commented 1 month ago

We are using Whisper for streaming speech recognition. Will this padding increase the amount of computation at the beginning of the audio stream, and will it affect inference speed?

It will increase computation, but it won't add too much because a large part of the model's time consumption is determined by the number of autoregressive steps of the decoder. Padding does not increase this number.

By the way, because training and inference must be consistent, the accuracy of the native Whisper model will be compromised if the input is audio other than 30 seconds.

Thanks for the answer. We removed padding during training.