NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

High WER and Incomplete Transcription Issue with Whisper #1697

Open teith opened 1 month ago

teith commented 1 month ago

System Info

GPU properties: GPU name: NVIDIA A100 GPU memory size: 80 GB

TensorRT-LLM branch: main TensorRT-LLM version: 0.11.0.dev2024052800

OS: Ubuntu 22.04

Who can help?

@kaiyux, @byshiue

Reproduction

I followed the official instructions to build and run Whisper:

python3 run.py --name single_wav_test --engine_dir $output_dir --input_file OSK.wav --assets_dir ./assets

Expected behavior

The expected behavior is a low Word Error Rate (WER) and a complete transcription of the input audio file.

Actual behavior

The actual behavior is a very high WER (4900.00%) and an incomplete transcription of the input audio: the output does not cover the entire content of the audio file.

Additional notes

The full report:

[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052800
prediction:  The little tales they tell are false. The door was barred, locked and bolted as well. Ripe pears are fit for a queen's table. A big wet stain was on the round carpet. The kite dipped and swayed but stayed aloft. The pleasant hours fly by much too soon.
RTF: 0.0298
total_duration: 52.415 seconds
(0.01 hours)
processing time: 1.560 seconds (0.00 hours)
batch size: 4
num_beams: 1
errs-single_wav_test.txt
%WER = 4900.00
Errors: 48 insertions, 0 deletions, 1 substitutions, over 1 reference words (0 correct)
Search below for sections starting with PER-UTT DETAILS:, SUBSTITUTIONS:, DELETIONS:, INSERTIONS:, PER-WORD STATS:
PER-UTT DETAILS: corr or (ref->hyp)  
0:  (->The little tales they tell are false. The door was barred, locked and bolted as well. Ripe pears are fit for a queen's table. A big wet stain was on the round carpet. The kite dipped and swayed but stayed aloft. The pleasant hours fly by much too soon.)
SUBSTITUTIONS: count ref -> hyp
1    -> The
DELETIONS: count ref
INSERTIONS: count hyp
3   The
2   was
2   are
2   and
1   wet
1   well.
1   too
1   they
1   the
1   tell
1   tales
1   table.
1   swayed
1   stayed
1   stain
1   soon.
1   round
1   queen's
1   pleasant
1   pears
1   on
1   much
1   locked
1   little
1   kite
1   hours
1   for
1   fly
1   fit
1   false.
1   door
1   dipped
1   carpet.
1   by
1   but
1   bolted
1   big
1   barred,
1   as
1   aloft.
1   a
1   Ripe
1   A
PER-WORD STATS: word  corr tot_errs count_in_ref count_in_hyp
The   0 4 0 4
was   0 2 0 2
are   0 2 0 2
and   0 2 0 2
wet   0 1 0 1
well.   0 1 0 1
too   0 1 0 1
they   0 1 0 1
the   0 1 0 1
tell   0 1 0 1
tales   0 1 0 1
table.   0 1 0 1
swayed   0 1 0 1
stayed   0 1 0 1
stain   0 1 0 1
soon.   0 1 0 1
round   0 1 0 1
queen's   0 1 0 1
pleasant   0 1 0 1
pears   0 1 0 1
on   0 1 0 1
much   0 1 0 1
locked   0 1 0 1
little   0 1 0 1
kite   0 1 0 1
hours   0 1 0 1
for   0 1 0 1
fly   0 1 0 1
fit   0 1 0 1
false.   0 1 0 1
door   0 1 0 1
dipped   0 1 0 1
carpet.   0 1 0 1
by   0 1 0 1
but   0 1 0 1
bolted   0 1 0 1
big   0 1 0 1
barred,   0 1 0 1
as   0 1 0 1
aloft.   0 1 0 1
a   0 1 0 1
Ripe   0 1 0 1
A   0 1 0 1
   0 1 1 0

recogs-single_wav_test.txt
0:  ref=['']
0:  hyp=['The', 'little', 'tales', 'they', 'tell', 'are', 'false.', 'The', 'door', 'was', 'barred,', 'locked', 'and', 'bolted', 'as', 'well.', 'Ripe', 'pears', 'are', 'fit', 'for', 'a', "queen's", 'table.', 'A', 'big', 'wet', 'stain', 'was', 'on', 'the', 'round', 'carpet.', 'The', 'kite', 'dipped', 'and', 'swayed', 'but', 'stayed', 'aloft.', 'The', 'pleasant', 'hours', 'fly', 'by', 'much', 'too', 'soon.']

rtf-single_wav_test.txt
RTF: 0.0311
total_duration: 52.415 seconds
(0.01 hours)
processing time: 1.632 seconds (0.00 hours)
batch size: 4
num_beams: 1

The audio from my tests is here: https://www.dropbox.com/scl/fi/t3yplx3wzsdzwox66ljxi/OSK.wav?rlkey=p1z0q0mzudrhhwexk1ka71uaq&st=ncgs34k9&dl=0

Why is the Word Error Rate (WER) so high? Why is the audio transcription incomplete?

yuekaizhang commented 1 month ago

@teith https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/run.py#L296 For a single-file test, you may replace this line with your ground truth first. Or you could test with a Hugging Face dataset, the way run.py does.
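The 4900% figure is just what you get from scoring against an empty reference: the log above shows 48 insertions + 1 substitution = 49 errors over a single reference word, i.e. 49/1 = 4900%. A minimal sketch of the word-level WER arithmetic (not the exact scorer in run.py) to show why supplying the real ground truth matters:

```python
# Minimal sketch of word-level WER, not the exact scorer used by run.py.
# WER = (substitutions + deletions + insertions) / number of reference words,
# so an empty reference turns every hypothesis word into an error.
def word_error_rate(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    # Levenshtein distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / max(len(r), 1)

hyp = "The little tales they tell are false. The door was barred, locked and bolted as well."
print(word_error_rate("", hyp))   # 16.0 -> 1600% WER: all 16 hypothesis words count as insertions
print(word_error_rate(hyp, hyp))  # 0.0 once the real ground truth is used as the reference
```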

tianchengcheng-cn commented 1 month ago

Yep, I'm hitting the same incomplete-transcription issue. Have you solved it?

yuekaizhang commented 1 month ago

@tianchengcheng-cn Would you mind trying to increase max_new_tokens here? https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/run.py#L261 I think the transcription is longer than 96 tokens. You could also try a shorter wav file.
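As a quick sanity check on the token budget, you could count how many text tokens the expected transcript needs with the Hugging Face Whisper tokenizer. A minimal sketch (the checkpoint name is an assumption, use whichever Whisper variant your engine was built from):

```python
# Rough token-budget check; openai/whisper-large-v3 is an assumption, use the
# checkpoint your engine was actually built from.
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v3")
expected_transcript = (
    "The little tales they tell are false. The door was barred, locked and "
    "bolted as well. Ripe pears are fit for a queen's table. ..."  # fill in the full expected transcript
)
n_tokens = len(tokenizer(expected_transcript, add_special_tokens=False).input_ids)
print(f"expected transcript needs ~{n_tokens} text tokens")
# If this exceeds max_new_tokens, the decoder stops early and the output is cut off.
```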

tianchengcheng-cn commented 1 month ago

> @tianchengcheng-cn Would you mind trying to increase max_new_tokens here? https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/run.py#L261 I think the transcription is longer than 96 tokens. You could also try a shorter wav file.

The original Whisper can transcribe long audio completely, but this does not work through TensorRT-LLM.

SaadKaleem commented 1 month ago

Hi, I'm facing a similar issue: WER degrades when running batched transcriptions of ~20-30 second audio clips from the Common Voice 16_1 dataset (Spanish subset), and it fluctuates rather than staying consistent.

Note that WER seems fine with a batch size of 1.

The Whisper engine was built using the following command on the v0.9.0 tag:

python3 build.py --use_gpt_attention_plugin --use_gemm_plugin  --use_bert_attention_plugin --enable_context_fmha --max_batch_size 16 --max_beam_width 1 --max_input_len 256 --max_output_len 256

I have also tried higher max_input_len and max_output_len values, and increased max_new_tokens to 192, with max_seq_length varying around 215-250.

yuekaizhang commented 1 month ago

> Hi, I'm facing a similar issue: WER degrades when running batched transcriptions of ~20-30 second audio clips from the Common Voice 16_1 dataset, and it fluctuates rather than staying consistent.
>
> Note that WER seems fine with a batch size of 1.
>
> The Whisper engine was built using the following command on the v0.9.0 tag:
>
> python3 build.py --use_gpt_attention_plugin --use_gemm_plugin --use_bert_attention_plugin --enable_context_fmha --max_batch_size 16 --max_beam_width 1 --max_input_len 256 --max_output_len 256
>
> I have also tried higher max_input_len and max_output_len values, and increased max_new_tokens to 192, with max_seq_length around 212.

@SaadKaleem Could you tell me the WER details for batch 1 and batch > 1, e.g. insertion, deletion, and substitution errors? You could paste them from the generated log files.
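If it helps, here is a small sketch for pulling those summary numbers out of the errs-*.txt files the example writes (format as shown earlier in this thread; the file names below are only examples):

```python
# Small helper to extract the WER summary from an errs-*.txt log
# (lines like "%WER = 4900.00" and "Errors: 48 insertions, 0 deletions, 1 substitutions, ...").
import re

def wer_summary(path: str) -> dict:
    text = open(path).read()
    wer = float(re.search(r"%WER = ([\d.]+)", text).group(1))
    ins, dels, subs = map(int, re.search(
        r"Errors: (\d+) insertions, (\d+) deletions, (\d+) substitutions", text
    ).groups())
    return {"wer": wer, "insertions": ins, "deletions": dels, "substitutions": subs}

for log in ("errs-batch1_test.txt", "errs-batch16_test.txt"):  # hypothetical file names
    print(log, wer_summary(log))
```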

SaadKaleem commented 1 month ago

> Hi, I'm facing a similar issue: WER degrades when running batched transcriptions of ~20-30 second audio clips from the Common Voice 16_1 dataset, and it fluctuates rather than staying consistent. Note that WER seems fine with a batch size of 1. The Whisper engine was built using the following command on the v0.9.0 tag:
>
> python3 build.py --use_gpt_attention_plugin --use_gemm_plugin --use_bert_attention_plugin --enable_context_fmha --max_batch_size 16 --max_beam_width 1 --max_input_len 256 --max_output_len 256
>
> I have also tried higher max_input_len and max_output_len values, and increased max_new_tokens to 192, with max_seq_length around 212.
>
> @SaadKaleem Could you tell me the WER details for batch 1 and batch > 1, e.g. insertion, deletion, and substitution errors? You could paste them from the generated log files.

Hi, sorry - I don't have the whisper example set up currently, but I'll run it and report back later today, hopefully.

Currently, the results I've observed are based on a slightly different (real-time) use case: my own custom script feeds a varying window of audio chunks and "validates" them against the previous transcriptions, so the WER can be slightly worse.

As an example, with a 20-second audio clip and a ~6-second window (ensuring these are not silences), a common issue I'm observing is an empty transcription (but not always!) for the corresponding item in the batch, even though it clearly contains spoken audio. I can't pinpoint the exact issue because it seems very non-deterministic, but I'll keep diagnosing.
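For the non-determinism part, what I have in mind is a check roughly like the sketch below; `transcribe_batch` is only a placeholder for however my pipeline calls the TensorRT-LLM Whisper engine, not an actual API:

```python
# Determinism check: run the same batch several times and flag any item whose
# transcription changes or comes back empty. `transcribe_batch` is a placeholder
# for whatever function in the pipeline calls the TensorRT-LLM Whisper engine;
# it is assumed to take a list of audio chunks and return a list of strings.
def check_determinism(transcribe_batch, audio_chunks, runs=5):
    baseline = transcribe_batch(audio_chunks)
    for run in range(1, runs):
        outputs = transcribe_batch(audio_chunks)
        for idx, (first, now) in enumerate(zip(baseline, outputs)):
            if now != first:
                print(f"run {run}, item {idx}: output changed\n  first: {first!r}\n  now:   {now!r}")
            if not now.strip():
                print(f"run {run}, item {idx}: empty transcription")
```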

github-actions[bot] commented 2 days ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.