Open · teith opened 1 month ago
@teith https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/run.py#L296 For a single-file test, you may replace that line with your ground truth first. Or you could test using a Hugging Face dataset, the way run.py does.
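For illustration, a minimal sketch of that swap, assuming the line in question builds a (sample_id, references, hypothesis_tokens) tuple with an empty reference; the variable names here are placeholders, and the exact line in run.py may differ:

```python
# Hypothetical illustration of supplying a real reference for a single-file
# test so that WER is computed against the actual transcript. The tuple
# layout (id, [reference], hypothesis tokens) is assumed from the scoring
# code, not copied from run.py.
ground_truth = "the known transcript of OSK.wav goes here"  # placeholder
results = [(0, [ground_truth], prediction.split())]
```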
Yep. Have you solved the same issue of incomplete transcription?
@tianchengcheng-cn Would you mind trying to increase max_new_tokens here? https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/run.py#L261 I think the transcription is longer than 96 tokens. Alternatively, you could try a shorter wav file.
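For reference, a sketch of the kind of change meant here; only the parameter name and the default of 96 come from the discussion, the replacement value is an assumption:

```python
# max_new_tokens in examples/whisper/run.py defaults to 96, which truncates
# long transcripts. Whisper's decoder context is 448 tokens, so the usable
# ceiling (prompt tokens included) is bounded by that; 440 is a
# hypothetical, generous choice for a single long utterance.
max_new_tokens = 440
```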
The original Whisper can transcribe long audio completely, but this does not work through TensorRT-LLM.
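For context, one likely reason: upstream openai-whisper's transcribe() loops over the audio in 30-second windows, while the TensorRT-LLM example appears to decode a single padded/trimmed 30-second mel. A rough workaround sketch, where decode_window() is a made-up stand-in for whatever single-window entry point you wire up around the engine:

```python
# Hypothetical chunking workaround: split long audio into Whisper's fixed
# 30-second windows and concatenate the per-window transcripts.
# decode_window() is not a run.py function; it stands for a caller-provided
# single-window call into the TensorRT-LLM engine.
import numpy as np

SAMPLE_RATE = 16000
CHUNK = 30 * SAMPLE_RATE  # Whisper's fixed 30 s receptive field

def transcribe_long(audio: np.ndarray, decode_window) -> str:
    parts = [
        decode_window(audio[start:start + CHUNK])
        for start in range(0, len(audio), CHUNK)
    ]
    return " ".join(p.strip() for p in parts)
```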
Hi, I'm facing a similar issue with WER degradation whilst running batched transcriptions of ~20-30 second audios from the Common Voice 16_1 dataset (Spanish subset). WER seems to fluctuate and is not consistent.
Note that WER seems fine with a batch size of 1.
The whisper engine was built using the following command, on the v0.9.0 tag/branch:
python3 build.py --use_gpt_attention_plugin --use_gemm_plugin --use_bert_attention_plugin --enable_context_fmha --max_batch_size 16 --max_beam_width 1 --max_input_len 256 --max_output_len 256
I have also tried a higher max_input_len and max_output_len, and increased max_new_tokens to 192 while varying max_seq_length around 215-250.
@SaadKaleem Could you tell me the WER details for batch 1 and batch > 1, e.g. insertion, deletion, and substitution errors? You could paste them from the generated log files.
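In case it helps, a minimal sketch of getting that breakdown with the jiwer package (not the repo's own scoring script); refs and hyps are placeholder data:

```python
# Compute WER plus the insertion/deletion/substitution breakdown with jiwer.
import jiwer

refs = ["reference transcript for each utterance"]  # placeholder data
hyps = ["hypothesis produced by the engine"]

out = jiwer.process_words(refs, hyps)
print(f"WER: {out.wer:.2%}")
print(f"insertions={out.insertions} deletions={out.deletions} "
      f"substitutions={out.substitutions} hits={out.hits}")
```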
Hi, sorry - I don't have the whisper.py example set up currently, but I'll run that and report back later in the day, hopefully.
Currently, the results I've observed come from a slightly different (real-time) use case: a custom script/algorithm that feeds a varying window of audio chunks and "validates" them against the previous transcriptions, so the WER can be slightly worse.
As an example, with a 20-second audio, feeding some 6-second windows (ensuring these are not silence), a common issue I'm observing is an empty transcription (but not always!) for the corresponding item in the batch, even though it clearly contains speech. I can't pinpoint the exact issue because it seems very non-deterministic, but I'll keep diagnosing.
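A rough sketch of the windowing/validation idea described above, with all names hypothetical; the real script is presumably more involved:

```python
# Hypothetical sketch of the described approach: slide a 6 s window over the
# audio and only "validate" (commit) words that survive two consecutive
# hypotheses, by keeping their longest common prefix.
import numpy as np

SAMPLE_RATE = 16000
WINDOW = 6 * SAMPLE_RATE

def windows(audio: np.ndarray, hop_sec: float = 1.0):
    hop = int(hop_sec * SAMPLE_RATE)
    for start in range(0, max(1, len(audio) - WINDOW + 1), hop):
        yield audio[start:start + WINDOW]

def validate(prev_words: list[str], new_words: list[str]) -> list[str]:
    committed = []
    for a, b in zip(prev_words, new_words):
        if a != b:
            break
        committed.append(a)
    return committed
```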
System Info
GPU properties: NVIDIA A100, 80 GB memory
TensorRT-LLM branch: main
TensorRT-LLM version: 0.11.0.dev2024052800
OS: Ubuntu 22.04
Who can help?
@kaiyux, @byshiue
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I followed the official instructions to build and run Whisper:
python3 run.py --name single_wav_test --engine_dir $output_dir --input_file OSK.wav --assets_dir ./assets
Expected behavior
The expected behavior is a low Word Error Rate (WER) and a complete transcription of the input audio file.
Actual behavior
The actual behavior shows a very high WER (4900.00%) and an incomplete transcription: the output does not cover the entire content of the audio file.
Additional notes
The full report:
The audio from my tests is here: https://www.dropbox.com/scl/fi/t3yplx3wzsdzwox66ljxi/OSK.wav?rlkey=p1z0q0mzudrhhwexk1ka71uaq&st=ncgs34k9&dl=0
Why is the Word Error Rate (WER) so high? Why is the audio transcription incomplete?