NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

100% WER on distil-whisper/distil-large-v2 #1620

esnvidia opened this issue 5 months ago

esnvidia commented 5 months ago

System Info

DGX V100 and DGX A100

Who can help?

@ncomly-nvidia to add more folks.

Reproduction

I followed the whisper example and got the example engines working on an A100 80GB and a V100 16GB. To save the HF model in .bin format, I did:

from transformers import AutoModelForSpeechSeq2Seq
import torch

torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v2"

# Load the HF checkpoint and re-save it as pytorch_model.bin (not safetensors).
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=False
)
model.save_pretrained('./distil-whisper/distil-large-v2', safe_serialization=False)

I had to download the mel_filters.npz and gpt2.tiktoken separately per the directions.
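
For context, here is a minimal Python sketch of that download step (an assumption on my part: both asset files live under whisper/assets in the openai/whisper repository, which matches the gpt2.tiktoken wget command quoted later in this thread):

# Sketch only: fetch the Whisper asset files the example expects.
import os
import urllib.request

ASSETS_DIR = "assets"
BASE = "https://raw.githubusercontent.com/openai/whisper/main/whisper/assets"

os.makedirs(ASSETS_DIR, exist_ok=True)
for name in ("mel_filters.npz", "gpt2.tiktoken"):
    urllib.request.urlretrieve(f"{BASE}/{name}", os.path.join(ASSETS_DIR, name))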

Example build and run commands:

output_dir=distil_whisper_large_v2
python3 build.py --model_dir /workspace/models/whisper/assets/ --model_name distil-large-v2 --output_dir $output_dir --dtype float16 --enable_context_fmha --use_gpt_attention_plugin --use_gemm_plugin --use_bert_attention_plugin float16 

python3 run.py --engine_dir $output_dir --dataset hf-internal-testing/librispeech_asr_dummy --name librispeech_dummy_output --tokenizer_name gpt2 --assets_dir /models/whisper/assets/ --dataset librispeech_asr --results_dir /models/whisper/results

Expected behavior

Not get >100% WER on librispeech_asr :)

Actual behavior

in errs-librispeech.txt

%WER = 150.73 Errors: 28722 insertions, 3162 deletions, 50714 substitutions, over 54798 reference words (922 correct) Search below for sections starting with PER-UTT DETAILS:, SUBSTITUTIONS:, DELETIONS:, INSERTIONS:, PER-WORD STATS:

in rtf-librispeech.txt

RTF: 0.0098 total_duration: 19396.121 seconds (5.39 hours) processing time: 189.115 seconds (0.05 hours) batch size: 4 num_beams: 1
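
As a quick sanity check of the reported number (assuming RTF is defined as processing time divided by total audio duration):

# RTF = processing time / total audio duration
print(189.115 / 19396.121)  # ~0.00975, consistent with the reported 0.0098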

Additional notes

N/A

yuekaizhang commented 5 months ago

Did you first run https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/distil_whisper/convert_from_distil_whisper.py ?

See https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/whisper#distil-whisper; you may need to convert the Hugging Face checkpoint first.

@esnvidia
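
For reference, a minimal sanity check of the conversion step, under the assumption (not verified against the script) that convert_from_distil_whisper.py writes an OpenAI-Whisper-style checkpoint with 'dims' and 'model_state_dict' keys, which is what build.py in the whisper example loads; the path below is illustrative:

# Hypothetical check: an OpenAI-Whisper-style checkpoint is a dict holding
# 'dims' and 'model_state_dict', whereas a raw HF save_pretrained() dump
# (pytorch_model.bin) is only a flat state dict.
import torch

ckpt = torch.load("assets/distil-large-v2.pt", map_location="cpu")
if not (isinstance(ckpt, dict) and "dims" in ckpt and "model_state_dict" in ckpt):
    raise RuntimeError(
        "Checkpoint is not in OpenAI Whisper format; run "
        "examples/whisper/distil_whisper/convert_from_distil_whisper.py first."
    )
print(ckpt["dims"])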

esnvidia commented 5 months ago

Yes, here's the exact steps I ran:

https://github.com/esnvidia/distil_whisper_hf2_triton



esnvidia commented 5 months ago

The test step:

python run.py --engine_dir $output_diry --name librispeech_dummy_output --tokenizer_name gpt2 --assets_dir ./assets/ --dataset librispeech_asr --results_dir ./results

It needs a small tweak to the command, but that should be simple to figure out.



yuekaizhang commented 5 months ago

Oh, I see. For distil-large-v2 you should use the default multilingual tokenizer rather than gpt2. @esnvidia
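
To illustrate why the tokenizer choice matters: distil-large-v2 keeps the multilingual Whisper vocabulary of the large-v2 model it was distilled from, so decoding its output IDs with the English-only gpt2 table maps them to unrelated text, which is consistent with the ~150% WER above. A small sketch comparing the two rank tables (assumptions: Whisper's *.tiktoken files hold one "<base64 token> <rank>" pair per line, multilingual.tiktoken is downloaded the same way as gpt2.tiktoken, and the paths are illustrative):

import base64

def load_ranks(path):
    # Parse a .tiktoken rank file: one "<base64 token> <rank>" pair per line.
    ranks = {}
    with open(path) as f:
        for line in f:
            if line.strip():
                token, rank = line.split()
                ranks[int(rank)] = base64.b64decode(token)
    return ranks

multi = load_ranks("assets/multilingual.tiktoken")
gpt2 = load_ranks("assets/gpt2.tiktoken")

for i in (500, 5000, 20000):  # arbitrary IDs for illustration
    print(i, multi.get(i), gpt2.get(i))  # same ID, generally different byte strings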

yuekaizhang commented 5 months ago

Yes, here's the exact steps I ran: https://github.com/esnvidia/distil_whisper_hf2_triton

Also, you are welcome to contribute this Triton model_repo for Distil-Whisper to sherpa/triton/whisper if you have some free time.

esnvidia commented 5 months ago

@yuekaizhang Are you sure it's multilingual? The step in the example shows gpt2:

Here is the command:

python3 run.py --engine_dir $output_dir --dataset hf-internal-testing/librispeech_asr_dummy --name librispeech_dummy_${output_dir} --tokenizer_name gpt2

as well as this step:

# download the gpt2.tiktoken
wget --directory-prefix=assets https://raw.githubusercontent.com/openai/whisper/main/whisper/assets/gpt2.tiktoken

esnvidia commented 5 months ago

@yuekaizhang confirmed the need for multilingual. This needs to be updated in the docs.

yuekaizhang commented 5 months ago

@yuekaizhang confirmed the need for multilingual. This needs to be updated in the docs.

Updated. Users no longer need to specify tokenizer_name themselves.

esnvidia commented 5 months ago

Awesome, but I still don't see the change reflected in the main branch. I'm looking here: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/whisper#distil-whisper

Is there a PR tied to this?

I'm also getting 100% WER using the Triton ASR client, by the way. Let me know if you want me to file an issue there. I think the fix simply involves copying the functions from run.py here, since that is how I got the 3% WER.

I can contribute to sherpa etc once this works E2E. :)

yuekaizhang commented 5 months ago

Is there a PR tied to this?

Yes. I have updated it in GitLab; it will sync to GitHub in a few days.

I'm also getting 100% WER using the Triton ASR client, by the way. Let me know if you want me to file an issue there. I think the fix simply involves copying the functions from run.py here, since that is how I got the 3% WER.

See https://github.com/k2-fsa/sherpa/tree/master/triton/whisper#benchmark-using-dataset and try --whisper-prompt "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>". If that doesn't work, please file an issue under sherpa with more details, and I will investigate there.

I can contribute to sherpa etc once this works E2E. :)

That sounds great. @esnvidia

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 15 days.