Closed lionsheep24 closed 3 weeks ago
We are investigating internally.
@lionsheep24 Would you mind trying fp16 precision? I think you're using fp32 here.
Also, what performance numbers (e.g. RTF, WER) do you get by running the official example https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/run.py? On an A100, I'd expect you to finish decoding the Hugging Face audio test set in 8 seconds with fp16.
After reporting the RTF number from the official whisper run.py, could you paste the logs (files like errs.txt, rtf.txt) from running your custom model with whisper/run.py?
You may also try this env https://github.com/k2-fsa/sherpa/tree/master/triton/whisper#quick-start to check what performance numbers you get. With this docker-compose file, we can match the environment exactly.
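For reference, the two metrics requested above can be computed roughly as follows. This is a minimal sketch, not the scoring code used by run.py: it assumes RTF is processing time divided by audio duration and WER is word-level edit distance divided by the reference length. The function names are illustrative.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1.0 means transcription runs faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)
```

With these definitions, decoding 1 s of audio in 0.5 s gives an RTF of 0.5, and one wrong word in a three-word reference gives a WER of 1/3.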
@yuekaizhang Do you mean running convert_checkpoint with the fp16 argument? My audio sample is a 1 s clip and the results are clear. However, no results were obtained.
@lionsheep24 We first need to make sure you can reproduce the official recipe's performance. Could you report the RTF and WER numbers you get after running examples/whisper/run.py?

> Run convert_checkpoint with fp16 argument, you mean?

Just remove the --fp32 options from your commands.
@yuekaizhang

> @lionsheep24 We need to first make sure if you could reproduce the official recipes' performance. Could you report what RTF and WER numbers you got after running example/whisper/run.py?

With my model, and with the fp32 options removed?
@yuekaizhang Trying fp16 precision throws an error during trtllm-build (encoder). I guess only fp16 models (like large-v3) work. Can you clarify this issue?
[06/07/2024-04:13:17] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[06/07/2024-04:13:17] [TRT-LLM] [I] Set dtype to float16.
[06/07/2024-04:13:17] [TRT] [I] [MemUsageChange] Init CUDA: CPU +15, GPU +0, now: CPU 148, GPU 72666 (MiB)
[06/07/2024-04:13:22] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1939, GPU +348, now: CPU 2223, GPU 73014 (MiB)
[06/07/2024-04:13:22] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[06/07/2024-04:13:22] [TRT-LLM] [W] allreduce algorithm is selected automatically during execution now. use_custom_all_reduce will be deprecated in future releases.
[06/07/2024-04:13:22] [TRT-LLM] [I] Set nccl_plugin to None.
[06/07/2024-04:13:22] [TRT-LLM] [I] Set use_custom_all_reduce to False.
[06/07/2024-04:13:22] [TRT] [E] 4: [convolutionNode.cpp::validateTypes::76] Error Code 4: Internal Error (WhisperEncoder/conv1/conv1d/CONVOLUTION_0: input and kernel weights must have same type)
Traceback (most recent call last):
File "/usr/local/bin/trtllm-build", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 489, in main
parallel_build(source, build_config, args.output_dir, workers,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 368, in parallel_build
passed = build_and_save(rank, rank % workers, ckpt_dir,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 327, in build_and_save
engine = build_model(build_config,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 320, in build_model
return build(model, build_config)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 841, in build
model(**inputs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
output = self.forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 1797, in forward
x = self.conv1(x)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
output = self.forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/conv.py", line 212, in forward
return conv1d(input, self.weight.value,
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 3375, in conv1d
output_2d = _create_tensor(layer.get_output(0), layer)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 607, in _create_tensor
assert trt_tensor.shape.__len__(
AssertionError: tensor WhisperEncoder/conv1/conv1d/CONVOLUTION_0_output_0 has an invalid shape
Let me share my build script.
- python3 convert_checkpoint.py --model_dir /workspace/models/whisper-openai --output_dir /workspace/models/whisper-tensorrt-llm --model_name large-v2
- trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 16 --gemm_plugin disable --bert_attention_plugin float16 --remove_input_padding disable
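The TensorRT error above says that the convolution input and kernel weights have different types, which would be consistent with float32 weights being fed into a float16 network. One way to check which dtypes were actually written to disk is to read the header of the converted checkpoint. The sketch below assumes the checkpoint is stored in the safetensors format (an 8-byte little-endian header length followed by a JSON header); the function name and the path in the usage comment are hypothetical.

```python
import json
import struct
from collections import Counter


def count_dtypes(path):
    """Count tensors per dtype by reading only the safetensors header."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return Counter(
        entry["dtype"]
        for name, entry in header.items()
        if name != "__metadata__"  # optional free-form metadata, not a tensor
    )


# Usage (hypothetical path, adjust to your checkpoint layout):
# print(count_dtypes("/workspace/models/whisper-tensorrt-llm/encoder/rank0.safetensors"))
```

If the result contains F32 entries while the build is configured for float16, that mismatch is a plausible trigger for the error, though this has not been verified against the TensorRT-LLM sources.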
@lionsheep24 Our internal fix, which may be related to this issue, will be synced to GitHub within a week. Alternatively, you could manually convert your model to fp16 first, e.g. model = model.half()
@yuekaizhang
As you said, simply adding `.half()` to `model = AutoModel.from_pretrained(model_name, use_safetensors=True)` solved the issue.
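Casting with `.half()` stores every floating-point weight as float16, so the weights match the float16 dtype the engine is built with. As a rough illustration of what that cast does to individual values, the sketch below rounds Python floats to IEEE 754 half precision using the standard library's `struct` format `e`. It is not PyTorch's implementation, and the helper name is made up for this example.

```python
import struct


def to_float16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # Packing with the "e" format stores 16 bits; unpacking reads them back
    # as a Python float, so the round trip exposes the rounding error.
    return struct.unpack("<e", struct.pack("<e", value))[0]


for weight in (1.0, 0.1, 1e-3):
    rounded = to_float16(weight)
    print(f"{weight!r} -> {rounded!r} (abs error {abs(rounded - weight):.3e})")
```

Values that are exactly representable, such as 1.0, survive unchanged, while others pick up a small rounding error, which is usually acceptable for Whisper inference.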
The WER problem is fixed as well. The root cause was the language prompt.
Please refer to my 1s audio benchmark (my use case is transcribing short audio for streaming)
Method | Latency (sec) | Tokens per second |
---|---|---|
tensorrt-llm | 0.21 | 28.99 |
faster-whisper | 1.43 | 4.19 |
huggingface | 1.7 | 3.52 |
openai | 2.1 | 2.8 |
P.S. In my benchmark results, the tokens per second were higher for 5-second and 10-second audio inputs. Why doesn't the transcription speed scale linearly with the length of the input audio?
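One possible explanation, offered as an assumption rather than a measured result: Whisper implementations typically pad the input to a fixed 30-second window, so the encoder cost and the request overhead are roughly constant, while only the decoder cost grows with the number of output tokens. Under that assumption, latency is a fixed term plus a per-token term, and tokens per second rises with longer audio instead of staying flat. The numbers below are hypothetical and chosen only to show the shape of the curve.

```python
def tokens_per_second(n_tokens, fixed_overhead_s, per_token_s):
    """Throughput under a simple model: latency = fixed + n_tokens * per_token."""
    latency = fixed_overhead_s + n_tokens * per_token_s
    return n_tokens / latency


# Hypothetical costs: 0.2 s fixed (encoder + overhead), 10 ms per decoded token.
for n_tokens in (6, 30, 60):
    rate = tokens_per_second(n_tokens, fixed_overhead_s=0.2, per_token_s=0.01)
    print(f"{n_tokens:3d} tokens -> {rate:.1f} tokens/s")
```

In this model the throughput approaches 1 / per_token_s as the output grows, because the fixed cost is amortized over more tokens.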
Reproduction
I benchmarked trtllm-whisper served by Triton (built with the newer trtllm-build command; the older version was built with python build.py), but it was slower than Hugging Face with flash attention and faster-whisper. The latency bottleneck was decoding, which took about 500~700 ms for 1 s of audio.
The transcription result was also incorrect and inconsistent, even with max_beam_width set to 1. As I recall, the engine built with the older trtllm version transcribed well.
After multiple tests, I tried to terminate tritonserver, but the error below was thrown. Any help or advice would be appreciated!
My project is a combination of the official whisper example, the trtllm python backend implementation, and the triton client example.
I compiled my fine-tuned Hugging Face whisper model with the procedure below.
python3 convert_from_distil_whisper.py --model_name /workspace/models/whisper-large-v2/2 --output_dir /workspace/models/whisper-openai --output_name large-v2
python3 convert_checkpoint.py --model_dir /workspace/models/whisper-openai --output_dir /workspace/models/whisper-tensorrt-llm --model_name large-v2 --dtype float32 --logits_dtype float32
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 16 --gemm_plugin disable --bert_attention_plugin float32 --remove_input_padding disable
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/decoder --output_dir /workspace/models/1/decoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_beam_width 1 --max_batch_size 16 --max_output_len 100 --max_input_len 1024 --max_encoder_input_len 1500 --gemm_plugin float32 --bert_attention_plugin float32 --gpt_attention_plugin float32 --remove_input_padding disable
Expected behavior
Faster than Hugging Face and faster-whisper, with consistent CER performance.
Actual behavior
Slow inference (RTF was about 1.0), inconsistent transcription results, and an unstable server.
Additional notes
Let me share my Dockerfiles to reproduce this issue.
# Environment variables for MPI
ENV MPI_HOME=/usr/local/mpi
ENV PATH="$MPI_HOME/bin:$PATH"
ENV LD_LIBRARY_PATH="$MPI_HOME/lib:$LD_LIBRARY_PATH"

# Install necessary packages
RUN apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs

# Copy pip.conf
COPY .tmp/pip.conf /root/.config/pip/pip.conf

# Copy cacert.pem
COPY .tmp/cacert.pem /opt/conda/lib/python3.10/site-packages/certifi/cacert.pem

# Inform Git about the CA bundle for certificate verification
RUN git config --global http.sslCAInfo /opt/conda/lib/python3.10/site-packages/certifi/cacert.pem

# Upgrade pip and install necessary Python packages
RUN pip install --upgrade pip setuptools wheel

# Clone the TensorRT-LLM repository
RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git /workspace/TensorRT-LLM && \
    cd /workspace/TensorRT-LLM && \
    git checkout b777bd6
WORKDIR /workspace/TensorRT-LLM

RUN pip install -r examples/whisper/requirements.txt
RUN pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com tensorrt-llm==0.11.0.dev2024060400 tiktoken datasets kaldialign openai-whisper librosa soundfile safetensors transformers janus

# Setup Git LFS
RUN git lfs install

COPY models/whisper-large-v2 /workspace/models/whisper-large-v2
COPY ./assets /workspace/TensorRT-LLM/examples/whisper/assets
FROM nvcr.io/nvidia/tritonserver:24.03-py3

RUN apt update && apt-get install -y ffmpeg
RUN python3 -m pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com tensorrt-llm==0.11.0.dev2024060400
RUN python3 -m pip install mpmath==1.3.0 gradio==3.50.2 tritonclient[all]

COPY stt_task/tensorrt_llm/triton/requirements.txt /workspace/requirements.txt
WORKDIR /workspace
RUN python3 -m pip install -r requirements.txt

# COPY model
COPY ./models/whisper_large_v2_tensorrt_llm /workspace/models/whisper-large-v2-tensorrt-llm/1/whisper-large-v2

# COPY src
COPY ./stt/triton/server /workspace/models/whisper-large-v2-tensorrt-llm/1
COPY ./config.pbtxt /workspace/models/whisper-large-v2-tensorrt-llm
COPY ./launch_server.sh /workspace/launch_server.sh