TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
Apache License 2.0

Performance issues with Whisper in many aspects: latency, reproducibility, and more #1740

Closed lionsheep24 closed 3 weeks ago

lionsheep24 commented 1 month ago

System Info

Who can help?




I benchmarked TensorRT-LLM Whisper served by Triton, using engines built with the newer trtllm-build command (the older version was built with python build.py), but it was slower than both the Hugging Face implementation with flash attention and faster-whisper. The latency bottleneck was decoding, which took about 500~700 ms for 1 s of audio.

The transcription results were also incorrect and inconsistent, even with a max_beam_width of 1. As far as I remember, engines built with the older TensorRT-LLM version transcribed well.
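One way to quantify the inconsistency is to send the same audio several times and count how many distinct transcripts come back. The following is a minimal sketch; `transcribe` is a hypothetical callable standing in for whatever client request is used, and is not part of TensorRT-LLM or Triton.

```python
from collections import Counter
from typing import Callable


def count_distinct_outputs(transcribe: Callable[[str], str],
                           audio_path: str,
                           runs: int = 10) -> Counter:
    """Call `transcribe` on the same input `runs` times and tally the outputs."""
    return Counter(transcribe(audio_path) for _ in range(runs))


if __name__ == "__main__":
    # Stand-in for a real client call (hypothetical); replace with the actual request.
    def fake_transcribe(path: str) -> str:
        return "hello world"

    tally = count_distinct_outputs(fake_transcribe, "sample.wav", runs=5)
    # With a beam width of 1, a single distinct transcript would be expected.
    print(len(tally), tally.most_common(1))
```

More than one key in the returned tally means the same input produced different transcripts across runs.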

After multiple tests, I tried to terminate tritonserver, but the error below was thrown. Any help or advice would be appreciated!

[06/05/2024-15:25:21] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::52] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[06/05/2024-15:25:21] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::52] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaFreeHost(ptr): an illegal memory access was encountered (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:168)
1       0x7f2a9666ae9a tensorrt_llm::runtime::MemoryPool<tensorrt_llm::runtime::PinnedAllocator>::~MemoryPool() + 282
2       0x7f2cd0819495 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45495) [0x7f2cd0819495]
3       0x7f2cd0819610 on_exit + 0
4       0x7f2cd07fdd97 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d97) [0x7f2cd07fdd97]
5       0x7f2cd07fde40 __libc_start_main + 128
6       0x560e37d701a5 /opt/tritonserver/backends/python/triton_python_backend_stub(+0x271a5) [0x560e37d701a5]

My project is a combination of the official whisper example, the trtllm-python backend implementation, and the triton client example.

I compiled my fine-tuned Hugging Face Whisper model with the procedure below.

  1. Convert the Hugging Face model to the OpenAI format: python3 convert_from_distil_whisper.py --model_name /workspace/models/whisper-large-v2/2 --output_dir /workspace/models/whisper-openai --output_name large-v2
  2. Convert the checkpoint to the TensorRT-LLM format: python3 convert_checkpoint.py --model_dir /workspace/models/whisper-openai --output_dir /workspace/models/whisper-tensorrt-llm --model_name large-v2 --dtype float32 --logits_dtype float32
  3. Build the TensorRT-LLM encoder engine: trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 16 --gemm_plugin disable --bert_attention_plugin float32 --remove_input_padding disable
  4. Build the decoder engine: trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/decoder --output_dir /workspace/models/1/decoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_beam_width 1 --max_batch_size 16 --max_output_len 100 --max_input_len 1024 --max_encoder_input_len 1500 --gemm_plugin float32 --bert_attention_plugin float32 --gpt_attention_plugin float32 --remove_input_padding disable
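Since the precision appears in several flags, it can help to assemble the build command programmatically so that only one value changes between an fp32 and an fp16 build. The sketch below rebuilds the encoder command from step 3 using only the flags listed there; the function name and its `dtype` parameter are illustrative assumptions, not part of TensorRT-LLM.

```python
import shlex


def encoder_build_cmd(checkpoint_dir: str, output_dir: str,
                      dtype: str = "float32") -> list[str]:
    """Assemble the encoder trtllm-build command from step 3 as an argv list."""
    return [
        "trtllm-build",
        "--checkpoint_dir", checkpoint_dir,
        "--output_dir", output_dir,
        "--paged_kv_cache", "disable",
        "--moe_plugin", "disable",
        "--enable_xqa", "disable",
        "--use_custom_all_reduce", "disable",
        "--max_batch_size", "16",
        "--gemm_plugin", "disable",
        "--bert_attention_plugin", dtype,  # the only precision-dependent flag here
        "--remove_input_padding", "disable",
    ]


if __name__ == "__main__":
    cmd = encoder_build_cmd("/workspace/models/whisper-tensorrt-llm/encoder",
                            "/workspace/models/1/encoder",
                            dtype="float16")
    # Print a copy-pasteable shell line; to execute it instead, pass `cmd`
    # to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```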

Expected behavior

Faster than Hugging Face and faster-whisper, with consistent CER.
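For reference, CER (character error rate) is the character-level edit distance between the reference and the hypothesis, divided by the reference length. A minimal, dependency-free sketch follows; the function names are illustrative, and no text normalization is applied here even though real evaluations usually normalize first.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of `hyp` against a non-empty reference `ref`."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(cer("hello world", "hello word"))
```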

Actual behavior

Slow inference (RTF was about 1.0), inconsistent transcription results, and an unstable server.
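RTF (real-time factor) here means processing time divided by audio duration, so an RTF of 1.0 means transcription takes as long as the audio itself. A small sketch of how it can be measured; `timed` and the stand-in workload are illustrative assumptions, not the benchmark code used in this issue.

```python
import time
from typing import Any, Callable, Tuple


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def timed(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Run `fn(*args)` and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in workload (hypothetical); replace with the real transcription call.
    _, elapsed = timed(sum, range(1_000_000))
    print(real_time_factor(elapsed, audio_seconds=1.0))
```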

Additional notes

Let me share my Dockerfiles to reproduce this issue.

  1. For model compile
    # Use the NVIDIA CUDA image with development tools and Ubuntu 22.04
    FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
    #FROM nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04
    # Set the working directory
    WORKDIR /workspace

    # Environment variables for MPI

    # Install necessary packages
    RUN apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs

    # copy pip.conf
    COPY .tmp/pip.conf /root/.config/pip/pip.conf

    # copy cacert.pem
    COPY .tmp/cacert.pem /opt/conda/lib/python3.10/site-packages/certifi/cacert.pem

    # Inform Git about the CA bundle for certificate verification
    RUN git config --global http.sslCAInfo /opt/conda/lib/python3.10/site-packages/certifi/cacert.pem

    # Upgrade pip and install necessary Python packages
    RUN pip install --upgrade pip setuptools wheel

    # Clone the TensorRT-LLM repository
    RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git /workspace/TensorRT-LLM && \
        cd /workspace/TensorRT-LLM && \
        git checkout b777bd6
    WORKDIR /workspace/TensorRT-LLM

    RUN pip install -r examples/whisper/requirements.txt

    RUN pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com tensorrt-llm==0.11.0.dev2024060400 tiktoken datasets kaldialign openai-whisper librosa soundfile safetensors transformers janus

    # Setup Git LFS
    RUN git lfs install

    COPY models/whisper-large-v2 /workspace/models/whisper-large-v2
    COPY ./assets /workspace/TensorRT-LLM/examples/whisper/assets

  2. For tritonserver
    FROM nvcr.io/nvidia/tritonserver:24.03-py3

    RUN apt update && apt-get install -y ffmpeg
    RUN python3 -m pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com tensorrt-llm==0.11.0.dev2024060400
    RUN python3 -m pip install mpmath==1.3.0 gradio==3.50.2 tritonclient[all]

    COPY stt_task/tensorrt_llm/triton/requirements.txt /workspace/requirements.txt
    WORKDIR /workspace
    RUN python3 -m pip install -r requirements.txt

    # COPY model
    COPY ./models/whisper_large_v2_tensorrt_llm /workspace/models/whisper-large-v2-tensorrt-llm/1/whisper-large-v2

    # COPY src
    COPY ./stt/triton/server /workspace/models/whisper-large-v2-tensorrt-llm/1
    COPY ./config.pbtxt /workspace/models/whisper-large-v2-tensorrt-llm
    COPY ./launch_server.sh /workspace/launch_server.sh

hijkzzz commented 1 month ago

We are investigating internally.

yuekaizhang commented 1 month ago

@lionsheep24 Would you mind trying fp16 precision? I think you're using fp32 here.

Also, what performance numbers (e.g. RTF, WER) did you get by running the official example https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/run.py? On an A100, I expect you could finish decoding the Hugging Face audio test set in 8 secs with fp16.

After reporting the RTF number from the official whisper run.py, could you paste the logs (files like errs.txt, rtf.txt) from running your custom model with whisper/run.py?

You may also try this env https://github.com/k2-fsa/sherpa/tree/master/triton/whisper#quick-start to check what performance numbers you get. With this docker-compose file, we could match the env exactly.

lionsheep24 commented 1 month ago

@yuekaizhang Do you mean running convert_checkpoint with an fp16 argument? My audio sample is a 1 s clip and the expected result is clear. However, no results were obtained.

yuekaizhang commented 1 month ago

@lionsheep24 We first need to make sure you can reproduce the official recipe's performance. Could you report the RTF and WER numbers you got after running examples/whisper/run.py?

Run convert_checkpoint with fp16 argument, you mean?

Just remove the --fp32 options in your commands.

lionsheep24 commented 1 month ago


@lionsheep24 We first need to make sure you can reproduce the official recipe's performance. Could you report the RTF and WER numbers you got after running examples/whisper/run.py?

With my model, removing fp32 options?

lionsheep24 commented 1 month ago

@yuekaizhang Trying fp16 precision throws an error during trtllm-build (encoder). I guess only fp16 models work (like large-v3). Can you clarify this issue?

[06/07/2024-04:13:17] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT
[06/07/2024-04:13:17] [TRT-LLM] [I] Set dtype to float16.
[06/07/2024-04:13:17] [TRT] [I] [MemUsageChange] Init CUDA: CPU +15, GPU +0, now: CPU 148, GPU 72666 (MiB)
[06/07/2024-04:13:22] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1939, GPU +348, now: CPU 2223, GPU 73014 (MiB)
[06/07/2024-04:13:22] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[06/07/2024-04:13:22] [TRT-LLM] [W] allreduce algorithm is selected automatically during execution now. use_custom_all_reduce will be deprecated in future releases.
[06/07/2024-04:13:22] [TRT-LLM] [I] Set nccl_plugin to None.
[06/07/2024-04:13:22] [TRT-LLM] [I] Set use_custom_all_reduce to False.
[06/07/2024-04:13:22] [TRT] [E] 4: [convolutionNode.cpp::validateTypes::76] Error Code 4: Internal Error (WhisperEncoder/conv1/conv1d/CONVOLUTION_0: input and kernel weights must have same type)
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 489, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 368, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 327, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 320, in build_model
    return build(model, build_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 841, in build
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/enc_dec/model.py", line 1797, in forward
    x = self.conv1(x)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/conv.py", line 212, in forward
    return conv1d(input, self.weight.value,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 3375, in conv1d
    output_2d = _create_tensor(layer.get_output(0), layer)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 607, in _create_tensor
    assert trt_tensor.shape.__len__(
AssertionError: tensor WhisperEncoder/conv1/conv1d/CONVOLUTION_0_output_0 has an invalid shape

Let me share my build script.

  1. python3 convert_checkpoint.py --model_dir /workspace/models/whisper-openai --output_dir /workspace/models/whisper-tensorrt-llm --model_name large-v2
  2. trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 16 --gemm_plugin disable --bert_attention_plugin float16 --remove_input_padding disable

yuekaizhang commented 1 month ago

@lionsheep24 Our internal fix, which may be related to this issue, will be synced to GitHub within a week. Alternatively, you could manually convert your model to fp16 first, e.g. model = model.half()
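The manual conversion could look roughly like the sketch below. It assumes PyTorch and Hugging Face Transformers are installed; `convert_to_fp16` and its directory arguments are hypothetical names for illustration, and saving a separate fp16 copy is one possible approach rather than what the conversion script itself does.

```python
def convert_to_fp16(src_dir: str, dst_dir: str) -> None:
    """Load a Hugging Face checkpoint, cast its weights to fp16, and save a copy."""
    # Imported here so the heavy dependencies are only needed when converting.
    import torch
    from transformers import AutoModel

    model = AutoModel.from_pretrained(src_dir, use_safetensors=True)
    model = model.half()  # cast floating-point parameters and buffers to float16

    # Sanity check: the first parameter should now be half precision.
    assert next(model.parameters()).dtype == torch.float16

    model.save_pretrained(dst_dir)


if __name__ == "__main__":
    # Paths are placeholders (assumptions); point them at the real checkpoint.
    convert_to_fp16("/workspace/models/whisper-large-v2/2",
                    "/workspace/models/whisper-large-v2-fp16")
```

The point of the cast is that the checkpoint weights then match the float16 dtype the engine is built with, which is the mismatch reported by the convolution error.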

lionsheep24 commented 1 month ago

@yuekaizhang As you said, simply adding .half() to model = AutoModel.from_pretrained(model_name, use_safetensors=True) solved the issue. The WER problem was also fixed; its root cause was the language prompt. Please refer to my 1 s audio benchmark (my use case is transcribing short audio for streaming).

Method          Latency (sec)  Tokens per second
tensorrt-llm    0.21           28.99
faster-whisper  1.43           4.19
huggingface     1.7            3.52
openai          2.1            2.8
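To make the comparison explicit, the relative speedup can be derived from the latency figures in the table above. This is a small sketch using only those reported numbers; the helper name is illustrative.

```python
# Latencies in seconds for 1 s of audio, copied from the benchmark above.
latency_sec = {
    "tensorrt-llm": 0.21,
    "faster-whisper": 1.43,
    "huggingface": 1.7,
    "openai": 2.1,
}


def speedup_of(target: str, latencies: dict[str, float]) -> dict[str, float]:
    """How many times faster `target` is than each other entry."""
    base = latencies[target]
    return {name: value / base
            for name, value in latencies.items() if name != target}


if __name__ == "__main__":
    for name, factor in speedup_of("tensorrt-llm", latency_sec).items():
        print(f"tensorrt-llm vs {name}: {factor:.2f}x")
```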

P.S.: In my benchmark results, tokens per second was higher for 5-second and 10-second audio inputs. Why doesn't the transcription speed scale linearly with the length of the input audio?
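One possible reading, stated as an assumption and not confirmed in this thread, is that each request carries a fixed cost that does not depend on how many tokens are generated. Under that simple model, tokens per second rises as the output gets longer. The sketch below illustrates the effect; the overhead and per-token values are hypothetical and not measured.

```python
def tokens_per_second(num_tokens: int,
                      fixed_overhead_sec: float,
                      sec_per_token: float) -> float:
    """Throughput under a model where latency = fixed overhead + per-token cost."""
    latency = fixed_overhead_sec + num_tokens * sec_per_token
    return num_tokens / latency


if __name__ == "__main__":
    # Hypothetical values chosen only to show the trend.
    for n in (6, 30, 60):
        rate = tokens_per_second(n, fixed_overhead_sec=0.15, sec_per_token=0.01)
        print(n, round(rate, 2))
```

As the token count grows, the fixed part is amortized over more tokens and throughput approaches 1 / sec_per_token.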