NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Build with fine-tuned-huggingface-whisper #1624

Open lionsheep24 opened 1 month ago

lionsheep24 commented 1 month ago

System Info

I have fine-tuned the whisper-large-v2 model on my custom dataset and tried to build a TensorRT-LLM engine from it. But I got [Errno 2] No such file or directory: '/workspace/models/whisper-large-v2/large-v2.pt' when I ran python3 build.py, since the given model dir has no .pt file. I found huggingface distil-whisper in the README.md, but I could not find a HuggingFace whisper-large-v2 implementation in this repo. Is there a way to build HuggingFace Whisper?

Who can help?

No response

Reproduction

  1. Dockerfile
# Use an official NVIDIA CUDA image as the parent image
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# Set a working directory
WORKDIR /workspace

ENV MPI_HOME=/usr/local/mpi
ENV PATH="$MPI_HOME/bin:$PATH"
ENV LD_LIBRARY_PATH="$MPI_HOME/lib:$LD_LIBRARY_PATH"

# Install Python, pip, OpenMPI, and git needed to build and run TensorRT-LLM
RUN apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git

RUN pip install --upgrade pip setuptools wheel
RUN pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git /workspace/TensorRT-LLM
WORKDIR /workspace/TensorRT-LLM
RUN pip install -r examples/whisper/requirements.txt

RUN apt-get update && \
    apt-get install -y git-lfs && \
    git lfs install

WORKDIR /workspace

COPY models /workspace/models
  2. Command:
     python3 build.py --model_dir /workspace/models/whisper-large-v2 --model_name large-v2 --dtype float16 --max_batch_size 16 --output_dir whisper-large-v2-tensorrt-llm --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_bert_attention_plugin float16 --enable_context_fmha
  3. Files in the "models" folder
lionsheep0724 commented 1 month ago

Updates here. I found some discussions about converting a whisper checkpoint to the huggingface format, including renaming layers. If I do the reverse of the work in the link above, could it solve my problem?

yuekaizhang commented 1 month ago

> Updates here. I found some discussions about converting a whisper checkpoint to the huggingface format, including renaming layers. If I do the reverse of the work in the link above, could it solve my problem?

@lionsheep0724 You are correct. You need to convert the huggingface file back into an openai checkpoint file. You may refer to this file: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/distil_whisper/convert_from_distil_whisper.py.

lionsheep24 commented 1 month ago

@yuekaizhang Yeah, I have seen that page, but I'm not sure the way it converts huggingface to openai will work when the model is not distil-whisper (because of differences in layer names or in the architecture itself).

yuekaizhang commented 1 month ago

> @yuekaizhang Yeah, I have seen that page, but I'm not sure the way it converts huggingface to openai will work when the model is not distil-whisper (because of differences in layer names or in the architecture itself).

@lionsheep0724 You could first try converting with that script. If you hit errors, you may need to check the model_state_dict keys to make sure they match.
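
For orientation, the reverse key mapping looks roughly like the sketch below. This is only a sketch, not the repo's conversion script: the substitutions and the dims fields follow the usual HuggingFace/OpenAI Whisper naming correspondence, so verify each one against your own checkpoint's keys before relying on it.

import torch
from transformers import WhisperForConditionalGeneration

# HF substring -> OpenAI substring; the reverse of the mapping applied when a
# whisper checkpoint is converted to the huggingface format. Double-check these
# against your own state_dict keys.
REVERSE_MAP = {
    "model.encoder.": "encoder.",
    "model.decoder.": "decoder.",
    ".layers.": ".blocks.",
    ".self_attn_layer_norm": ".attn_ln",
    ".self_attn.q_proj": ".attn.query",
    ".self_attn.k_proj": ".attn.key",
    ".self_attn.v_proj": ".attn.value",
    ".self_attn.out_proj": ".attn.out",
    ".encoder_attn_layer_norm": ".cross_attn_ln",
    ".encoder_attn.q_proj": ".cross_attn.query",
    ".encoder_attn.k_proj": ".cross_attn.key",
    ".encoder_attn.v_proj": ".cross_attn.value",
    ".encoder_attn.out_proj": ".cross_attn.out",
    ".final_layer_norm": ".mlp_ln",
    ".fc1": ".mlp.0",
    ".fc2": ".mlp.2",
    "embed_tokens": "token_embedding",
    "embed_positions.weight": "positional_embedding",
    "encoder.layer_norm": "encoder.ln_post",
    "decoder.layer_norm": "decoder.ln",
}

def hf_key_to_openai(key: str) -> str:
    """Translate one HuggingFace Whisper state_dict key to OpenAI naming."""
    for hf_sub, oa_sub in REVERSE_MAP.items():
        key = key.replace(hf_sub, oa_sub)
    return key

hf = WhisperForConditionalGeneration.from_pretrained("/workspace/models/whisper-large-v2")
state_dict = {}
for name, tensor in hf.state_dict().items():
    if name == "proj_out.weight":  # tied to the token embedding, not stored in openai checkpoints
        continue
    state_dict[hf_key_to_openai(name)] = tensor

cfg = hf.config  # field names below follow whisper's ModelDimensions
dims = {
    "n_mels": cfg.num_mel_bins,
    "n_audio_ctx": cfg.max_source_positions,
    "n_audio_state": cfg.d_model,
    "n_audio_head": cfg.encoder_attention_heads,
    "n_audio_layer": cfg.encoder_layers,
    "n_vocab": cfg.vocab_size,
    "n_text_ctx": cfg.max_target_positions,
    "n_text_state": cfg.d_model,
    "n_text_head": cfg.decoder_attention_heads,
    "n_text_layer": cfg.decoder_layers,
}
# build.py expects an OpenAI-style checkpoint with "dims" and "model_state_dict".
torch.save({"dims": dims, "model_state_dict": state_dict},
           "/workspace/models/whisper-large-v2/large-v2.pt")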

lionsheep24 commented 1 month ago

Hi @yuekaizhang,

I've successfully compiled hf-whisper to tensorrt-llm and am currently looking to deploy the model using Triton. However, I'm encountering some confusion regarding the expected I/O format for the server.

examples/whisper/run.py appears to take its input as audio file paths. In contrast, my understanding is that Triton generally expects inputs as tensors or arrays (audio samples or mel features). The guide docs for llama seem to use an input format similar to run.py (input_text), as shown in the documentation examples.

Could you clarify how I should handle the input format to properly integrate tensorrt-llm-whisper with Triton? Any guidance or pointers would be greatly appreciated!

Thank you!

yuekaizhang commented 1 month ago

> Hi @yuekaizhang,
>
> I've successfully compiled hf-whisper to tensorrt-llm and am currently looking to deploy the model using Triton. However, I'm encountering some confusion regarding the expected I/O format for the server.
>
> examples/whisper/run.py appears to take its input as audio file paths. In contrast, my understanding is that Triton generally expects inputs as tensors or arrays (audio samples or mel features). The guide docs for llama seem to use an input format similar to run.py (input_text), as shown in the documentation examples.
>
> Could you clarify how I should handle the input format to properly integrate tensorrt-llm-whisper with Triton? Any guidance or pointers would be greatly appreciated!
>
> Thank you!

@lionsheep0724 Check this Python backend integration first: https://github.com/k2-fsa/sherpa/tree/master/triton/whisper. We will support the triton trtllm-backend in the future.
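
For reference, a client for that kind of deployment sends raw samples as tensors, roughly like the sketch below. The tensor names WAV, TEXT_PREFIX, and TRANSCRIPTS are assumptions modeled on that repo's client; check its config.pbtxt and client.py for the actual names, shapes, and datatypes.

import numpy as np
import soundfile as sf
import tritonclient.grpc as grpcclient

# Load a mono 16 kHz wav into a [1, num_samples] float32 array (one batch item).
samples, sample_rate = sf.read("test.wav", dtype="float32")
samples = samples.reshape(1, -1)

# Assumed audio input tensor carrying raw samples.
wav_in = grpcclient.InferInput("WAV", samples.shape, "FP32")
wav_in.set_data_from_numpy(samples)

# Assumed string input carrying the Whisper prompt / task tokens.
prefix = np.array([["<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"]], dtype=object)
prefix_in = grpcclient.InferInput("TEXT_PREFIX", prefix.shape, "BYTES")
prefix_in.set_data_from_numpy(prefix)

client = grpcclient.InferenceServerClient("localhost:8001")
result = client.infer(
    model_name="whisper",
    inputs=[wav_in, prefix_in],
    outputs=[grpcclient.InferRequestedOutput("TRANSCRIPTS")],  # assumed output name
)
print(result.as_numpy("TRANSCRIPTS").flatten()[0].decode("utf-8"))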

lionsheep24 commented 1 month ago

@yuekaizhang

Thank you for sharing and quick reply!

I reviewed the link you mentioned and have some questions regarding the implementation:

  1. In the tensorrt-llm Whisper example, run.py loads the model via WhisperTRTLLM, encodes with tensorrt_llm.runtime.session.Session, and decodes with tensorrt_llm.runtime.GenerationSession. client.py in your shared link sends an audio array to a deployed Triton server, and the response appears to be in encoded bytes (i.e., the transcribed result).

  2. The script for launching Triton (launch_server.sh) only provides the compiled model path to tritonserver. According to the details mentioned, the model’s input will be an array and its output should be text.

Considering the above, I have a question about how tritonserver runs the compiled model: Does the decode_wav_file function in run.py correspond to the model's inference process in tritonserver? (i.e., does tritonserver perform encode and decode operations via tensorrt_llm.runtime.session.Session and tensorrt_llm.runtime.GenerationSession?)

P.S. : I'm considering building a pytriton server with some modifications to the decode_wav_file function. Will there be any performance degradation when using the trtllm backend with pytriton? What are your thoughts on my approach?

yuekaizhang commented 1 month ago

> @yuekaizhang
>
> Thank you for sharing and quick reply!
>
> I reviewed the link you mentioned and have some questions regarding the implementation:
>
>   1. In the tensorrt-llm Whisper example, run.py loads the model via WhisperTRTLLM, encodes with tensorrt_llm.runtime.session.Session, and decodes with tensorrt_llm.runtime.GenerationSession. client.py in your shared link sends an audio array to a deployed Triton server, and the response appears to be in encoded bytes (i.e., the transcribed result).
>   2. The script for launching Triton (launch_server.sh) only provides the compiled model path to tritonserver. According to the details mentioned, the model’s input will be an array and its output should be text.
>
> Considering the above, I have a question about how tritonserver runs the compiled model: Does the decode_wav_file function in run.py correspond to the model's inference process in tritonserver? (i.e., does tritonserver perform encode and decode operations via tensorrt_llm.runtime.session.Session and tensorrt_llm.runtime.GenerationSession?)
>
> P.S. : I'm considering building a pytriton server with some modifications to the decode_wav_file function. Will there be any performance degradation when using the trtllm backend with pytriton? What are your thoughts on my approach?

See here https://github.com/k2-fsa/sherpa/blob/master/triton/whisper/model_repo_whisper_trtllm/whisper/1/model.py.
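
Roughly, a Python backend model.py like that one loads the engines once in initialize() and runs the encode/decode (the Session and GenerationSession calls) inside execute(), which is what tritonserver invokes per request. A stripped-down skeleton, with placeholder tensor names and the inference logic elided, looks like this:

import json
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        # Parse the model config and load the TensorRT-LLM engines once per instance,
        # e.g. self.model = WhisperTRTLLM(engine_dir) as in examples/whisper/run.py.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        responses = []
        for request in requests:
            # Placeholder input name; the real name comes from config.pbtxt.
            wav = pb_utils.get_input_tensor_by_name(request, "WAV").as_numpy()
            # The real model.py computes mel features here, runs the encoder
            # (tensorrt_llm.runtime.Session) and the decoder (GenerationSession),
            # then detokenizes the generated ids into text.
            text = "transcript placeholder".encode("utf-8")
            out = pb_utils.Tensor("TRANSCRIPTS", np.array([[text]], dtype=np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses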

The trtllm backend is not ready for encoder-decoder style models for now. You may try using pytriton to directly wrap the decode_wav_file function. Also, you are welcome to contribute once the pytriton solution is ready.
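
A pytriton wrapper could look roughly like the sketch below. The names are illustrative: decode_one is a hypothetical helper you would factor out of decode_wav_file so it accepts a raw sample array, and model stands for the WhisperTRTLLM wrapper built as in run.py.

import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

# Placeholders: load the engine wrapper once at startup, as run.py does.
model = None  # e.g. WhisperTRTLLM(engine_dir)

def decode_one(model, samples: np.ndarray) -> str:
    """Hypothetical helper adapted from decode_wav_file(): mel features, then generate."""
    raise NotImplementedError

@batch
def transcribe(WAV: np.ndarray):
    # WAV arrives as [batch, num_samples] float32; return one transcript per item.
    texts = [decode_one(model, samples) for samples in WAV]
    out = np.array([[t.encode("utf-8")] for t in texts], dtype=np.object_)
    return {"TRANSCRIPTS": out}

with Triton() as triton:
    triton.bind(
        model_name="whisper",
        infer_func=transcribe,
        inputs=[Tensor(name="WAV", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="TRANSCRIPTS", dtype=bytes, shape=(1,))],
        config=ModelConfig(max_batch_size=16),
    )
    triton.serve()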

lionsheep24 commented 1 month ago

@yuekaizhang Yeah, let me follow the example you shared first; pytriton will be my next step. For now I've launched tritonserver with my compiled model, but I'm doubtful whether the pytriton backend can fully leverage the capabilities of the tensorrt-llm engine.