Open lionsheep24 opened 6 months ago
Updates here. I found some discussions about converting a whisper checkpoint to the huggingface format, including renaming layers. If I do the reverse of the work in the link above, could that solve my problem?
@lionsheep0724 You are correct. You need to convert the huggingface file back into an openai checkpoint file. You may refer to this file: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/whisper/distil_whisper/convert_from_distil_whisper.py.
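For intuition, the conversion is essentially a key renaming between the two naming schemes, plus wrapping the weights into the `{"dims": ..., "model_state_dict": ...}` layout that openai whisper checkpoints use. Below is a rough, partial sketch of the kind of renaming involved; it is not exhaustive, and the authoritative mapping is in the script above.

```python
# Partial sketch of huggingface -> openai whisper key renaming.
# Not exhaustive (embeddings and final layer norms need extra handling);
# see convert_from_distil_whisper.py for the full, authoritative mapping.
HF_TO_OPENAI_PATTERNS = [
    ("model.encoder.layers.", "encoder.blocks."),
    ("model.decoder.layers.", "decoder.blocks."),
    (".self_attn.q_proj.", ".attn.query."),
    (".self_attn.k_proj.", ".attn.key."),
    (".self_attn.v_proj.", ".attn.value."),
    (".self_attn.out_proj.", ".attn.out."),
    (".self_attn_layer_norm.", ".attn_ln."),
    (".encoder_attn.q_proj.", ".cross_attn.query."),
    (".encoder_attn.k_proj.", ".cross_attn.key."),
    (".encoder_attn.v_proj.", ".cross_attn.value."),
    (".encoder_attn.out_proj.", ".cross_attn.out."),
    (".encoder_attn_layer_norm.", ".cross_attn_ln."),
    (".fc1.", ".mlp.0."),
    (".fc2.", ".mlp.2."),
    (".final_layer_norm.", ".mlp_ln."),
]

def rename_key(hf_key: str) -> str:
    """Map a huggingface state_dict key to its openai whisper counterpart."""
    for old, new in HF_TO_OPENAI_PATTERNS:
        hf_key = hf_key.replace(old, new)
    return hf_key
```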
@yuekaizhang Yeah, I have seen that page, but I'm not sure the huggingface-to-openai conversion works when the model is not distil-whisper (the layer names or the architecture itself may differ).
@lionsheep0724 You could first try to use the script to convert. If you hit errors, you may need to check the model_state_dict keys to make sure they match.
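A quick way to do that check is to diff the key sets of the two checkpoints; a minimal sketch (the paths are placeholders for your own checkpoints):

```python
import torch
from transformers import WhisperForConditionalGeneration

# Placeholder paths: point these at your fine-tuned HF checkpoint and the
# converted (or a reference) openai-style .pt checkpoint.
hf_keys = set(
    WhisperForConditionalGeneration.from_pretrained(
        "/workspace/models/whisper-large-v2"
    ).state_dict().keys()
)
openai_ckpt = torch.load("large-v2.pt", map_location="cpu")
openai_keys = set(openai_ckpt["model_state_dict"].keys())

# Any keys printed here indicate a naming mismatch to fix before building.
print("only in HF checkpoint    :", sorted(hf_keys - openai_keys)[:10])
print("only in openai checkpoint:", sorted(openai_keys - hf_keys)[:10])
```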
Hi @yuekaizhang,
I've successfully compiled hf-whisper to tensorrt-llm and am currently looking to deploy the model using Triton. However, I'm encountering some confusion regarding the expected I/O format for the server.
example/whisper run.py appears to take inputs as audio file paths. In contrast, my understanding is that Triton generally expects inputs as tensors or arrays (audio samples or mel features). The guide docs seem to use a similar input format for llama as run.py does (input_text), as shown in the documentation examples.
Could you clarify how I should handle the input format to properly integrate tensorrt-llm-whisper with Triton? Any guidance or pointers would be greatly appreciated!
Thank you!
@lionsheep0724 Check this python backend integration first https://github.com/k2-fsa/sherpa/tree/master/triton/whisper. We will support triton-trtllm-backend in the future.
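For reference, that python backend takes raw audio samples and returns the transcript as bytes, so the client side looks roughly like the sketch below. The tensor names WAV / WAV_LENS / TRANSCRIPTS and the model name whisper are assumptions here; check client.py and the model config in that repo for the exact names.

```python
import numpy as np
import soundfile as sf
import tritonclient.grpc as grpcclient

# Load 16 kHz mono audio as float32 samples and add a batch dimension.
samples, _ = sf.read("test.wav", dtype="float32")
samples = samples.reshape(1, -1)

client = grpcclient.InferenceServerClient("localhost:8001")
inputs = [
    grpcclient.InferInput("WAV", list(samples.shape), "FP32"),
    grpcclient.InferInput("WAV_LENS", [1, 1], "INT32"),
]
inputs[0].set_data_from_numpy(samples)
inputs[1].set_data_from_numpy(np.array([[samples.shape[1]]], dtype=np.int32))

result = client.infer(
    "whisper",  # assumed model name; check the repository's config.pbtxt
    inputs,
    outputs=[grpcclient.InferRequestedOutput("TRANSCRIPTS")],
)
print(result.as_numpy("TRANSCRIPTS")[0].decode("utf-8"))
```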
@yuekaizhang
Thank you for sharing and quick reply!
I reviewed the link you mentioned and have some questions regarding the implementation:
- In the tensorrt-llm Whisper example, run.py loads the model via WhisperTRTLLM, encodes with tensorrt_llm.runtime.session.Session, and decodes with tensorrt_llm.runtime.GenerationSession. client.py in your shared link sends an audio array to a deployed Triton server, and the response appears to be in encoded bytes (i.e., the transcribed result).
- The script for launching Triton (launch_server.sh) only provides the compiled model path to tritonserver. According to the details mentioned, the model’s input will be an array and its output should be text.
Considering the above, I have a question about how tritonserver runs the compiled model: Does the decode_wav_file function in run.py correspond to the model's inference process in tritonserver? (i.e., does tritonserver perform encode and decode operations via tensorrt_llm.runtime.session.Session and tensorrt_llm.runtime.GenerationSession?)
P.S.: I'm considering building a pytriton server with some modifications to the decode_wav_file function. Will there be any performance degradation when serving the tensorrt-llm engine via pytriton? What are your thoughts on my approach?
The trtllm backend is not ready for enc-dec style models for now. You may try to use pytriton to directly wrap the decode_wav_file function. Also, you are welcome to contribute once the pytriton solution is ready.
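To make that concrete, a minimal pytriton sketch could look like the following, assuming you factor the logic of decode_wav_file into a transcribe_array(...) helper that takes raw samples instead of a file path. That helper and its module name are hypothetical, and the tensor names are just placeholders.

```python
import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

# Hypothetical helper adapted from examples/whisper/run.py: it should wrap the
# WhisperTRTLLM encode/decode calls and return one transcript string per call.
from my_whisper_trtllm import transcribe_array

@batch
def infer_fn(WAV: np.ndarray, WAV_LENS: np.ndarray):
    # WAV: (batch, max_samples) float32, WAV_LENS: (batch, 1) int32.
    texts = []
    for samples, length in zip(WAV, WAV_LENS[:, 0]):
        texts.append(transcribe_array(samples[:length]))
    # Encode transcripts as bytes for the string output tensor.
    transcripts = np.char.encode(np.array([[t] for t in texts]), "utf-8")
    return {"TRANSCRIPTS": transcripts}

with Triton() as triton:
    triton.bind(
        model_name="whisper",
        infer_func=infer_fn,
        inputs=[
            Tensor(name="WAV", dtype=np.float32, shape=(-1,)),
            Tensor(name="WAV_LENS", dtype=np.int32, shape=(1,)),
        ],
        outputs=[Tensor(name="TRANSCRIPTS", dtype=bytes, shape=(1,))],
        config=ModelConfig(max_batch_size=16),
    )
    triton.serve()
```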
@yuekaizhang Yeah, let me follow the example you shared first; pytriton will be my next step. I have launched tritonserver with my compiled model for now, but I'm doubtful whether a pytriton backend can fully leverage the capabilities of the tensorrt-llm engine.
Hi @lionsheep24, do you still have any further issues or questions? If not, we'll close this soon.
System Info
I have pretrained a whisper-large-v2 model on my custom dataset and tried to build it with tensorrt-llm, but I got
[Errno 2] No such file or directory: '/workspace/models/whisper-large-v2/large-v2.pt'
when I run python3 build.py. The given model dir indeed has no .pt file, since my model is a huggingface checkpoint. I found huggingface distil-whisper in the README.md, but I could not find a large-v2 / huggingface whisper implementation in this repo. Is there a way to build huggingface whisper?

Who can help?
No response
Reproduction
python3 build.py --model_dir /workspace/models/whisper-large-v2 --model_name large-v2 --dtype float16 --max_batch_size 16 --output_dir whisper-large-v2-tensorrt-llm --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_bert_attention_plugin float16 --enable_context_fmha