thefacetakt opened this issue 4 months ago
The endgame is to use enc-dec + `prompt_embedding_table` to run the whisper model with the tensorrt-llm cpp runtime, but the issue is easier to illustrate using the official examples.
@thefacetakt Would you mind sharing more details? Are you going to do prompt-tuning for whisper?
@yuekaizhang
Well, the plan is:
- modify `WhisperEncoder` to have the same signature as a regular `EncoderModel`
- use the `prompt_embedding_table` input to pass the actual fbank features to `WhisperEncoder` (see the sketch after this list)
- use tritonserver with tensorrtllm_backend for inference.

Seems like it should work?
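To make the second point concrete, here is a rough sketch of how I understand the prompt-tuning path; the vocab size, frame count, and hidden size below are assumptions about Whisper large, and the feature array is a random stand-in, not real fbanks:

```python
import os
import numpy as np

# Sketch only: in tensorrt-llm's prompt-tuning path, input ids >= vocab_size
# select rows of prompt_embedding_table, so the encoder input can be a run of
# consecutive "virtual" ids whose table rows carry the per-frame audio features.
vocab_size = 51865   # assumed Whisper multilingual vocab size
num_frames = 1500    # assumed encoder frame count for 30 s of audio
hidden_size = 1280   # assumed d_model of Whisper large

# Stand-in for real features. Note: raw fbanks are 80-dim, so in practice they
# would first need Whisper's conv stem (or a projection) to reach hidden_size;
# this sketch skips that step.
features = np.random.rand(num_frames, hidden_size).astype(np.float32)
os.makedirs("tmp", exist_ok=True)
np.save("tmp/whisper_ptable.npy", features)

# Virtual ids vocab_size .. vocab_size + num_frames - 1 address table rows
# 0 .. num_frames - 1 of the prompt table.
virtual_ids = np.arange(vocab_size, vocab_size + num_frames, dtype=np.int32)
```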
@thefacetakt We're currently implementing whisper support for the triton tensorrt-llm backend. You could wait for the release, or you could try the python backend first: https://github.com/k2-fsa/sherpa/tree/master/triton/whisper.
@thefacetakt if you have no further questions, we will close this issue in one week.
System Info
TensorRT-LLM commit: 2a115dae84f13daaa54727534daa837c534eceb4
TensorRT-LLM version: 0.11.0.dev2024061800
Reproduction
Build `bart-large-cnn` engines using the official examples (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec) with two modifications:
1) Add `--context_fmha disable` (because of https://github.com/NVIDIA/TensorRT-LLM/issues/1883)
2) Add `--max_prompt_embedding_table_size 32`
While running run.py, provide `--prompt_table_path tmp/ptable_1024.npy`, where ptable_1024.npy was generated beforehand (the snippet itself did not survive the formatting; a stand-in sketch is given under "Actual behavior" below).

Expected behavior
run.py works correctly without errors, and `prompt_embedding_table` is passed to the encoder engine (as `EncoderModel` does have the corresponding input: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/models/enc_dec/model.py#L628).

Actual behavior
With the power of `pdb` I confirmed that the request passed to the executor does contain a valid `prompt_tuning_configs` -- with an `embedding_table` of shape `(10, 1024)` and dtype `float32`.
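For reference, since the snippet that generated ptable_1024.npy is missing above, here is a minimal stand-in consistent with the reported table: shape (10, 1024), dtype float32, where hidden size 1024 matches bart-large-cnn's d_model. Everything else about it is an assumption:

```python
import os
import numpy as np

# Hypothetical stand-in for the missing generation snippet: any float32 table
# of shape (num_virtual_tokens, hidden_size) should reproduce the setup, as
# long as num_virtual_tokens <= --max_prompt_embedding_table_size (32 here).
num_virtual_tokens = 10  # matches the (10, 1024) table reported above
hidden_size = 1024       # matches bart-large-cnn's d_model

os.makedirs("tmp", exist_ok=True)
ptable = np.random.rand(num_virtual_tokens, hidden_size).astype(np.float32)
np.save("tmp/ptable_1024.npy", ptable)
```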
Additional notes
I understand that 0.11.0.dev is not a stable version of TensorRT-LLM, but hopefully this will be fixed in a stable release (or sooner).