NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

enc_dec: prompt_embedding_table not passed to encoder model #1884


thefacetakt commented 3 weeks ago

System Info

TensorRT-LLM commit: 2a115dae84f13daaa54727534daa837c534eceb4
TensorRT-LLM version: 0.11.0.dev2024061800

Who can help?

No response

Reproduction

Build bart-large-cnn engines using the official enc_dec example (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/enc_dec) with two modifications:

1) Add --context_fmha disable (because of https://github.com/NVIDIA/TensorRT-LLM/issues/1883)
2) Add --max_prompt_embedding_table_size 32

While running run.py, provide --prompt_table_path tmp/ptable_1024.npy, where ptable_1024.npy was generated by:

import numpy as np
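# shape (1, num_virtual_tokens, hidden_size); 1024 is bart-large-cnn's hidden size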
table = np.random.randn(1, 10, 1024).astype(np.float32)
np.save('tmp/ptable_1024.npy', table)

Expected behavior

run.py runs without errors and prompt_embedding_table is passed to the encoder engine (EncoderModel does define the corresponding input: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/models/enc_dec/model.py#L628)
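
For reference, my understanding of the p-tuning mechanism that --prompt_table_path relies on, as a minimal sketch (not TensorRT-LLM internals; the vocab size and variable names are assumptions): token ids at or above vocab_size act as virtual prompt tokens and are resolved from prompt_embedding_table instead of the word embedding matrix.

import numpy as np

vocab_size, hidden_size = 50265, 1024             # assumed bart-large-cnn values
word_embeddings = np.random.randn(vocab_size, hidden_size).astype(np.float32)
prompt_table = np.load('tmp/ptable_1024.npy')[0]  # (10, 1024), saved by the snippet above

def embed(token_ids):
    out = np.empty((len(token_ids), hidden_size), dtype=np.float32)
    for i, t in enumerate(token_ids):
        if t < vocab_size:
            out[i] = word_embeddings[t]            # regular vocabulary token
        else:
            out[i] = prompt_table[t - vocab_size]  # virtual ("fake") prompt token
    return out

# 10 virtual prompt tokens followed by two real tokens
embeds = embed(list(range(vocab_size, vocab_size + 10)) + [0, 2])
print(embeds.shape)  # (12, 1024)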

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061800
[07/03/2024-09:52:17] [TRT-LLM] [W] This path is an encoder-decoder model. Using different handling.
[07/03/2024-09:52:19] [TRT-LLM] [I] Load engine takes: 1.5770442485809326 sec
[TensorRT-LLM][ERROR] Encountered an error in forwardAsync function: Input tensor 'prompt_embedding_table' not found; expected shape: (-1, 1024) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:198)
1       0x7f47aa13a79e tensorrt_llm::runtime::TllmRuntime::setInputTensors(int, std::unordered_map<std::string, std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::shared_ptr<tensorrt_llm::runtime::ITensor> > > > const&) + 558
2       0x7f47aa382a68 tensorrt_llm::batch_manager::TrtEncoderModel::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) + 120
3       0x7f47aa3863d2 tensorrt_llm::batch_manager::TrtEncoderModel::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1554
4       0x7f47aa3b7fc1 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 113
5       0x7f47aa3bb6fd tensorrt_llm::executor::Executor::Impl::executionLoop() + 301
6       0x7f48eaeb0253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f48eaeb0253]
7       0x7f4a69f2bac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f4a69f2bac3]
8       0x7f4a69fbca04 clone + 68
Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/run.py", line 505, in <module>
    main(args)
  File "/app/tensorrt_llm/examples/run.py", line 345, in main
    outputs = runner.generate(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 466, in generate
    return self._initialize_and_fill_output(request_ids, end_id,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 520, in _initialize_and_fill_output
    return self._fill_output(responses, output_ids, end_id, return_dict,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 556, in _fill_output
    raise RuntimeError(response.error_msg)
RuntimeError: Encountered an error in forwardAsync function: Input tensor 'prompt_embedding_table' not found; expected shape: (-1, 1024) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:198)
1       0x7f47aa13a79e tensorrt_llm::runtime::TllmRuntime::setInputTensors(int, std::unordered_map<std::string, std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::shared_ptr<tensorrt_llm::runtime::ITensor> > > > const&) + 558
2       0x7f47aa382a68 tensorrt_llm::batch_manager::TrtEncoderModel::executeBatch(tensorrt_llm::batch_manager::ScheduledRequests const&) + 120
3       0x7f47aa3863d2 tensorrt_llm::batch_manager::TrtEncoderModel::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&) + 1554
4       0x7f47aa3b7fc1 tensorrt_llm::executor::Executor::Impl::forwardAsync(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 113
5       0x7f47aa3bb6fd tensorrt_llm::executor::Executor::Impl::executionLoop() + 301
6       0x7f48eaeb0253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f48eaeb0253]
7       0x7f4a69f2bac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f4a69f2bac3]
8       0x7f4a69fbca04 clone + 68

With the power of pdb I confirmed that the request passed to the executor does contain valid prompt_tuning_configs, with an embedding_table of shape (10, 1024) and dtype float32.

Additional notes

I understand that 0.11.0.dev is not a stable version of TensorRT-LLM, but hopefully this will be fixed in a stable release (or sooner).

The endgame is to use enc-dec + prompt_embedding_table to run the Whisper model with the TensorRT-LLM C++ runtime, but the issue is easier to illustrate using the official examples.

yuekaizhang commented 2 weeks ago

The endgame is to use enc-dec + prompt_embedding_table to run the Whisper model with the TensorRT-LLM C++ runtime, but the issue is easier to illustrate using the official examples.

@thefacetakt Would you mind sharing more details? Are you going to do prompt-tuning for Whisper?

thefacetakt commented 2 weeks ago

@yuekaizhang

Well, the plan is:

  • modify WhisperEncoder to have the same signature as regular EncoderModel
  • use prompt_embedding_table input to pass actual fbanks features to WhisperEncoder
  • use tritonserver with tensorrtllm_backend for inference.

Seems like it should work?
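
Roughly what I have in mind for the second step (a sketch only; the 80 mel bins and frame count are assumptions, and the real shapes depend on how WhisperEncoder gets modified):

import numpy as np

# stand-in for real log-mel (fbank) features; shape (num_frames, n_mels)
n_frames, n_mels = 3000, 80
fbanks = np.random.randn(n_frames, n_mels).astype(np.float32)

# run.py loads the prompt table as (1, num_virtual_tokens, hidden),
# so here the frames would play the role of virtual tokens
table = fbanks[None, :, :]            # (1, 3000, 80)
np.save('tmp/whisper_fbank_table.npy', table)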

yuekaizhang commented 2 weeks ago

@yuekaizhang

Well, the plan is:

  • modify WhisperEncoder to have the same signature as regular EncoderModel
  • use prompt_embedding_table input to pass actual fbanks features to WhisperEncoder
  • use tritonserver with tensorrtllm_backend for inference.

Seems like it should work?

@thefacetakt We're currently implementing Whisper support for the Triton TensorRT-LLM backend. You could wait for the release, or you could try the Python backend first: https://github.com/k2-fsa/sherpa/tree/master/triton/whisper.