NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Several Questions about llava #861

Open jkl375 opened 8 months ago

jkl375 commented 8 months ago


  1. The "get_visual_features" function runs the vision tower with TensorRT, and afterwards "input_ids" is moved to the CPU. Could it be kept on the GPU to avoid moving data back and forth between the GPU and CPU?
  2. What is "prompt_table"? I don't see this parameter in the original version of llava. What role does "prompt_table" play during inference?
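As a sketch of what point 1 is asking for (a hypothetical helper, not the actual TensorRT-LLM code), the CPU round trip can be avoided by keeping "input_ids" on whatever device the visual features already live on:

```python
import torch

def keep_on_device(input_ids: torch.Tensor,
                   visual_features: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: move input_ids to the device of the visual
    # features instead of forcing a .cpu() round trip. `.to()` is a
    # no-op when the tensor is already on the target device, so no
    # GPU -> CPU -> GPU copy ever happens.
    return input_ids.to(visual_features.device)
```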
symphonylyh commented 8 months ago

@jkl375

  1. Good catch. I don't think we really need to move it to the CPU. This is probably a typo and we'll fix it soon.
  2. "prompt_table" here is actually the visual output. The name comes from prompt tuning of LLM text prompts; we reuse that mechanism to pass the visual output because the two work the same way: the visual output (or learned text prompt) is concatenated with the LLM text input. Maybe I should rename the variable to make it less confusing in the multimodal context.
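In other words, the concatenation described in point 2 can be sketched like this (illustrative shapes only, not the real TensorRT-LLM code): the "prompt table" is just the visual encoder output, prepended to the embedded text prompt along the sequence axis.

```python
import numpy as np

def concat_prompt(prompt_table: np.ndarray,
                  text_embeds: np.ndarray) -> np.ndarray:
    """Prepend the visual 'prompt' embeddings (shape [num_visual, hidden])
    to the embedded text prompt (shape [num_text, hidden]), exactly as a
    learned prompt would be in prompt tuning."""
    return np.concatenate([prompt_table, text_embeds], axis=0)

# Hypothetical sizes: 576 visual tokens, 32 text tokens, hidden size 5120.
prompt_table = np.zeros((576, 5120), dtype=np.float32)
text_embeds = np.zeros((32, 5120), dtype=np.float32)
llm_input = concat_prompt(prompt_table, text_embeds)  # shape (608, 5120)
```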
Anker-ZX-AI commented 4 months ago

@symphonylyh

I have a question about this part of the code: how does tensorrt_llm implement the mapping between input_ids and the embeddings stored in prompt_table?

For example, suppose input_ids has length 500, with 100 text tokens and 400 visual tokens, while prompt_table has shape [800, 5120], where 5120 is the dimension of each visual token embedding.

How are these input_ids mapped to embeddings?
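One plausible answer, written as a sketch rather than the actual TensorRT-LLM implementation: prompt-tuning-style runtimes typically assign the visual tokens placeholder IDs at or above the vocabulary size, so that an ID below vocab_size goes through the ordinary word embedding and an ID of vocab_size + k selects row k of prompt_table. Under that assumption, the lookup would look like:

```python
import numpy as np

def embed_with_prompt_table(input_ids, word_embedding, prompt_table, vocab_size):
    """Assumed prompt-tuning-style lookup: IDs below vocab_size use the
    normal word embedding; IDs at or above vocab_size are placeholders
    whose offset (id - vocab_size) selects a row of prompt_table."""
    input_ids = np.asarray(input_ids)
    is_prompt = input_ids >= vocab_size
    hidden = word_embedding.shape[1]
    out = np.empty((input_ids.shape[0], hidden), dtype=word_embedding.dtype)
    out[~is_prompt] = word_embedding[input_ids[~is_prompt]]
    out[is_prompt] = prompt_table[input_ids[is_prompt] - vocab_size]
    return out
```

In the 500-token example above, the 400 visual placeholders with IDs vocab_size .. vocab_size+399 would select 400 of the 800 prompt_table rows, while the 100 text token IDs would go through the ordinary embedding lookup.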