NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Handling kv-cache in multi-modal GPT #51

Closed. Selimonder closed this issue 1 year ago.

Selimonder commented 1 year ago

Hey,

Firstly, a big thank you for the fantastic work! 🎊

I'm currently attempting to extend your GPT2 example into a multi-modal GPT. Essentially, I duplicated the tensorrt_llm/models/gpt directory and introduced a new embedding along with a head for the new modality. I rely on examples/gpt/build.py to build my model and on examples/gpt/run.py for generation.

My main challenge lies in supporting the kv-cache, as there's a shift in model behavior between the initial and subsequent passes:

  1. First pass (context): All the hidden states are passed to the decoder blocks. The flow is as follows:

    gpt_decoder_layer(concat([m1_embedding(m1_inputs), m2_embedding(m2_inputs)], dim=1))
  2. Subsequent passes (generation): Only the hidden state of the second modality gets directed to the decoder blocks. This is because the hidden states from the first modality are already stored in the kv-cache, and we are merely decoding for the second modality. In this case, the length of m2_inputs is 1, given the kv-cache. Here's the flow:

    gpt_decoder_layer(m2_embedding(m2_inputs))

I made an initial attempt with this logic:

if shape(input_ids.data, 1) == 1:
  # If we receive a single token, execute the kv-cache pass (generation)
  hidden_states = gpt_decoder_layer(m2_embedding(m2_inputs))
else:
  # If all tokens are received, execute a full pass (context)
  hidden_states = gpt_decoder_layer(concat([m1_embedding(m1_inputs), m2_embedding(m2_inputs)], dim=1))

However, it didn't quite pan out as I had hoped.

Would you be able to guide me on how to design a model that can support the conditional flow I mentioned above?

Do let me know if you need any further clarification on any aspect. Thanks in advance!

Selimonder commented 1 year ago

I may have found a solution 🤔

input_length = shape(hidden_states, 1)
x = gpt_decoder_layer(m2_embedding(m2_inputs)) if trt_llm.functional.eq(input_length, 1) else gpt_decoder_layer(concat([m1_embedding(m1_inputs), m2_embedding(m2_inputs)], dim=1))

Switching to a one-liner and using eq from functional seems to do the trick (unless I am measuring it wrong). I am still eager to hear how the authors would handle this problem.

PS: actually my solution was invalid; I'm still looking for help with this issue 😄

jdemouth-nvidia commented 1 year ago

Hi @Selimonder,

Thanks a lot for your interest in TensorRT-LLM.

I'm sorry that I don't have a lot of time to give you a proper answer (I'm about to catch a plane), but let me suggest a few things that could be useful. TensorRT-LLM constructs a TensorRT graph behind the scenes, and that graph is compiled into a TensorRT engine. For your solution to work, you need to add the if-then-else construct to the TensorRT graph itself.

For that, the TensorRT API has a function called add_if_conditional that produces an If-Then-Else node in the graph. That function is not exposed in TensorRT-LLM, as we have not yet worked on an LLM that required it, but you can start from one of the "examples" in tensorrt_llm.functional to create your own if_then_else function. There's an example of how to create an If node here.
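As a rough, untested sketch, such a helper might look like the following. The names default_trtnet, _create_tensor and Tensor.trt_tensor are the internals currently used inside tensorrt_llm.functional, so double-check them against the version you are on:

from tensorrt_llm.functional import Tensor, default_trtnet, _create_tensor

def if_then_else(condition: Tensor, true_value: Tensor, false_value: Tensor) -> Tensor:
    # `condition` must be a 0-dimensional boolean tensor.
    network = default_trtnet()
    conditional = network.add_if_conditional()
    conditional.set_condition(condition.trt_tensor)
    # Route both branch values through the conditional's input layers...
    true_branch = conditional.add_input(true_value.trt_tensor).get_output(0)
    false_branch = conditional.add_input(false_value.trt_tensor).get_output(0)
    # ...and merge them with an output layer that selects one of them at runtime.
    output_layer = conditional.add_output(true_branch, false_branch)
    return _create_tensor(output_layer.get_output(0), output_layer)

Keep in mind that TensorRT requires the outputs of the two branches to have the same type and compatible shapes.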

I hope it'll unblock you. If not, let us know ;)

Thanks, Julien

Selimonder commented 1 year ago

Hey @jdemouth-nvidia,

I appreciate your prompt and detailed response.

Initially, I considered using the where function from the functional API in tensorrt-llm. However, I ran into a shape constraint that prevented me from using outputs of differing lengths, specifically [batch_size, t, n_embd] vs [batch_size, t + n, n_embd]. Following your recommendation, I proceeded to implement an if_then_else function, and then stumbled upon this error:

[10/22/2023-02:22:30] [TRT] [E] 4: kOPT values for profile 0 violate shape constraints: (Unnamed Layer* 57) [Output]: dimensions not compatible for if-conditional outputs Condition '==' violated: 1 != 2.
[10/22/2023-02:22:30] [TRT] [E] 4: [shapeCompiler.cpp::evaluateShapeChecks::1311] Error Code 4: Internal Error (kOPT values for profile 0 violate shape constraints: (Unnamed Layer* 57) [Output]: dimensions not compatible for if-conditional outputs Condition '==' violated: 1 != 2.)

Digging deeper into tensorrt.IIfConditional in the TensorRT documentation, I noticed the following constraints:

  • Both the trueSubgraph and falseSubgraph must be defined.
  • The number of output tensors in both subgraphs should match.
  • The type and shape of each output tensor from the true/false subgraphs must be identical.

The if_then_else function works as expected when the two branch outputs have identical shapes. However, unless I am reading it wrong, that doesn't match my case: in the first pass (condition is False) I need to concatenate both modalities, whereas in the subsequent passes (condition is True) the input's time dimension differs, so the [batch_size, time, n_embd] shapes of the two branches cannot be identical.

Could you share your thoughts on how to navigate this?

Selimonder commented 1 year ago

Actually, I found a much easier (possibly less efficient) alternative: given that constructing a graph with conditional flows and varying input shapes isn't straightforward in TensorRT, I created two separate engines and tweaked generation.py to handle two runtimes. The first runtime handles the concatenated hidden states, while the second focuses solely on the single modality, addressing the kv-caching scenario I described earlier.
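In rough terms, the runtime side now looks something like the sketch below (simplified; the engine file names are placeholders and the real plumbing lives in my modified generation.py):

import tensorrt as trt

trt_logger = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(trt_logger)

def load_engine(path):
    with open(path, "rb") as f:
        return trt_runtime.deserialize_cuda_engine(f.read())

# Engine 1: context phase, consumes the concatenated m1/m2 hidden states.
context_engine = load_engine("multimodal_gpt_context.engine")
# Engine 2: generation phase, consumes a single m2 token plus the kv-cache.
generation_engine = load_engine("multimodal_gpt_generation.engine")

context_ctx = context_engine.create_execution_context()
generation_ctx = generation_engine.create_execution_context()

# The first step runs context_ctx once to populate the kv-cache; every
# subsequent decoding step runs generation_ctx with one new token.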

juney-nvidia commented 1 year ago

Thanks for resolving this with your nice idea :). As an alternative to building two separate TRT engines, you can build two TensorRT optimization profiles sharing a single TensorRT engine (you can refer to this code), which helps reduce memory usage.
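At the TensorRT level, the idea looks roughly like the sketch below (simplified, with a placeholder input name and shape ranges; the TensorRT-LLM builder wraps these calls for you):

import tensorrt as trt

builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()

# ... populate `network` with the model definition ...

# Profile 0: context phase, covering the full (concatenated) sequence length.
context_profile = builder.create_optimization_profile()
context_profile.set_shape("input_ids", (1, 1), (1, 512), (1, 1024))
config.add_optimization_profile(context_profile)

# Profile 1: generation phase, exactly one token per step.
generation_profile = builder.create_optimization_profile()
generation_profile.set_shape("input_ids", (1, 1), (1, 1), (1, 1))
config.add_optimization_profile(generation_profile)

# Both profiles are baked into the same serialized engine; at runtime the
# execution context selects one with set_optimization_profile_async().
serialized_engine = builder.build_serialized_network(network, config)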

Since it has already been resolved with your own idea, I will mark it as "closed". If you run into any other issues, feel free to ask.

June

jdemouth-nvidia commented 1 year ago

Neat! Great job, @Selimonder. I'll talk to the TensorRT team to see if they have an alternative approach to suggest. In other words, is there a way in TensorRT to deal with the different shapes? And, if not, is that an improvement they should consider?

Selimonder commented 1 year ago

Another update about this topic:

I've got a working model, but despite having the same input/output lengths, the custom model (50 tokens per second) is much slower than the original GPT2 (180 tokens per second). I suspect the two-runtime approach is the cause of the slowdown; would you agree?

I am now looking into @juney-nvidia's suggestion:

  "build two TensorRT optimization profiles sharing a single TensorRT engine (you can refer to this code), which helps reduce memory usage."

Could you perhaps elaborate on how one could implement this? Should we basically create two networks and then use ._add_optimization_profile?

cc: @jdemouth-nvidia