I may have found a solution 🤔
```python
input_length = shape(hidden_states, 1)
x = gpt_decoder_layer(m2_embedding(m2_inputs)) if trt_llm.functional.eq(input_length, 1) else gpt_decoder_layer(concat([m1_embedding(m1_inputs), m2_embedding(m2_inputs)], dim=1))
```
Switching to a one-liner and using `eq` from `functional` seems to do the trick (unless I am measuring it wrong). I am still eager to hear how the authors would handle this problem.
PS: Actually, my solution was invalid; I am still looking for help with this issue 😄
Hi @Selimonder,
Thanks a lot for your interest in TensorRT-LLM.
I'm sorry that I don't have a lot of time to give you a proper answer (I'm about to catch a plane), but let me suggest a few things that could be useful. TensorRT-LLM constructs a TensorRT graph behind the scenes, and that graph is compiled into a TensorRT engine. For your solution to work, you need to add the `if .. then .. else` construct to the TensorRT graph.
For that, the TensorRT API has a function called `add_if_conditional`, which produces an `If-Then-Else` node in the graph. That function is not exposed in TensorRT-LLM, as we have not worked on an LLM that required it, but you can start from one of the "examples" in `tensorrt_llm.functional` to create your own `if_then_else` function. There's an example of how to create an `If` node here.
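A minimal sketch of what such a helper could look like at the raw TensorRT level (a real `tensorrt_llm.functional` helper would additionally need to unwrap and re-wrap the module's tensor objects, so treat this as an outline rather than working TensorRT-LLM code):

```python
import tensorrt as trt

def if_then_else(network: trt.INetworkDefinition,
                 condition: trt.ITensor,
                 true_input: trt.ITensor,
                 false_input: trt.ITensor) -> trt.ITensor:
    # `condition` must be a 0-D BOOL tensor; TensorRT also requires the
    # outputs of the two subgraphs to have identical shape and dtype.
    conditional = network.add_if_conditional()
    conditional.set_condition(condition)

    # Route each branch's input through the conditional.
    true_in = conditional.add_input(true_input).get_output(0)
    false_in = conditional.add_input(false_input).get_output(0)

    # The subgraphs here are plain identities; a real helper would build
    # the per-branch layers from true_in / false_in instead.
    true_out = network.add_identity(true_in).get_output(0)
    false_out = network.add_identity(false_in).get_output(0)

    return conditional.add_output(true_out, false_out).get_output(0)
```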
I hope it'll unblock you. If not, let us know ;)
Thanks, Julien
Hey @jdemouth-nvidia,
I appreciate your prompt and detailed response.
Initially, I pondered using the `where` function from the `functional` API within tensorrt-llm. However, I encountered a shape constraint which prevented me from using outputs of differing lengths, specifically `[batch_size, t, n_embd]` vs `[batch_size, t + n, n_embd]`. Following your recommendation, I proceeded to implement an `if_then_else` function, then stumbled upon this error:
```
[10/22/2023-02:22:30] [TRT] [E] 4: kOPT values for profile 0 violate shape constraints: (Unnamed Layer* 57) [Output]: dimensions not compatible for if-conditional outputs Condition '==' violated: 1 != 2.
[10/22/2023-02:22:30] [TRT] [E] 4: [shapeCompiler.cpp::evaluateShapeChecks::1311] Error Code 4: Internal Error (kOPT values for profile 0 violate shape constraints: (Unnamed Layer* 57) [Output]: dimensions not compatible for if-conditional outputs Condition '==' violated: 1 != 2.)
```
Digging deeper into `tensorrt.IIfConditional` in the TensorRT documentation, I noticed the following constraints:
- Both the trueSubgraph and falseSubgraph must be defined.
- The number of output tensors in both subgraphs should match.
- The type and shape of each output tensor from the true/false subgraphs must be identical.
The `if_then_else` function works as expected when the inputs are identical. However, perhaps I am reading it wrong, but it doesn't align with my specific needs. In the first pass (if False), I need to concatenate both modalities. But in the subsequent phase (if True), the input's time dimension differs, represented as `[batch_size, time, n_embd]`.
Could you share your thoughts on how to navigate this?
Actually, I found a much easier (possibly less efficient) alternative: given that constructing a graph with conditional flows and varying input shapes isn't straightforward with TensorRT, I created two separate engines and tweaked `generation.py` to handle two runtimes. The first runtime manages the concatenated hidden states, while the second focuses solely on a single modality, addressing the kv-caching scenario I previously discussed.
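Roughly, the dispatch looks like this (a hypothetical sketch: the session objects, their `run` method, and `sample` are placeholder names, not the actual `generation.py` API):

```python
# Hypothetical sketch of the two-engine dispatch. The two sessions stand
# in for runtimes built from the two engines; `sample` is the sampling step.
def generate(context_session, generation_session, sample,
             m1_inputs, m2_inputs, max_new_tokens):
    # First pass: the "context" engine sees both modalities and fills
    # the kv-cache.
    logits, kv_cache = context_session.run(m1_inputs, m2_inputs)
    token = sample(logits)
    output = [token]

    # Subsequent passes: the "generation" engine only sees the newest
    # token of the second modality and reuses the kv-cache.
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = generation_session.run(token, kv_cache)
        token = sample(logits)
        output.append(token)

    return output
```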
Thanks for resolving this with your nice idea :). In addition to building two separate TRT engines, another alternative approach is to build two TensorRT optimization profiles sharing a single TensorRT engine (you can refer to this code), which can help save memory.
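For reference, declaring the two profiles on the builder config looks roughly like this (a minimal sketch at the plain TensorRT level; the input name `input_ids` and the shape bounds are placeholders):

```python
import tensorrt as trt

def add_two_profiles(builder: trt.Builder, config: trt.IBuilderConfig):
    # Profile 0: context phase, where both modalities are concatenated,
    # so the sequence length can be large. Shapes are (min, opt, max)
    # over (batch, seq_len).
    context_profile = builder.create_optimization_profile()
    context_profile.set_shape("input_ids", (1, 1), (4, 512), (8, 1024))

    # Profile 1: generation phase, one token per step.
    gen_profile = builder.create_optimization_profile()
    gen_profile.set_shape("input_ids", (1, 1), (4, 1), (8, 1))

    config.add_optimization_profile(context_profile)  # index 0
    config.add_optimization_profile(gen_profile)      # index 1
```

At runtime the single engine is then executed with one execution context per profile, selected via `IExecutionContext.set_optimization_profile_async`.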
Since it has already been resolved with your own idea, I will mark it as "closed". If you run into any other issues, feel free to ask.
June
Neat! Great job @Selimonder. I'll talk to the TensorRT team to see if they have an alternative approach to suggest. In other words, is there a way in TensorRT to deal with the different shapes? And, if not, is that an improvement they should consider?
Another update about this topic:
I've got a working model, but despite having the same input/output lengths, the custom model (50 tokens per second) seems much slower than the original GPT2 (180 tokens per second). I suspect the two-runtime idea is the cause of the slowdown. Would you agree?
I am now looking into @juney-nvidia's suggestion:

> another alternative approach is to build two TensorRT profiles sharing a single TensorRT engine (you can refer to this code), which can help save the memory usage.
Could you perhaps elaborate on how one could implement this? Should we basically create two networks and then use `._add_optimization_profile`?
cc: @jdemouth-nvidia
Hey,
Firstly, a big thank you for the fantastic work! 🎊
I'm currently attempting to extend your GPT2 example to develop a multi-modal GPT. Essentially, I duplicated `tensorrt_llm/models/gpt` and introduced a new embedding along with a head to cater to a new modality. I rely on `examples/gpt/build.py` for building my model and use `examples/gpt/run.py` for generation. My main challenge lies in supporting the kv-cache, as there's a shift in model behavior between the initial and subsequent passes:
First pass (context): All the hidden states are passed to the decoder blocks. The flow is as follows:
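(A sketch of this flow, using the same `m1_embedding` / `m2_embedding` / `gpt_decoder_layer` names that appear elsewhere in this thread:)

```python
# Context pass: embed both modalities, concatenate along the time axis,
# and feed the full sequence through the decoder stack.
x = gpt_decoder_layer(concat([m1_embedding(m1_inputs), m2_embedding(m2_inputs)], dim=1))
```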
Subsequent passes (generation): Only the hidden state of the second modality gets directed to the decoder blocks. This is because the hidden states from the first modality are already stored in the kv-cache, and we are merely decoding for the second modality. In this case, the length of `m2_inputs` is `1`, given the kv-cache. Here's the flow:
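(Again a sketch with the same names:)

```python
# Generation pass: only the latest m2 token is embedded; the rest of the
# sequence is already covered by the kv-cache.
x = gpt_decoder_layer(m2_embedding(m2_inputs))
```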
I made an initial attempt at handling this conditional switch in the model definition; however, it didn't quite pan out as I had hoped.
Would you be able to guide me on how to design a model that can support the conditional flow I mentioned above?
Do let me know if you need any further clarification on any aspect. Thanks in advance!