I have a model that combines two components:

- an image encoder (ViT-G), and
- a Mistral LLM.
At a high level, the output of the image encoder is processed, concatenated with other tokens, and then fed into the Mistral LLM.
In my current implementation, I have a single class that initializes these models and performs a forward pass on them.
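For reference, the current pure-PyTorch version looks roughly like the sketch below; the class and attribute names are placeholders rather than my exact code, and it assumes a Hugging Face-style Mistral that exposes `get_input_embeddings()`:

```python
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    """Rough sketch of the combined model: ViT-G encoder -> projection -> Mistral."""

    def __init__(self, image_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder  # ViT-G
        self.projector = projector          # maps vision features to the LLM hidden size
        self.llm = llm                      # Mistral (Hugging Face-style causal LM)

    def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # Encode the image into a sequence of visual features.
        vision_feats = self.image_encoder(pixel_values)            # (B, N_img, D_vis)
        vision_embeds = self.projector(vision_feats)               # (B, N_img, D_llm)

        # Embed the text tokens and prepend the visual tokens.
        text_embeds = self.llm.get_input_embeddings()(input_ids)   # (B, N_txt, D_llm)
        inputs_embeds = torch.cat([vision_embeds, text_embeds], dim=1)

        # Run the LLM on the fused sequence.
        return self.llm(inputs_embeds=inputs_embeds).logits
```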
For running inference on this model, I'm exploring TensorRT and TensorRT-LLM. It seems these components can be compiled individually: Mistral is supported in TensorRT-LLM, and ViT-G can be compiled with TensorRT.
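Building the ViT-G engine on its own already seems doable, e.g. with Polygraphy's TensorRT backend once the encoder is exported to ONNX (the file names below are placeholders); the Mistral engine would presumably be built separately with the usual TensorRT-LLM workflow (e.g. convert_checkpoint.py followed by trtllm-build):

```python
# Sketch: build a standalone TensorRT engine for the ViT-G encoder.
# Assumes the encoder has already been exported to ONNX as "vit_g.onnx"
# (file names are placeholders).
from polygraphy.backend.trt import (
    CreateConfig,
    EngineFromNetwork,
    NetworkFromOnnxPath,
    SaveEngine,
)

build_engine = EngineFromNetwork(
    NetworkFromOnnxPath("vit_g.onnx"),
    config=CreateConfig(fp16=True),  # FP16 build; adjust precision/profiles as needed
)

# Polygraphy loaders are lazy; calling the SaveEngine loader builds the engine
# and serializes it to disk.
SaveEngine(build_engine, path="vit_g.engine")()
```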
My question is: How can I leverage both TensorRT and TensorRT-LLM to run inference on this custom vision-language architecture? Specifically:
1. Is it possible to compile and optimize the two components (ViT-G and Mistral) separately using their respective tools (TensorRT and TensorRT-LLM)?
2. If so, how can I combine the optimized components at inference time to run the entire vision-language pipeline efficiently?

Any guidance or examples on this would be greatly appreciated. Thank you!
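To make the question more concrete, this is the kind of two-stage pipeline I have in mind, loosely modeled on the TensorRT-LLM multimodal examples: run the ViT-G TensorRT engine first, then inject its output embeddings into the Mistral engine through TensorRT-LLM's prompt-embedding-table ("p-tuning") mechanism. The tensor names and the `generate()` arguments (`prompt_table`, `prompt_tasks`) are my assumptions from those examples and may well differ between TensorRT-LLM versions:

```python
# Sketch of the two-stage inference I am hoping for. Engine paths, tensor names,
# and the prompt_table / prompt_tasks arguments are assumptions, not verified API.
import numpy as np
import torch
from polygraphy.backend.trt import EngineFromBytes, TrtRunner
from tensorrt_llm.runtime import ModelRunner

# --- Stage 1: ViT-G image encoder compiled with TensorRT --------------------
with open("vit_g.engine", "rb") as f:
    vit_engine = EngineFromBytes(f.read())

pixel_values = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder image
with TrtRunner(vit_engine) as runner:
    # "pixel_values" / "image_embeds" stand in for whatever I/O names the ONNX export used.
    vision_embeds = runner.infer({"pixel_values": pixel_values})["image_embeds"]

# --- Stage 2: Mistral compiled with TensorRT-LLM ----------------------------
llm_runner = ModelRunner.from_dir(engine_dir="mistral_trtllm_engine")

# The visual tokens go in as "virtual" prompt embeddings; the input ids contain
# fake ids (>= vocab_size) that index into the prompt table, followed by the
# real text tokens; this mirrors the multimodal examples, as far as I can tell.
hidden_size = vision_embeds.shape[-1]
prompt_table = torch.from_numpy(vision_embeds).reshape(-1, hidden_size).cuda()
num_visual_tokens = prompt_table.shape[0]

vocab_size = 32000  # Mistral vocab size; fake ids start just above it
fake_ids = torch.arange(vocab_size, vocab_size + num_visual_tokens, dtype=torch.int32)
text_ids = torch.tensor([1, 733, 16289, 28793], dtype=torch.int32)  # placeholder prompt tokens
input_ids = torch.cat([fake_ids, text_ids])

outputs = llm_runner.generate(
    batch_input_ids=[input_ids],  # list with one sequence
    max_new_tokens=128,
    prompt_table=prompt_table,    # assumed argument name; expected shape may differ
    prompt_tasks="0",             # assumed: one prompt-table task per batch entry
    end_id=2,                     # placeholder eos/pad ids for Mistral's tokenizer
    pad_id=2,
)
print(outputs)
```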