I have a model that combines two components:

- an image encoder (ViT-G), and
- a Mistral LLM.
At a high level, the output of the image encoder is processed, concatenated with other tokens, and then fed into the Mistral LLM.
In my current implementation, I have a single class that initializes these models and performs a forward pass on them.
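For reference, the current pure-PyTorch version looks roughly like the sketch below; the class and attribute names are placeholders rather than my exact code, and it assumes a Hugging Face-style Mistral that exposes `get_input_embeddings()`:

```python
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    """Rough sketch of the combined model: ViT-G encoder -> projection -> Mistral."""

    def __init__(self, image_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder  # ViT-G
        self.projector = projector          # maps vision features to the LLM hidden size
        self.llm = llm                      # Mistral (Hugging Face-style causal LM)

    def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # Encode the image into a sequence of visual features.
        vision_feats = self.image_encoder(pixel_values)            # (B, N_img, D_vis)
        vision_embeds = self.projector(vision_feats)               # (B, N_img, D_llm)

        # Embed the text tokens and prepend the visual tokens.
        text_embeds = self.llm.get_input_embeddings()(input_ids)   # (B, N_txt, D_llm)
        inputs_embeds = torch.cat([vision_embeds, text_embeds], dim=1)

        # Run the LLM on the fused sequence.
        return self.llm(inputs_embeds=inputs_embeds).logits
```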
For running inference on this model, I'm exploring TensorRT and TensorRT-LLM. It seems these components can be compiled individually: Mistral is supported in TensorRT-LLM, and ViT-G can be compiled with TensorRT.
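Building the ViT-G engine on its own already seems doable, e.g. with Polygraphy's TensorRT backend once the encoder is exported to ONNX (the file names below are placeholders); the Mistral engine would presumably be built separately with the usual TensorRT-LLM workflow (e.g. convert_checkpoint.py followed by trtllm-build):

```python
# Sketch: build a standalone TensorRT engine for the ViT-G encoder.
# Assumes the encoder has already been exported to ONNX as "vit_g.onnx"
# (file names are placeholders).
from polygraphy.backend.trt import (
    CreateConfig,
    EngineFromNetwork,
    NetworkFromOnnxPath,
    SaveEngine,
)

build_engine = EngineFromNetwork(
    NetworkFromOnnxPath("vit_g.onnx"),
    config=CreateConfig(fp16=True),  # FP16 build; adjust precision/profiles as needed
)

# Polygraphy loaders are lazy; calling the SaveEngine loader builds the engine
# and serializes it to disk.
SaveEngine(build_engine, path="vit_g.engine")()
```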
My question is: How can I leverage both TensorRT and TensorRT-LLM to run inference on this custom vision-language architecture? Specifically:
1. Is it possible to compile and optimize the two components (ViT-G and Mistral) separately using their respective tools (TensorRT and TensorRT-LLM)?
2. If so, how can I combine the optimized components at inference time to run the entire vision-language pipeline efficiently?

Any guidance or examples on this would be greatly appreciated. Thank you!
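To make the question more concrete, this is the kind of two-stage pipeline I have in mind, loosely modeled on the TensorRT-LLM multimodal examples: run the ViT-G TensorRT engine first, then inject its output embeddings into the Mistral engine through TensorRT-LLM's prompt-embedding-table ("p-tuning") mechanism. The tensor names and the `generate()` arguments (`prompt_table`, `prompt_tasks`) are my assumptions from those examples and may well differ between TensorRT-LLM versions:

```python
# Sketch of the two-stage inference I am hoping for. Engine paths, tensor names,
# and the prompt_table / prompt_tasks arguments are assumptions, not verified API.
import numpy as np
import torch
from polygraphy.backend.trt import EngineFromBytes, TrtRunner
from tensorrt_llm.runtime import ModelRunner

# --- Stage 1: ViT-G image encoder compiled with TensorRT --------------------
with open("vit_g.engine", "rb") as f:
    vit_engine = EngineFromBytes(f.read())

pixel_values = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder image
with TrtRunner(vit_engine) as runner:
    # "pixel_values" / "image_embeds" stand in for whatever I/O names the ONNX export used.
    vision_embeds = runner.infer({"pixel_values": pixel_values})["image_embeds"]

# --- Stage 2: Mistral compiled with TensorRT-LLM ----------------------------
llm_runner = ModelRunner.from_dir(engine_dir="mistral_trtllm_engine")

# The visual tokens go in as "virtual" prompt embeddings; the input ids contain
# fake ids (>= vocab_size) that index into the prompt table, followed by the
# real text tokens; this mirrors the multimodal examples, as far as I can tell.
hidden_size = vision_embeds.shape[-1]
prompt_table = torch.from_numpy(vision_embeds).reshape(-1, hidden_size).cuda()
num_visual_tokens = prompt_table.shape[0]

vocab_size = 32000  # Mistral vocab size; fake ids start just above it
fake_ids = torch.arange(vocab_size, vocab_size + num_visual_tokens, dtype=torch.int32)
text_ids = torch.tensor([1, 733, 16289, 28793], dtype=torch.int32)  # placeholder prompt tokens
input_ids = torch.cat([fake_ids, text_ids])

outputs = llm_runner.generate(
    batch_input_ids=[input_ids],  # list with one sequence
    max_new_tokens=128,
    prompt_table=prompt_table,    # assumed argument name; expected shape may differ
    prompt_tasks="0",             # assumed: one prompt-table task per batch entry
    end_id=2,                     # placeholder eos/pad ids for Mistral's tokenizer
    pad_id=2,
)
print(outputs)
```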