NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Question: Can Context FMHA be used to implement Transformer in a vision encoder for multimodal models? #2001

Open lmcl90 opened 1 month ago

lmcl90 commented 1 month ago

I see that the multimodal models in the examples all deploy their vision encoders with TensorRT directly. Why not use TensorRT-LLM for them as well? Are there known issues or challenges with integrating Context FMHA into vision encoders?

QiJune commented 1 month ago

Yes, you can try using TensorRT-LLM for the vision encoder. We have a BERT example and a DiT example, and the community has also contributed an SDXL model. I don't think it would be hard to develop a ViT model.
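For reference, a minimal sketch of what this could look like, modeled on the build flow in the BERT example (`examples/bert`). The engine name and the encoder body are placeholders, and API details such as `set_context_fmha` and `ContextFMHAType` have shifted between TensorRT-LLM releases, so check against the examples shipped with your version:

```python
# Minimal sketch, assuming the builder API used by examples/bert:
# construct an encoder network with Context FMHA enabled.
from tensorrt_llm.builder import Builder
from tensorrt_llm.network import net_guard
from tensorrt_llm.plugin.plugin import ContextFMHAType

builder = Builder()
builder_config = builder.create_builder_config(
    name='vit_encoder',    # hypothetical engine name
    precision='float16',   # Context FMHA requires fp16/bf16 activations
)

network = builder.create_network()
# Enable the fused multi-head attention kernel for the context phase.
network.plugin_config.set_context_fmha(ContextFMHAType.enabled)

with net_guard(network):
    # Define the ViT encoder here from tensorrt_llm.layers
    # (Attention, MLP, LayerNorm), the same way the BERT example
    # defines its encoder. Note that a vision encoder uses a
    # bidirectional (padding) attention mask, not a causal one.
    ...

engine = builder.build_engine(network, builder_config)
```

The main differences from the decoder-only models are that the attention mask is bidirectional and there is no KV cache or generation phase, which is exactly the configuration the BERT example already demonstrates.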

lmcl90 commented 1 month ago

@QiJune Thanks for your reply. I will give it a try.

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 15 days.