NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Question on how to perform cross-attention with FMHA kernel #1956

Open Ashwin-Ramesh2607 opened 1 month ago

Ashwin-Ramesh2607 commented 1 month ago

I am interested in performing multimodal cross-attention. I don't see any issue with self-attention in the encoder, since I can use the BertAttention plugin. However, in cross-attention the query comes from one modality (with seq_len x) and the key/value from another modality (with seq_len y), and it's possible that the two modalities have different sequence lengths.

Can I please get some guidance on how to accomplish this?

  1. Is there a way to run FMHA when q and kv have different sequence lengths? As far as I know, the BertAttention plugin falls back to the non-FMHA path in this case (see the sketch below for the shapes involved).
  2. If not, can I modify the plugin, or otherwise work around this, to make sure FMHA is used?
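
For concreteness, the computation I'm after is ordinary scaled dot-product cross-attention where the query and key/value sequence lengths differ. A minimal PyTorch sketch of what I mean (illustrative only, not TensorRT-LLM API; shapes and names are made up):

```python
import torch
import torch.nn.functional as F

# Query comes from modality A, key/value from modality B.
batch, n_heads, head_dim = 2, 8, 64
seq_q, seq_kv = 100, 37  # the two modalities have different sequence lengths

q = torch.randn(batch, n_heads, seq_q, head_dim)
k = torch.randn(batch, n_heads, seq_kv, head_dim)
v = torch.randn(batch, n_heads, seq_kv, head_dim)

# Attention scores have shape [batch, n_heads, seq_q, seq_kv], so the math
# is well-defined for seq_q != seq_kv; my question is only whether the
# fused (FMHA) kernel path supports it.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 100, 64])
```

I'd like to reproduce this computation through the FMHA path in a TensorRT-LLM plugin.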
github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.