NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Question on how to perform cross-attention with FMHA kernel #1956

Open Ashwin-Ramesh2607 opened 1 month ago

Ashwin-Ramesh2607 commented 1 month ago

I am interested in performing multimodal cross-attention. I don't see any issue with self-attention in the encoder, since I can use the BertAttention plugin. However, in cross-attention the query comes from one modality (with seq_len x) and the key/value from another modality (with seq_len y), and it's possible that the two modalities have different sequence lengths.

Can I please get some guidance on how to accomplish this?

  1. Is there a way to run FMHA when q and kv have different sequence lengths? As far as I know, the BertAttention plugin falls back to the non-FMHA path in this case (see the sketch below for the shapes involved).
  2. If not, can I modify the plugin, or otherwise work around this, to make sure FMHA is used?
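
For concreteness, the computation I'm after is ordinary scaled dot-product cross-attention where the query and key/value sequence lengths differ. A minimal PyTorch sketch of what I mean (illustrative only, not TensorRT-LLM API; shapes and names are made up):

```python
import torch
import torch.nn.functional as F

# Query comes from modality A, key/value from modality B.
batch, n_heads, head_dim = 2, 8, 64
seq_q, seq_kv = 100, 37  # the two modalities have different sequence lengths

q = torch.randn(batch, n_heads, seq_q, head_dim)
k = torch.randn(batch, n_heads, seq_kv, head_dim)
v = torch.randn(batch, n_heads, seq_kv, head_dim)

# Attention scores have shape [batch, n_heads, seq_q, seq_kv], so the math
# is well-defined for seq_q != seq_kv; my question is only whether the
# fused (FMHA) kernel path supports it.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 100, 64])
```

I'd like to reproduce this computation through the FMHA path in a TensorRT-LLM plugin.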
github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.