Wasteful computations in cross attention?

NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

https://nvidia.github.io/TensorRT-LLM

Apache License 2.0


Wasteful computations in cross attention? #2026

Open thefacetakt opened 3 months ago

thefacetakt commented 3 months ago

As far as I understand, when using cross_attention we first compute qkv = self.qkv(hidden_states), and then cross_qkv = self.qkv(encoder_output). But later only q from qkv is used, and only kv from cross_qkv is used.

This seems like wasteful computation. Perhaps there should be two matrices: q and kv. (This would also improve quantization a bit, since there would be separate scales.)
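To make the proposal concrete, here is a minimal NumPy sketch, not the TensorRT-LLM implementation. It assumes a fused weight laid out as columns [Q | K | V] with equal widths X; the names W_qkv, W_q, W_kv and all sizes are hypothetical. It compares the fused path described above with a split q / kv path and checks that both give the same q, k and v.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not taken from any real model).
H, X = 8, 8              # hidden size and width of each of Q, K, V
dec_len, enc_len = 1, 5  # one decoder token, five encoder tokens

# Hypothetical fused weight; assumed column layout is [Q | K | V].
W_qkv = rng.standard_normal((H, 3 * X))
hidden_states = rng.standard_normal((dec_len, H))
encoder_output = rng.standard_normal((enc_len, H))

# Fused path: both inputs go through the full QKV projection,
# then only part of each result is kept.
qkv = hidden_states @ W_qkv          # K and V columns are discarded
cross_qkv = encoder_output @ W_qkv   # Q columns are discarded
q_fused = qkv[:, :X]
k_fused = cross_qkv[:, X:2 * X]
v_fused = cross_qkv[:, 2 * X:]

# Split path: one Q matrix for the decoder side, one KV matrix for the encoder side.
W_q, W_kv = W_qkv[:, :X], W_qkv[:, X:]
q_split = hidden_states @ W_q
kv_split = encoder_output @ W_kv
k_split, v_split = kv_split[:, :X], kv_split[:, X:]

same_outputs = (
    np.allclose(q_fused, q_split)
    and np.allclose(k_fused, k_split)
    and np.allclose(v_fused, v_split)
)
print("split path matches fused path:", same_outputs)
```

With separate W_q and W_kv tensors, a quantizer could also compute one scale per matrix instead of one shared scale for the fused weight, which is the quantization benefit mentioned above.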

QiJune commented 3 months ago

@symphonylyh Could you please take a look? Thanks

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

symphonylyh commented 2 months ago

@thefacetakt Correct. This is done to be consistent with the fused QKV gemm used in all models, but it does indeed introduce some redundant computation.

However, I think it won't be very significant in the entire computation:

1st qkv gemm on hidden states: this is arguably redundant. But since the length of the hidden states is 1 each time, the cost of this redundant computation might not be large, i.e. [1, H] x [H, 3X] vs. [1, H] x [H, X].
2nd qkv gemm on encoder output: this matters less, because this gemm is done only once during the entire run. After that, the data is saved in the cross kv cache and the encoder output is no longer needed.
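The two cases above can be put into rough numbers with a FLOP count. This is a back-of-the-envelope sketch: it uses the common estimate of 2·m·k·n FLOPs for an [m, k] x [k, n] gemm, and the values of H, X and the encoder length S are hypothetical.

```python
def gemm_flops(m: int, k: int, n: int) -> int:
    """Approximate FLOPs of an [m, k] x [k, n] matrix multiply (multiply + add)."""
    return 2 * m * k * n

# Hypothetical sizes, chosen only for illustration.
H = 1024  # hidden size
X = 1024  # width of each of Q, K, V
S = 128   # encoder sequence length

# (1) gemm on hidden states, repeated each time with length 1.
step_fused = gemm_flops(1, H, 3 * X)   # [1, H] x [H, 3X]
step_needed = gemm_flops(1, H, X)      # [1, H] x [H, X], Q only

# (2) gemm on encoder output, done once; only K and V are kept.
enc_fused = gemm_flops(S, H, 3 * X)    # [S, H] x [H, 3X]
enc_needed = gemm_flops(S, H, 2 * X)   # [S, H] x [H, 2X], K and V only

print("per-step fused / needed:", step_fused / step_needed)
print("one-time fused / needed:", enc_fused / enc_needed)
```

Under these assumptions the per-step gemm does 3x the needed work and the one-time encoder gemm does 1.5x, independent of the sizes chosen. FLOP counts are only a proxy here: whether this amounts to a measurable end-to-end slowdown depends on the rest of the model and would need profiling.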

Do you agree?

We can still investigate (1) to see whether it's critical enough. Otherwise, staying consistent with other models at the cost of a <1% slowdown may be acceptable.
