NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
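For context, a minimal sketch of the Python API described above, following the documented `LLM`/`SamplingParams` high-level interface of recent TensorRT-LLM releases (entry points and argument names vary between versions):

```python
# Minimal sketch of the high-level Python API; exact names and
# arguments differ across TensorRT-LLM releases.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen-7B-Chat")  # builds a TensorRT engine under the hood
params = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(["Hello, my name is"], params):
    print(output.outputs[0].text)
```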

Can TensorRT-LLM support the modified QWenAttention #1037

Closed: Hukongtao closed this issue 2 months ago

Hukongtao commented 6 months ago

As the title describes, I slightly modified QWenAttention. Before: [screenshot of the original attention code] After: [screenshot of the modified code, with `self.is_causal = False`]

I used TensorRT-LLM to accelerate the modified Qwen model, but the outputs are numerically inconsistent with the original model. I want to know whether there is a way to support the modified QWenAttention.
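(The screenshots are not preserved; the following is a hypothetical PyTorch-style sketch of the kind of change being described, where `is_causal = False` skips the causal mask. The names here are illustrative, not the actual Qwen code.)

```python
# Hypothetical sketch (not the actual Qwen code): with is_causal=False
# the upper-triangular mask is skipped, so attention is bidirectional.
import torch

def attention_probs(q, k, is_causal: bool):
    # q, k: [batch, heads, seq_len, head_dim]
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if is_causal:
        seq_len = scores.size(-1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1)
```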

Hukongtao commented 6 months ago

For attention_mask_type, I tried both causal and bidirectional, but the results showed no difference.

https://github.com/NVIDIA/TensorRT-LLM/blob/3d56a445e8ebf888e78be638faf6beec0a78f3c2/tensorrt_llm/models/qwen/model.py#L133

What is the difference between AttentionMaskType.causal and AttentionMaskType.bidirectional? Why does the hidden_state come out the same even though the types are different?
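(Not from the thread, but for context: conceptually, `AttentionMaskType.causal` lets position i attend only to positions 0..i, which is standard autoregressive decoding, while `AttentionMaskType.bidirectional` lets every position attend to every other position, encoder-style. A minimal PyTorch illustration of the two masks:)

```python
# Illustration of the two mask types for a sequence of length 4.
import torch

seq_len = 4
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
bidirectional = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal.int())         # lower-triangular: row i attends to columns 0..i only
print(bidirectional.int())  # all ones: every position attends to every position
```

One plausible explanation for seeing no difference, consistent with the fix in the next comment, is that the fused GPT attention plugin path applies its own hard-coded masking and ignores the Python-level setting.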

Hukongtao commented 6 months ago

Hi, I solved my problem by changing `true` to `false` here: https://github.com/NVIDIA/TensorRT-LLM/blob/3d56a445e8ebf888e78be638faf6beec0a78f3c2/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp#L1439
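(Editorial note: presumably the flag at the referenced line hard-codes unidirectional, i.e. causal, masking inside the C++ attention plugin, which would explain why changing `attention_mask_type` at the Python level had no visible effect. After editing the plugin source you need to rebuild the C++ runtime, e.g. via `scripts/build_wheel.py`, and then rebuild the engine. For completeness, a sketch of where the mask type is selected on the Python side, per the model.py line linked above; the exact `Attention` constructor signature differs between TensorRT-LLM versions:)

```python
# Sketch of the Python-side mask selection; constructor arguments
# vary across TensorRT-LLM versions.
from tensorrt_llm.functional import AttentionMaskType
from tensorrt_llm.layers import Attention

attn = Attention(
    hidden_size=4096,
    num_attention_heads=32,
    attention_mask_type=AttentionMaskType.bidirectional,  # instead of causal
)
```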