NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[feature request] qwen model's query logn-scaling attn #836

Open handoku opened 10 months ago

handoku commented 10 months ago

Qwen uses Qwen-style dynamic NTK and logn-scaling to generate better text for long-context inputs.

The trtllm implementation of Qwen does not support logn-scaling right now, which results in low-quality outputs.

I would like to provide an implementation. However, it's a little difficult for me to understand gpt_attention.

My vanilla thought is to multiply the q tensor by a logn tensor before calling gpt_attention. But every seq_len_idx value of the q tensor is needed for calculating log_{seq_len_trained}(seq_len_idx). I don't know how to get the seq_len_idx value, especially in packed tensor mode.

Would you please give some help on this? Is there a convenient way to achieve this (even in a dirty, hard-coded way)?
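
For reference, Qwen's logn scaling multiplies the query at (1-indexed) position i by log_{seq_len_trained}(i) once i exceeds the trained context length. Below is a minimal host-side sketch of just that math, assuming an unpacked [seq_len, head_dim] layout for a single head; the function names and the start_pos/trained_len parameters are illustrative assumptions, not TRT-LLM API:

#include <math.h>

// Scale factor for the query at absolute (1-indexed) position `pos`:
// log_{trained_len}(pos) when pos exceeds the trained length, else 1.
inline float logn_scale(int pos, int trained_len)
{
    return pos > trained_len
        ? logf((float) pos) / logf((float) trained_len)
        : 1.0f;
}

// Multiply each query vector in-place by its position's logn factor.
// `q` holds [seq_len, head_dim] for one head; `start_pos` is the absolute
// position of the first token (nonzero when a KV cache holds earlier tokens).
void apply_logn_scaling(float* q, int seq_len, int head_dim, int start_pos, int trained_len)
{
    for (int i = 0; i < seq_len; ++i)
    {
        const float s = logn_scale(start_pos + i + 1, trained_len);
        for (int d = 0; d < head_dim; ++d)
        {
            q[i * head_dim + d] *= s;
        }
    }
}

The hard part in TRT-LLM, as the question above notes, is recovering that per-token position inside gpt_attention's packed layout, not the scaling itself.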

Tlntin commented 10 months ago

I finished it like this: commit link

C++ implementation: link

handoku commented 10 months ago

> I finished it like this: commit link
>
> C++ implementation: link

@Tlntin Hi, thank you for the reply.

I am using trtllm release v0.7.0 and Qwen-7B 1.0. I added another RotaryScalingType and modified the update_rotary_base_n_scale function directly to support Qwen-style dynamic NTK, as follows:

inline __device__ float update_rotary_base_dynamic_ntk(
    const int kv_seq_len, const int max_positions, const int embed_dim, const float base, const float scale)
{
    // Qwen-style dynamic NTK: ntk_alpha = 2^ceil(log2(kv_seq_len / max_positions) + 1) - 1
    const float ntk_alpha = exp2f(ceilf(log2f(1.f * kv_seq_len / max_positions) + 1.f)) - 1.f;
    // rescale the RoPE base: base * ntk_alpha^(dim / (dim - 2))
    return base * powf(ntk_alpha, embed_dim / (embed_dim - 2.f));
}

inline __device__ void update_rotary_base_n_scale(float& base, float& scale, RotaryScalingType const scale_type,
    const int rot_embed_dim, const int max_positions, const int seq_len)
{
    // only update the base and/or scale if needed based on scale_type
    if (scale_type == RotaryScalingType::kDYNAMIC)
    {
        if (seq_len > max_positions)
        {
            base = update_rotary_base(seq_len, max_positions, rot_embed_dim, base, scale);
        }
        scale = 1.0f; // scale is only used in base for dynamic scaling
    }
    else if (scale_type == RotaryScalingType::kDYNAMIC_NTK_QWEN)
    {
        if (seq_len > max_positions)
        {
            base = update_rotary_base_dynamic_ntk(seq_len, max_positions, rot_embed_dim, base, scale);
        }
        scale = 1.0f; // scale is only used in base for dynamic scaling
    }
    else if (scale_type == RotaryScalingType::kLINEAR)
    {
        scale = 1.0f / scale;
    }
}
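
As a sanity check of the formula (the values here are my assumptions for Qwen-7B 1.0 — base 10000, rotary dim 128, trained length 8192 — not figures from this thread):

#include <math.h>
#include <stdio.h>

// Worked example of update_rotary_base_dynamic_ntk with assumed defaults.
int main()
{
    const int seq_len = 16384, max_positions = 8192, embed_dim = 128;
    const float base = 10000.f;
    // 2^ceil(log2(16384 / 8192) + 1) - 1 = 2^2 - 1 = 3
    const float ntk_alpha = exp2f(ceilf(log2f(1.f * seq_len / max_positions) + 1.f)) - 1.f;
    // 10000 * 3^(128 / 126) ≈ 30528
    const float new_base = base * powf(ntk_alpha, embed_dim / (embed_dim - 2.f));
    printf("ntk_alpha = %.1f, new base = %.1f\n", ntk_alpha, new_base);
    return 0;
}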

After looking at the code in the main branch, I am not sure whether my modification is right; there are some other places that invoke update_rotary_base_n_scale, and I don't understand why another update_dynamic_scaling_rotary entry was added, which confuses me even more.

Nevertheless, what you posted looks like it only covers the RoPE base update. logn attention is still missing; I have tested and found that removing logn-scaling hurts the performance of the Qwen agent.

I found some commented-out code about logn scaling in your repository, but it does not look compatible with packed tensor mode. When paged attention is enabled, the qkv tensor's shape is [1, num_tokens, qkv_dim], is that right?
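
If it helps, in packed mode the per-token position can be recovered from the cumulative sequence-length offsets that accompany the packed tensor. A rough sketch under that assumption (the cu_seqlens-style naming and single-head layout are illustrative, not the plugin's actual tensors):

#include <math.h>

// packed_q: [num_tokens, head_dim] for one head; cu_seqlens: [batch + 1],
// with cu_seqlens[b]..cu_seqlens[b+1] delimiting sequence b's tokens.
void apply_logn_scaling_packed(float* packed_q, const int* cu_seqlens, int batch,
                               int head_dim, int trained_len)
{
    for (int b = 0; b < batch; ++b)
    {
        for (int t = cu_seqlens[b]; t < cu_seqlens[b + 1]; ++t)
        {
            const int pos = t - cu_seqlens[b] + 1; // 1-indexed position within sequence b
            if (pos <= trained_len)
                continue;
            const float s = logf((float) pos) / logf((float) trained_len);
            for (int d = 0; d < head_dim; ++d)
                packed_q[t * head_dim + d] *= s;
        }
    }
}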

Tlntin commented 10 months ago

It seems your C++ code may work better, more similar to the raw PyTorch code! About logn_scale, I think you may need to do the C++ implementation the same way! You need to change the plugin code: TensorRT-LLM/cpp/tensorrt_llm/plugins/gptAttentionPlugin/gptAttentionPlugin.cpp at 6cc5e177ff2fb60b1aab3b03fa0534b5181cf0f1 · NVIDIA/TensorRT-LLM (github.com)

There are two functions, enqueueContext/enqueueGeneration:

TensorRT-LLM/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp at 6cc5e177ff2fb60b1aab3b03fa0534b5181cf0f1 · NVIDIA/TensorRT-LLM (github.com)

TensorRT-LLM/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp at 6cc5e177ff2fb60b1aab3b03fa0534b5181cf0f1 · NVIDIA/TensorRT-LLM (github.com)

I think the logn scale implementation may have some difficulty.

My logn_scaling code can only be used without the GPT attention plugin, but it did not seem to work well, so I commented it out!

Lzhang-hub commented 9 months ago

> The trtllm implementation of Qwen does not support logn-scaling right now, which results in different inference results.

@handoku I have the same question, do you have any update? Thanks

handoku commented 9 months ago

> > The trtllm implementation of Qwen does not support logn-scaling right now, which results in different inference results.
>
> @handoku I have the same question, do you have any update? Thanks

Sorry, no progress yet. Maybe making the trtllm team more aware of this painful problem so they help solve it could save us.

Njuapp commented 1 month ago

@Tlntin have you ever tested TRT-LLM Qwen1 on long inputs? I found it produces empty output for inputs as long as 6K (smaller than 8K, the training length).

Tlntin commented 1 month ago

> @Tlntin have you ever tested TRT-LLM Qwen1 on long inputs? I found it produces empty output for inputs as long as 6K (smaller than 8K, the training length).

I tested; it works well. You need to make the same changes as above.

Njuapp commented 1 month ago

> > @Tlntin have you ever tested TRT-LLM Qwen1 on long inputs? I found it produces empty output for inputs as long as 6K (smaller than 8K, the training length).
>
> I tested; it works well. You need to make the same changes as above.

Did you mean using the changes in your commit?

Tlntin commented 1 month ago

> Did you mean using the changes in your commit?

Yes.