handoku opened 10 months ago
Like this, I finished it: commit link
C++ implementation of the code: link
@Tlntin Hi, thank you for the reply.
I am using trtllm release v0.7.0 and qwen 7b 1.0. I added another RotaryScalingType and modified the update_rotary_base_n_scale function directly to support qwen-style dynamic ntk, as follows:
inline __device__ float update_rotary_base_dynamic_ntk(
    const int kv_seq_len, const int max_positions, const int embed_dim, const float base, const float scale)
{
    // qwen-style dynamic ntk: derive alpha from how far the current sequence
    // exceeds the training length, then stretch the rope base accordingly
    const float ntk_alpha = exp2f(ceilf(log2f(1.f * kv_seq_len / max_positions) + 1.f)) - 1.f;
    return base * powf(ntk_alpha, embed_dim / (embed_dim - 2.f));
}
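As a sanity check of this formula: with max_positions = 8192 and kv_seq_len = 16384, log2f(16384 / 8192) + 1 = 2, so ntk_alpha = 2^2 - 1 = 3; for embed_dim = 128 and a default base of 10000, the new base is 10000 * 3^(128/126) ≈ 30527.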
inline __device__ void update_rotary_base_n_scale(float& base, float& scale, RotaryScalingType const scale_type,
    const int rot_embed_dim, const int max_positions, const int seq_len)
{
    // only update the base and/or scale if needed based on scale_type
    if (scale_type == RotaryScalingType::kDYNAMIC)
    {
        if (seq_len > max_positions)
        {
            base = update_rotary_base(seq_len, max_positions, rot_embed_dim, base, scale);
        }
        scale = 1.0f; // scale is only used in base for dynamic scaling
    }
    else if (scale_type == RotaryScalingType::kDYNAMIC_NTK_QWEN)
    {
        if (seq_len > max_positions)
        {
            base = update_rotary_base_dynamic_ntk(seq_len, max_positions, rot_embed_dim, base, scale);
        }
        scale = 1.0f; // scale is only used in base for dynamic scaling
    }
    else if (scale_type == RotaryScalingType::kLINEAR)
    {
        scale = 1.0f / scale;
    }
}
After seeing the code in the main branch, I am not sure whether my modification is right; there are some other places that invoke update_rotary_base_n_scale. Also, why was another update_dynamic_scaling_rotary entry added? That makes me even more confused.
Nevertheless, what you posted looks like it only covers the rope base update. logn attn is still missing; I have tested and found that removing logn-scaling hurts the performance of the qwen agent.
I found some commented-out code about logn scaling in your repository, but it does not look compatible with packed tensor mode. When paged attention is enabled, the qkv tensor's shape is [1, num_tokens, qkv_dim], is that right?
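For concreteness, here is a minimal sketch of what such a packed-mode kernel might look like; token_positions is a hypothetical per-token absolute-position buffer (not a real TensorRT-LLM name), and I am assuming q comes first in the fused qkv layout:

#include <cuda_runtime.h>
#include <math.h>

// Hypothetical sketch, not actual TensorRT-LLM code: multiply the q part of a
// packed qkv tensor [1, num_tokens, qkv_dim] by the qwen logn factor.
// token_positions is an assumed per-token absolute-position buffer.
__global__ void applyLognScaling(float* qkv, const int* token_positions,
    const int num_tokens, const int q_dim, const int qkv_dim, const int seq_len_trained)
{
    const int token_idx = blockIdx.x;
    if (token_idx >= num_tokens)
        return;
    const int pos = token_positions[token_idx] + 1; // 1-based position
    if (pos <= seq_len_trained)
        return; // factor is 1 inside the training window
    const float factor = logf((float) pos) / logf((float) seq_len_trained);
    float* q_row = qkv + (size_t) token_idx * qkv_dim; // assuming q comes first
    for (int i = threadIdx.x; i < q_dim; i += blockDim.x)
    {
        q_row[i] *= factor;
    }
}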
It seems your C++ code may work better; it is closer to the raw pytorch code! As for logn_scale, I think you need a C++ implementation in the same way: you need to change the plugin code. TensorRT-LLM/cpp/tensorrt_llm/plugins/gptAttentionPlugin/gptAttentionPlugin.cpp at 6cc5e177ff2fb60b1aab3b03fa0534b5181cf0f1 · NVIDIA/TensorRT-LLM (github.com)
There are two functions there, enqueueContext/enqueueGeneration.
I think a logn scale implementation may have some difficulty.
My logn_scaling code can only be used without the gpt attention plugin, but it did not seem to work well, so I commented it out!
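If someone tries the plugin route, a very rough sketch of the hook, with all variable names guessed rather than real gptAttentionPlugin members, would be launching a kernel like the one sketched above on the plugin's stream inside enqueueContext, before the attention kernel consumes qkv:

// Illustrative only; num_tokens, q_dim, qkv_dim, token_positions and
// seq_len_trained would all have to be plumbed through the plugin ourselves.
applyLognScaling<<<num_tokens, 256, 0, stream>>>(
    qkv_ptr, token_positions, num_tokens, q_dim, qkv_dim, seq_len_trained);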
The trtllm implementation of qwen does not support logn-scaling right now, which results in different inference results.
@handoku I have the same question, do you have any update? Thanks
Sorry, no progress yet. Maybe making the trtllm team more aware of this painful problem and getting their help to solve it could save us.
@Tlntin have you ever tested TRT-LLM Qwen1 on long inputs? I found it gives empty output for inputs as long as 6K (smaller than 8K, the training length).
I tested it; it works well. You need to make the same changes as above.
Did you mean using the changes in your commit?
yes
Qwen uses qwen-style dynamic ntk and logn-scaling to generate better text in the case of long-context input. The trtllm implementation of qwen does not support logn-scaling right now, which results in low-quality outputs.
I would like to provide an implementation. However, it is a little difficult for me to understand gpt_attention. My vanilla thought is multiplying the q tensor with a logn tensor before calling gpt_attention. But every seq_len_idx value of the q tensor is needed for calculating log_seq_len_trained(seq_len_idx). I don't know how to get the seq_len_idx value, especially in packed tensor mode. Would you please give some help on this? Is there a convenient way to achieve this (even in a dirty hard-coded way)?
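For reference, here is a minimal host-side sketch of the dirty way I had in mind, assuming per-sequence context lengths are available; build_logn_tensor and its arguments are illustrative, not a TensorRT-LLM API:

#include <cmath>
#include <vector>

// Build one qwen-style logn factor per packed token from the per-sequence
// context lengths, assuming sequences are packed back to back.
std::vector<float> build_logn_tensor(const std::vector<int>& context_lengths,
    const int seq_len_trained)
{
    std::vector<float> logn;
    for (const int len : context_lengths)
    {
        for (int pos = 1; pos <= len; ++pos)
        {
            // factor is 1 inside the training window, log ratio beyond it
            logn.push_back(pos > seq_len_trained
                ? std::log((float) pos) / std::log((float) seq_len_trained)
                : 1.0f);
        }
    }
    return logn; // one scale per packed token, to be multiplied into q
}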