Closed MarsJacobs closed 2 months ago
Hi! The real relative position is calculated as `group_query_position - group_key_position` during self-attention. Hence, by adding the shift to `group_query_position` only, we achieve the shift of the grouped attention area.
If we also added the shift to `group_key_position`, it would be equivalent to `group_query_position = query_position // group_size_1` and `group_key_position = key_position // group_size_1` — the shift cancels out, so there is no shift at all.
Hope this explanation helps!
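To make the cancellation concrete, here is a small sketch in plain Python (the variable names mirror those in `self_extend_patch/Llama.py`, but the constants `group_size` and `neighbor_window` are toy values, not the repo's defaults):

```python
group_size = 4        # toy value for the group size (G_1 in the paper)
neighbor_window = 4   # toy value for the neighbor-attention window (w_n)

# Shift that aligns the grouped region with the end of the neighbor window.
shift = neighbor_window - neighbor_window // group_size  # = 3 here

def relative_pos(query_position, key_position):
    # As in the patch: shift the query side only; keys are plain floor division.
    group_query_position = query_position // group_size + shift
    group_key_position = key_position // group_size
    return group_query_position - group_key_position

def relative_pos_shift_both(query_position, key_position):
    # Hypothetical variant that also shifts the key side.
    gq = query_position // group_size + shift
    gk = key_position // group_size + shift
    return gq - gk

# Shifting both sides cancels out: identical to using no shift at all.
for q in range(32):
    for k in range(q + 1):
        assert relative_pos_shift_both(q, k) == q // group_size - k // group_size
```

Since only the difference of the two position tensors enters the attention computation, any constant added to both sides vanishes, which is exactly why the shift must be applied to one side only.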
Thanks for the quick answer! That solved my question :)
Hi! I love your work and code implementation. Learned a lot. I have couple questions regarding code implementation.
https://github.com/datamllab/LongLM/blob/6e25a310a3aa9f49b0c74f9a277d40d897e97c2a/self_extend_patch/Llama.py#L294-L295
I understand that `group_query_position` is generated according to the formula shown in Figure 3 of the paper. However, I am curious why `group_key_position` is simply determined by dividing by `group_size` (without the neighbor-attention adjustment), unlike the `query` side. Could you please clarify if I am missing something here? Thank you in advance for your help.