datamllab / LongLM

[ICML'24 Spotlight] LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
https://arxiv.org/pdf/2401.01325.pdf
MIT License
549 stars · 54 forks

Questions regarding group query/key positional index #34

Closed MarsJacobs closed 2 months ago

MarsJacobs commented 2 months ago

Hi! I love your work and code implementation. I learned a lot. I have a couple of questions about the code implementation.

https://github.com/datamllab/LongLM/blob/6e25a310a3aa9f49b0c74f9a277d40d897e97c2a/self_extend_patch/Llama.py#L294-L295

I understand that `group_query_position` is generated according to the formula shown in Figure 3 of the paper. However, I'm curious why `group_key_position` is simply determined by dividing by the group size (without the neighbor-attention shift), unlike the query. Could you clarify if I'm missing something here?

Thank you in advance for your help.

Mooler0410 commented 2 months ago

Hi! The real relative position is calculated as `group_query_position - group_key_position` during self-attention. Hence, by adding the shift to `group_query_position` alone, we shift the relative positions used in the grouped-attention area.

If we also added the shift to `group_key_position`, it would cancel out in the subtraction, and the scheme would be equivalent to:

```
group_query_position = query_position // group_size_1
group_key_position = key_position // group_size_1
```

i.e., there would be no shift at all.

Hope this explanation can help
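The explanation above can be checked numerically. Here is a minimal sketch with toy sizes (illustrative values, not the ones used in the repo): the shift on the query side makes the grouped relative positions continue right where the neighbor window's ordinary relative positions end, instead of colliding with them.

```python
# Toy sketch of the two position-indexing schemes discussed above.
# The sizes below are illustrative, not the values used in the repo.
group_size = 4        # grouped-attention group size (group_size_1)
neighbor_window = 8   # neighbor-attention window (group_size_2)
seq_len = 16

# Keys: plain floor division by the group size.
group_key_position = [k // group_size for k in range(seq_len)]

# Queries: same floor division, plus a constant shift so the relative
# position stays continuous at the neighbor-window boundary.
shift = neighbor_window - neighbor_window // group_size
group_query_position = [q // group_size + shift for q in range(seq_len)]

# Relative positions seen by the last query (q - k), as in self-attention.
q = seq_len - 1
grouped_rel = [group_query_position[q] - group_key_position[k] for k in range(seq_len)]
normal_rel = [q - k for k in range(seq_len)]

# Keys inside the neighbor window (8..15) use the normal relative
# positions 7..0; keys outside it (0..7) use the grouped ones, which
# start at 8 — exactly one past the largest neighbor value, no overlap.
print("grouped:", grouped_rel)
print("normal :", normal_rel)
```

Dropping `shift` (setting it to 0) would make the grouped relative positions for keys 0..7 fall in 1..3, colliding with the neighbor window's 0..7 range, which is why the shift is needed on exactly one side of the subtraction.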

MarsJacobs commented 2 months ago

Thanks for the quick answer! That solved my question :)