abertsch72 / unlimiformer

Public repo for the NeurIPS 2023 paper "Unlimiformer: Long-Range Transformers with Unlimited Length Input"
MIT License

Relative positions in RoPE embeddings #46

Open AshwinRamachandran2002 opened 11 months ago

AshwinRamachandran2002 commented 11 months ago

Hi, I was going through your code to understand how you calculate the RoPE embeddings, and I need a clarification.

In assigning a relative position to a newly generated token, the base reference is taken as the end of the prompt input https://github.com/abertsch72/unlimiformer/blob/232fc235706c304667f7a671cca2203d4625eaa1/src/unlimiformer.py#L1084C10-L1084C10

In assigning a relative position to the retrieved key indices, the base reference is taken as the start of the prompt input https://github.com/abertsch72/unlimiformer/blob/232fc235706c304667f7a671cca2203d4625eaa1/src/unlimiformer.py#L1123

Then would it not be the case that the current hidden state gives more attention to the tokens somewhere in the middle of the prompt and then decays both to the right and left?

Thank you, Ashwin Ramachandran

urialon commented 11 months ago

Hi @AshwinRamachandran2002 , Thank you for your interest in our work!

Your reading is correct, and you are looking at the right places in the code:

  1. In assigning a position for the query, the position we give it is "the number of generated tokens so far".
  2. In assigning a position for a retrieved key, the position we give it is "its relative position in the initial long prompt".

These are the settings that we found to work best in our initial experiments. I agree that they may not be optimal. But I can't say whether "the current hidden state gives more attention to the tokens somewhere in the middle of the prompt and then decays both to the right and left" - it's a hypothesis that is worth checking, and possibly fixing (and writing a paper about, if you manage to do that :-) )
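To make the two rules above concrete, here is a minimal sketch (not the repository's code; `rotate`, `key_positions`, and the toy dimensions are all illustrative) of the position scheme described: the query is rotated by "number of generated tokens so far", while each retrieved key is rotated by its original position in the long prompt.

```python
import numpy as np

def rotate(x: np.ndarray, pos: int) -> np.ndarray:
    """Apply a RoPE rotation for position `pos` to a vector of even dimension.

    Each consecutive pair (x[2i], x[2i+1]) is rotated by pos * theta_i,
    with the standard inverse-frequency schedule theta_i = 10000^(-2i/d).
    """
    d = x.shape[-1]
    inv_freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))  # one frequency per pair
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin  # 2-D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

d = 8
rng = np.random.default_rng(0)
query = rng.standard_normal(d)
keys = rng.standard_normal((5, d))     # stand-ins for keys retrieved from the datastore
key_positions = [3, 17, 42, 100, 512]  # rule 2: each key's position in the original long prompt

num_generated = 4                      # rule 1: the query's position is the generation step count
q_rot = rotate(query, num_generated)
k_rot = np.stack([rotate(k, p) for k, p in zip(keys, key_positions)])

scores = k_rot @ q_rot                 # attention logits over the retrieved keys
```

Because RoPE attention depends only on the angle difference, this scheme makes the score between the query and a key a function of `key_position - num_generated`, which is what the question about a "peak in the middle of the prompt" is probing.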

Please let us know if you have any questions! Uri

AshwinRamachandran2002 commented 11 months ago

Thank you for your reply. I would also like to know how you decided upon the vectorstore query.

https://github.com/abertsch72/unlimiformer/blob/232fc235706c304667f7a671cca2203d4625eaa1/src/unlimiformer.py#L1098 You have used an approximation to R(m) * W_k, namely W_k + Rotated(W_k).

Did you also consider dropping R(m)?
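For reference, the "W_k + Rotated(W_k)" form being asked about matches the standard RoPE identity, where the rotated vector decomposes into the original vector and a 90-degree-rotated copy: R(m)x = x * cos(m*theta) + rotate_half(x) * sin(m*theta). Below is a minimal sketch of that identity (`rope` and `rotate_half` are illustrative names, not the repository's functions):

```python
import numpy as np

def rotate_half(x: np.ndarray) -> np.ndarray:
    """Rotate each consecutive pair (x1, x2) by 90 degrees: (x1, x2) -> (-x2, x1)."""
    out = np.empty_like(x)
    out[0::2] = -x[1::2]
    out[1::2] = x[0::2]
    return out

def rope(x: np.ndarray, pos: int) -> np.ndarray:
    """R(pos) applied to x via the decomposition x*cos + rotate_half(x)*sin."""
    d = x.shape[-1]
    inv_freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    angles = np.repeat(pos * inv_freq, 2)  # each pair shares one angle
    return x * np.cos(angles) + rotate_half(x) * np.sin(angles)
```

Dropping R(m) entirely would amount to querying with the unrotated W_k * h, i.e. treating every stored key as if it were at position 0.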