is P_k wrong in the code release?

HKUNLP / ChunkLlama

[ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models"

Apache License 2.0

341 stars 18 forks source link

is P_k wrong in the code release? #4

Closed DavideHe closed 6 months ago

DavideHe commented 7 months ago

https://github.com/HKUNLP/ChunkLlama/blob/main/chunkllama_attn_replace.py#L133 In the paper ,Pk = [0, 1, . . . , l − 1] mod s But in the code ,

key_states = apply_rotary_pos_emb(key_states, k_cos, k_sin, position_ids)
position_ids = position_ids % chunk_len   ## mod after key_states pos_emb .

If I understand wrong，Please point out my mistake

ChenxinAn-fdu commented 7 months ago

position ids is [0,1,2...l-1] and chunk_len is s. So the new position ids for keys are position_ids % chunk_len. Please feel free to discuss with us if we misunderstand your question.

DavideHe commented 6 months ago

position ids is [0,1,2...l-1] and chunk_len is s. So the new position ids for keys are position_ids % chunk_len. Please feel free to discuss with us if we misunderstand your question.

but In the paper , Formula 2 show the Pintra = Pk = [0, 1, . . . , l − 1] mod s. In the code , key_states = apply_rotary_pos_emb(key_states, k_cos, k_sin, position_ids) . the position_ids is not moded yet before sending to apply_rotary_pos_emb. In my thought, Pk will not change with diffrent chunks and keep 0 - s Periodically，Pq will change in diffrent chunks.

ChenxinAn-fdu commented 6 months ago

Your understanding is right! So I guess you must forget to read this line.

DavideHe commented 6 months ago

I mean that key_states = apply_rotary_pos_emb(key_states, k_cos, k_sin, position_ids % chunk_len)

ChenxinAn-fdu commented 6 months ago

Sure! This will have the effect, but you have to re-initialized k_t. In our implementation, this line have already %chunk_len in the RoPE embedding part.

DavideHe commented 6 months ago

Sure! This will have the effect, but you have to re-initialized k_t. In our implementation, this line have already %chunk_len in the RoPE embedding part.

I understand, thank yout replying. In my thought, I don't want to initialize dobule cos sin. I overlook the self.rotary_emb becuase of no q , k sending in it .