CSEEduanyu closed this issue 6 months ago
@leeyeehoo
```python
attn_output = self._flash_attention_forward(
    query_states,
    key_states,
    value_states,
    attention_mask,
    q_len,
    dropout=dropout_rate,
    use_sliding_windows=use_sliding_windows,
)
```
Here `query_states`, `key_states`, and `value_states` are just references to the q/k/v tensors. How do I get the actual k and v so that their length is `max_cache_length` rather than `q_len`?
Only the KV cache is compressed. Doesn't that make the sizes of Q and K inconsistent when attention is computed?
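For what it's worth, attention itself does not require the query and key lengths to match: the query can attend over a KV sequence of a different (compressed) length, and the output keeps the query length. Below is a minimal sketch using PyTorch's `scaled_dot_product_attention` with made-up shapes (`kv_len` standing in for a compressed `max_cache_length`); it is an illustration of the shape rule, not the repo's actual `_flash_attention_forward`:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes for illustration only; kv_len plays the role of a
# compressed cache length (e.g. max_cache_length), q_len is the full query length.
batch, num_heads, head_dim = 1, 8, 64
q_len, kv_len = 1024, 256

query_states = torch.randn(batch, num_heads, q_len, head_dim)
key_states   = torch.randn(batch, num_heads, kv_len, head_dim)
value_states = torch.randn(batch, num_heads, kv_len, head_dim)

# Q attends over the shorter (compressed) KV sequence; the score matrix is
# (q_len, kv_len), so differing lengths are fine as long as head_dim matches.
attn_output = F.scaled_dot_product_attention(query_states, key_states, value_states)

print(attn_output.shape)  # torch.Size([1, 8, 1024, 64]) -- output keeps q_len
```

So having K/V shorter than Q is not a shape inconsistency; each query position simply attends over the compressed set of keys and values.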