CSEEduanyu closed this issue 6 months ago
@leeyeehoo
```python
attn_output = self._flash_attention_forward(
    query_states,
    key_states,
    value_states,
    attention_mask,
    q_len,
    dropout=dropout_rate,
    use_sliding_windows=use_sliding_windows,
)
```
Here `query_states`, `key_states`, and `value_states` are just references to the q/k/v tensors. How do I get the actual k and v so that their length is `max_cache_length` rather than `q_len`?
Only the KV cache is compressed. Doesn't that make the sizes of Q and K inconsistent when attention is computed?
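For what it's worth, attention itself does not require the query and key lengths to match: the query can attend over a KV sequence of a different (compressed) length, and the output keeps the query length. Below is a minimal sketch using PyTorch's `scaled_dot_product_attention` with made-up shapes (`kv_len` standing in for a compressed `max_cache_length`); it is an illustration of the shape rule, not the repo's actual `_flash_attention_forward`:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes for illustration only; kv_len plays the role of a
# compressed cache length (e.g. max_cache_length), q_len is the full query length.
batch, num_heads, head_dim = 1, 8, 64
q_len, kv_len = 1024, 256

query_states = torch.randn(batch, num_heads, q_len, head_dim)
key_states   = torch.randn(batch, num_heads, kv_len, head_dim)
value_states = torch.randn(batch, num_heads, kv_len, head_dim)

# Q attends over the shorter (compressed) KV sequence; the score matrix is
# (q_len, kv_len), so differing lengths are fine as long as head_dim matches.
attn_output = F.scaled_dot_product_attention(query_states, key_states, value_states)

print(attn_output.shape)  # torch.Size([1, 8, 1024, 64]) -- output keeps q_len
```

So having K/V shorter than Q is not a shape inconsistency; each query position simply attends over the compressed set of keys and values.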