Modified the original candle implementation of the Llama model to introduce paged attention. PagedAttention replaces the standard attention mechanism, improving how the model manages attention operations and the KV cache. The CausalSelfAttention struct now uses PagedAttention together with precomputed rotary-embedding tensors, streamlining the forward pass and improving overall performance.
The Cache struct is optimized to hold precomputed cosine and sine tensors for rotary embeddings, reducing computational overhead during attention calculations. Aims to resolve https://github.com/atoma-network/atoma-paged-attention/issues/2.