atoma-network / atoma-paged-attention

Paged attention CUDA kernels for the Atoma protocol

feat: Integrate Llama CUDA kernels with Candle #6

Closed fishonamos closed 2 weeks ago

fishonamos commented 2 weeks ago

Modified the original Candle implementation of the Llama model to introduce paged attention. PagedAttention replaces the standard attention mechanism, so the KV cache can be managed in fixed-size blocks instead of one contiguous buffer. The CausalSelfAttention struct now uses PagedAttention together with precomputed rotary-embedding tensors, streamlining the forward pass.
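
A minimal sketch of that wiring, under stated assumptions: `PagedAttention` here is a placeholder for the CUDA-backed wrapper this PR introduces, and its `forward` signature, the `apply_rope` helper, and the field layout of `CausalSelfAttention` are illustrative, not this repo's exact API.

```rust
use candle_core::{Result, Tensor};

// Placeholder for the CUDA-backed paged-attention wrapper added by this PR;
// the real type and its `forward` signature live in this repo and may differ.
struct PagedAttention;

impl PagedAttention {
    fn forward(&self, _q: &Tensor, _k: &Tensor, _v: &Tensor) -> Result<Tensor> {
        // Dispatches to the paged-attention CUDA kernel, which gathers K/V
        // from fixed-size cache blocks instead of one contiguous KV tensor.
        unimplemented!()
    }
}

// Rotate-half rotary embedding using precomputed tables.
// x: (batch, heads, seq, head_dim); cos/sin: (seq, head_dim / 2).
fn apply_rope(x: &Tensor, cos: &Tensor, sin: &Tensor) -> Result<Tensor> {
    let (_b, _h, _t, d) = x.dims4()?;
    let x1 = x.narrow(3, 0, d / 2)?;
    let x2 = x.narrow(3, d / 2, d / 2)?;
    let r1 = (x1.broadcast_mul(cos)? - x2.broadcast_mul(sin)?)?;
    let r2 = (x2.broadcast_mul(cos)? + x1.broadcast_mul(sin)?)?;
    Tensor::cat(&[&r1, &r2], 3)
}

struct CausalSelfAttention {
    // q/k/v/o projection layers elided for brevity.
    attn: PagedAttention, // replaces the standard softmax(QK^T/sqrt(d))·V path
}

impl CausalSelfAttention {
    // `cos`/`sin` are slices of the tables precomputed in `Cache` (see the
    // next sketch), so the forward pass performs no trigonometry itself.
    fn forward(
        &self,
        q: &Tensor,
        k: &Tensor,
        v: &Tensor,
        cos: &Tensor,
        sin: &Tensor,
    ) -> Result<Tensor> {
        let q = apply_rope(q, cos, sin)?;
        let k = apply_rope(k, cos, sin)?;
        self.attn.forward(&q, &k, v)
    }
}
```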

The Cache struct now holds precomputed cosine and sine tensors for the rotary embeddings, so they are built once at initialization instead of being recomputed on every attention call. Aims to resolve #2.
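
A minimal sketch of that precomputation, loosely following the pattern in candle-transformers' Llama cache; the names `Cache`, `MAX_SEQ_LEN`, `head_dim`, and `rope_theta` are assumptions, not necessarily this PR's exact identifiers.

```rust
use candle_core::{DType, Device, Result, Tensor};

const MAX_SEQ_LEN: usize = 4096; // assumed context window

struct Cache {
    cos: Tensor, // (MAX_SEQ_LEN, head_dim / 2)
    sin: Tensor, // (MAX_SEQ_LEN, head_dim / 2)
}

impl Cache {
    fn new(head_dim: usize, rope_theta: f32, device: &Device) -> Result<Self> {
        // Inverse frequencies theta^(-2i/d), one per dimension pair.
        let inv_freq: Vec<f32> = (0..head_dim)
            .step_by(2)
            .map(|i| 1f32 / rope_theta.powf(i as f32 / head_dim as f32))
            .collect();
        let inv_freq = Tensor::new(inv_freq.as_slice(), device)?;
        // Outer product of positions and frequencies: (MAX_SEQ_LEN, head_dim/2).
        let freqs = Tensor::arange(0u32, MAX_SEQ_LEN as u32, device)?
            .to_dtype(DType::F32)?
            .reshape((MAX_SEQ_LEN, 1))?
            .matmul(&inv_freq.reshape((1, head_dim / 2))?)?;
        // cos/sin are computed once here; each forward pass only slices
        // out the rows for the positions it needs.
        Ok(Self {
            cos: freqs.cos()?,
            sin: freqs.sin()?,
        })
    }
}
```

At decode step `index_pos` with `seq_len` new tokens, the attention layer would take `cache.cos.narrow(0, index_pos, seq_len)?` (and likewise for `sin`) and pass the slices into `apply_rope`, which is where the per-call overhead savings come from.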