jorgeantonio21 opened 3 weeks ago
In order to further guide the work on this issue, we suggest the following:
Hey @jorgeantonio21! Shall I take up this issue? I will follow every step as shown above
@RajeshRk18 please feel free to tackle this one. Also, if any questions or issues arise while working on it, feel free to ping me.
Hello, I would love to work on this too.
Hi @jorgeantonio21. I would love to work on this.
Please assign it to me if it's available.
Hi @jorgeantonio21 ... I'd like to work on this
Following the PagedAttention paper, add CUDA kernels for the Llama model. CUDA kernels for the Llama architecture have been widely implemented in the open source community.
Moreover, paged attention kernels can be found in the original vLLM implementation here. Of special interest for us is the CUDA implementation in attention_kernels.cu, which implements the attention logic on top of the paging algorithm described in the reference paper above.
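To make the core idea concrete before diving into attention_kernels.cu: in paged attention, the KV cache is split into fixed-size blocks, and a per-sequence block table maps logical token positions to physical blocks, so keys and values for one sequence need not be contiguous in memory. Below is a minimal single-head Python sketch of that indirection; this is NOT the vLLM code, and `BLOCK_SIZE`, the dict-based caches, and all function names are illustrative assumptions:

```python
import math

BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative; vLLM uses e.g. 16)

def paged_attention(query, k_cache, v_cache, block_table, seq_len):
    """Single-head attention over a paged KV cache.

    k_cache / v_cache: dict mapping physical block id -> list of
        BLOCK_SIZE key/value vectors.
    block_table: dict mapping logical block index -> physical block id.
    """
    scale = 1.0 / math.sqrt(len(query))
    # Pass 1: scaled dot-product scores, fetching each key via the block table.
    scores = []
    for pos in range(seq_len):
        block = block_table[pos // BLOCK_SIZE]  # logical -> physical block
        key = k_cache[block][pos % BLOCK_SIZE]  # offset within the block
        scores.append(scale * sum(q * k for q, k in zip(query, key)))
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    # Pass 2: weighted sum of values, again resolved through the block table.
    out = [0.0] * len(query)
    for pos, e in zip(range(seq_len), exps):
        block = block_table[pos // BLOCK_SIZE]
        value = v_cache[block][pos % BLOCK_SIZE]
        w = e / denom
        for d in range(len(out)):
            out[d] += w * value[d]
    return out
```

The CUDA kernel in attention_kernels.cu follows the same shape: each thread block handles one (sequence, head) pair and resolves every key/value read through the block table, which is what lets sequences grow without reallocating a contiguous cache.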
NOTE: Issue 1 must be completed before progressing to issues 2, 3, and 4.