exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚

[BOUNTY - $100] Support Llama 3.2 1B on tinygrad #378

Closed. AlexCheema closed this issue 4 days ago

AlexCheema commented 3 weeks ago
Sanchay-T commented 3 weeks ago

Hi @AlexCheema,

I’d love to work on adding support for Llama 3.2 1B in tinygrad.

Thanks! Sanchay

AlexCheema commented 3 weeks ago

Go for it!

Sanchay-T commented 3 weeks ago

Hey @AlexCheema! Thanks for assigning me this issue. I'm new to this codebase and trying to understand how everything works before diving into implementing Llama 3.2 support for tinygrad.

I've spent some time reading through the code, and here's what I understand so far (please correct me if I'm wrong anywhere!):

When someone sends a message in the chat, it starts from the frontend in index.html, where Alpine.js handles the UI:

async processMessage(value) {
    const response = await fetch("/v1/chat/completions", {
      method: "POST",
      body: JSON.stringify({
        model: this.cstate.selectedModel,  // This is where we specify Llama 3.2
        messages: this.cstate.messages,
        stream: true,
      }),
    });
    // ... (rest of the handler reads the streamed response and updates the UI)
}

This message then goes through the router in the backend:

@router.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    model_name = body.get("model", "llama-3.1-8b")

From there, it gets sent to either the MLX or tinygrad implementation. I see that MLX already supports Llama 3.2 1B, but tinygrad needs to be updated. Looking at llama.py in tinygrad, I think the main changes needed are in the RoPE (Rotary Position Embedding) implementation:

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0, dtype=dtypes.half):
    freqs = 1.0/(theta**(Tensor.arange(0, dim, 2)[:(dim // 2)]/dim))
    # This part might need updating for 3.2
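
For context, the 3.2 models (like 3.1) apply an extra frequency-scaling step on top of these base RoPE frequencies, controlled by the rope_scaling block in the checkpoint's config.json. Below is a minimal sketch of that scaling in plain Python, adapted from Meta's published reference code; the default values follow the Llama 3.2 1B config (factor 32.0, low/high frequency factors 1.0/4.0, original context 8192), but they should be read from the downloaded config rather than hard-coded, and the function name apply_llama3_scaling is just for illustration.

import math

# Llama 3.1/3.2-style RoPE frequency scaling ("rope_type": "llama3").
# Defaults follow the Llama 3.2 1B config.json; in practice, read them
# from the checkpoint's config instead of hard-coding them.
def apply_llama3_scaling(freqs, factor=32.0, low_freq_factor=1.0,
                         high_freq_factor=4.0, original_max_pos=8192):
    low_freq_wavelen = original_max_pos / low_freq_factor
    high_freq_wavelen = original_max_pos / high_freq_factor
    scaled = []
    for freq in freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            # High-frequency band: leave untouched
            scaled.append(freq)
        elif wavelen > low_freq_wavelen:
            # Low-frequency band: scale down by the full factor
            scaled.append(freq / factor)
        else:
            # Smoothly interpolate between the two regimes
            smooth = (original_max_pos / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
            scaled.append((1 - smooth) * freq / factor + smooth * freq)
    return scaled

# Example: base frequencies for head_dim 64 and theta 500000 (3.2 1B values)
dim, theta = 64, 500000.0
base_freqs = [1.0 / (theta ** (i / dim)) for i in range(0, dim, 2)]
scaled_freqs = apply_llama3_scaling(base_freqs)

The tinygrad version would presumably do the same element-wise math on the Tensor produced inside precompute_freqs_cis, before the frequencies are combined with the position indices.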

I'm thinking of approaching this implementation in the following way:

First, I'd like to try running the current tinygrad implementation with Llama 3.2 weights to see what actually breaks. This might give us a clearer picture of what needs to change.
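As a low-effort complement to that, diffing the two configs shows most of what has to change on the tinygrad side before even loading weights. Here is a small sketch using huggingface_hub; the repo ids below are non-gated mirrors and are assumptions, and any repo exposing config.json works, including the official meta-llama ones once you have a token:

import json
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Illustrative repo ids; swap in whichever mirrors you actually use.
REPOS = {
    "llama-3.1-8b": "unsloth/Meta-Llama-3.1-8B-Instruct",
    "llama-3.2-1b": "unsloth/Llama-3.2-1B-Instruct",
}
# Fields that feed directly into the model definition in llama.py
FIELDS = [
    "num_hidden_layers", "num_attention_heads", "num_key_value_heads",
    "hidden_size", "intermediate_size", "rope_theta", "rope_scaling",
    "tie_word_embeddings", "vocab_size",
]

for name, repo in REPOS.items():
    with open(hf_hub_download(repo, "config.json")) as fp:
        cfg = json.load(fp)
    print(name, {k: cfg.get(k) for k in FIELDS})

Two differences worth expecting from the published 3.2 1B config are the rope_scaling block (factor 32.0, versus 8.0 for 3.1) and tie_word_embeddings, since the 1B model reuses the embedding matrix as the output projection.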

Then, based on what I've seen in the MLX implementation, we might need to update a few things, most likely starting with the RoPE frequency computation shown above.

I'm still learning about these concepts, so I'd really appreciate any guidance on whether this approach makes sense. I also noticed that the MLX implementation handles some things differently; should I be looking at that as a reference for the changes?

Thanks for any help you can provide! I'm excited to learn and contribute to this project.

maujim commented 2 weeks ago

Any update on this @Sanchay-T?

I tried working on it a bit, but I'm not sure which set of weights to download. I tried a few, and https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-bnb-4bit seemed the most promising, but the state_dict there has some extra keys.
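
One quick way to see what those extra keys are is to list the tensors in the checkpoint; bnb-4bit exports carry bitsandbytes quantization state (absmax, quant_map, quant_state entries) alongside the regular weights, which a plain float checkpoint won't have. A rough sketch, where the weight filename is an assumption (check the repo's file list if it differs):

from huggingface_hub import hf_hub_download  # pip install huggingface_hub safetensors
from safetensors import safe_open

# Filename is a guess for a single-file checkpoint; adjust if the repo shards it.
path = hf_hub_download("unsloth/Llama-3.2-1B-Instruct-bnb-4bit", "model.safetensors")

with safe_open(path, framework="np") as f:
    keys = list(f.keys())

# Anything quantization-related is a key a plain Llama state_dict wouldn't have.
extra = [k for k in keys if "quant" in k or "absmax" in k]
print(f"{len(keys)} tensors total, {len(extra)} look bitsandbytes-specific")
for k in extra[:10]:
    print(" ", k)

For exo's purposes it is probably simpler to point at an unquantized checkpoint than to teach the tinygrad loader to dequantize bnb tensors.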

Sanchay-T commented 2 weeks ago

You can also download the weights from the official link.

maujim commented 2 weeks ago

Yes, I'll do that too. But that HF link isn't feasible to put in models.py, since the weights need to be downloaded automatically and the meta repo requires authentication.

Sanchay-T commented 2 weeks ago

Yeah, you'll get the authentication approval within minutes.

maujim commented 2 weeks ago

Yeah, I got it pretty quickly; it honestly surprised me how fast it was.

When I said the HF link isn't feasible to put in models.py, I meant that you can't expect every exo user to have auth to pull the official weights, so we'd need to link a different set of weights for it to work out of the box.

But it will be useful for testing for now; I'll probably try it tomorrow.
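
To make that concrete, the idea would be a models.py entry that points at a repo anyone can pull without a token. Purely as an illustration, assuming models.py maps a model id to per-engine Shard definitions the way the existing llama-3.1 entries do; the import path, engine key, repo id, and layer count below are all assumptions to verify against the actual file (the 3.2 1B config lists 16 hidden layers):

from exo.inference.shard import Shard  # assumed location of the Shard dataclass

llama_3_2_1b = {
    "TinygradDynamicShardInferenceEngine": Shard(
        model_id="unsloth/Llama-3.2-1B-Instruct",  # ungated mirror (assumption);
                                                   # the meta-llama repo needs an HF token
        start_layer=0,
        end_layer=0,
        n_layers=16,  # Llama 3.2 1B transformer layer count
    ),
}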

Sanchay-T commented 2 weeks ago

Exactly!

Let's do this together if you're up for it!

Here you go: sanchay.me

maujim commented 2 weeks ago

Sure, I sent you a connection request on LinkedIn.