Hi @AlexCheema,
I’d love to work on adding support for Llama 3.2 1B in tinygrad.
Thanks! Sanchay
Go for it!
Hey @AlexCheema! Thanks for assigning me this issue. I'm new to this codebase and am trying to understand how everything works before diving into implementing Llama 3.2 support for tinygrad.
I've spent some time reading through the code, and here's what I understand so far (please correct me if I'm wrong anywhere!):
When someone sends a message in the chat, the request starts in the frontend, in index.html, where Alpine.js handles the UI:
```js
async processMessage(value) {
  const response = await fetch("/v1/chat/completions", {
    method: "POST",
    body: JSON.stringify({
      model: this.cstate.selectedModel, // this is where we specify Llama 3.2
      messages: this.cstate.messages,
      stream: true,
    }),
  });
  // ...the streamed response is then read and rendered into the chat
}
```
This message then goes through the router in the backend:
```python
@router.post("/v1/chat/completions")
async def chat_completions(request: Request):
  body = await request.json()
  model_name = body.get("model", "llama-3.1-8b")
```
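From there, the request is handed off to either the MLX or the tinygrad inference engine. To check my mental model of that hand-off, here's a toy sketch; the dispatch table and function are names I made up for illustration, not exo's actual code:

```python
# Illustrative only: a made-up mapping from model names to inference engines;
# exo's real routing may look quite different.
ENGINE_FOR_MODEL = {
  "llama-3.1-8b": "mlx",       # already supported by the MLX engine
  "llama-3.2-1b": "tinygrad",  # the support this issue would add
}

def pick_engine(model_name: str) -> str:
  # Assumption: fall back to tinygrad for models MLX doesn't cover.
  return ENGINE_FOR_MODEL.get(model_name, "tinygrad")
```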
I see that MLX already supports Llama 3.2 1B, but tinygrad needs to be updated. Looking at llama.py in tinygrad, I think the main changes needed are in the RoPE (Rotary Position Embedding) implementation:
```python
def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0, dtype=dtypes.half):
  freqs = 1.0 / (theta ** (Tensor.arange(0, dim, 2)[:(dim // 2)] / dim))
  # this part might need updating for 3.2
```
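As far as I can tell, Llama 3.1/3.2 apply an extra frequency-scaling step on top of these base freqs (the rope_scaling block in the HF config). Here's a plain-Python sketch of that step, adapted from Meta's reference code; for 3.2 1B the scale factor is 32.0 according to its config.json (vs. 8.0 for 3.1), which is worth double-checking:

```python
import math

def apply_llama3_rope_scaling(freqs, scale_factor=32.0, low_freq_factor=1.0,
                              high_freq_factor=4.0, old_context_len=8192):
  # Scales the RoPE base frequencies the way Llama 3.1/3.2 do. `freqs` is a
  # plain list of floats (1 / theta**(2i/dim)); the defaults mirror the
  # rope_scaling block in Llama-3.2-1B's config.json (verify before relying on them).
  low_freq_wavelen = old_context_len / low_freq_factor
  high_freq_wavelen = old_context_len / high_freq_factor
  out = []
  for freq in freqs:
    wavelen = 2 * math.pi / freq
    if wavelen < high_freq_wavelen:
      out.append(freq)                 # high-frequency terms stay unchanged
    elif wavelen > low_freq_wavelen:
      out.append(freq / scale_factor)  # low-frequency terms are fully scaled
    else:
      # smooth interpolation between the two regimes
      smooth = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
      out.append((1 - smooth) * freq / scale_factor + smooth * freq)
  return out
```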
I'm thinking of approaching this implementation in the following way:
First, I'd like to try running the current tinygrad implementation with Llama 3.2 weights to see what actually breaks. This might give us a clearer picture of what needs to change.
Then, based on what I've seen in the MLX implementation, we might need to update a few things, mainly the RoPE frequency scaling above and the model's configuration values.
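For concreteness, these are the hyperparameters I'd expect for Llama 3.2 1B based on its published config.json; the exact numbers (and especially the tied-embeddings flag) should be verified against the downloaded weights:

```python
# Expected Llama-3.2-1B hyperparameters, transcribed from the model's
# config.json on Hugging Face; verify against the actual checkpoint.
LLAMA_3_2_1B = {
  "dim": 2048,                  # hidden_size
  "n_layers": 16,               # num_hidden_layers
  "n_heads": 32,                # num_attention_heads
  "n_kv_heads": 8,              # num_key_value_heads (grouped-query attention)
  "hidden_dim": 8192,           # intermediate_size of the FFN
  "vocab_size": 128256,
  "rope_theta": 500000.0,
  "rope_scaling_factor": 32.0,  # vs. 8.0 for Llama 3.1
  "tie_word_embeddings": True,  # 1B/3B reuse the embedding matrix as the output head
}
```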
I'm still learning about these concepts, so I'd really appreciate any guidance on whether this approach makes sense. Also, I noticed the MLX implementation handles some things differently; should I be using it as a reference for the changes?
Thanks for any help you can provide! I'm excited to learn and contribute to this project.
Any update on this @Sanchay-T?
I tried working on it a bit, but I'm not sure which set of weights to download. I tried a few, and https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-bnb-4bit seemed the most promising, but the state_dict there has some extra keys.
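For reference, a quick way to see the extra keys is to list the tensor names in the checkpoint (the filename here is a placeholder for whatever the download produces); I suspect they're bitsandbytes quantization state, since that repo is 4-bit quantized:

```python
from safetensors import safe_open

# List tensor names in a downloaded checkpoint to spot unexpected keys.
# "model.safetensors" is a placeholder path; the bnb-4bit repo likely stores
# bitsandbytes quantization state alongside the weights, hence the extra keys.
with safe_open("model.safetensors", framework="np") as f:
  for name in f.keys():
    print(name)
```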
You can download the weights from the official link as well
Yes, will do that too. But that HF link is not feasible for putting in models.py, since the weights need to be downloaded automatically and the Meta repo requires authentication.
Ya, you'll get the authentication approval in minutes
Yeah got it pretty quick, surprised me how fast it was tbh
> that HF link is not feasible for putting in models.py
I meant this because you can't expect every exo user to have auth to pull the official weights, so we'd need to link a different set of weights so it works out of the box
But it will be useful for testing for now, will try tomorrow probably
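Concretely, the kind of entry I have in mind for models.py would point at a non-gated mirror; the dict shape and the unsloth repo are my assumptions, not the file's actual schema:

```python
# Hypothetical models.py entry (exo's real schema may differ). The point is
# to reference a non-gated mirror so the download works without Meta's
# gating approval; "unsloth/Llama-3.2-1B-Instruct" is an assumed choice.
MODELS = {
  "llama-3.2-1b": {
    "layers": 16,                             # Llama 3.2 1B has 16 layers
    "repo": "unsloth/Llama-3.2-1B-Instruct",  # full-precision, non-gated
  },
}
```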
Exactly!
Let's do this together if you're up for it!
Here you go: sanchay.me
Sure, I sent you a connection request on LinkedIn