exo-explore / exo


Add Llama.cpp Support #183

bayedieng opened this pull request 2 weeks ago (status: Open)

bayedieng commented 2 weeks ago

This PR adds support for Llama.cpp and closes #167.

AlexCheema commented 1 week ago

Hey @bayedieng just checking in. Anything I can help with to move this along?

bayedieng commented 1 week ago

Hey @AlexCheema, I was indeed having trouble understanding the codebase initially, but it's clearer now (inheritance can be confusing). I've written a basic sharded inference engine class and will proceed with the implementation.
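For reference, here's a rough idea of the shape I have in mind, a minimal sketch only. The class name, the import paths, and the method names/signatures (`infer_prompt`, `infer_tensor`, `ensure_shard`) are assumptions modeled on the other engines and may not match exo's actual base class; only the `llama_cpp` calls are from llama-cpp-python itself.

```python
from typing import Optional, Tuple

import numpy as np
from llama_cpp import Llama  # llama-cpp-python bindings

# Assumed import paths; the real ones may differ in the exo codebase.
from exo.inference.inference_engine import InferenceEngine
from exo.inference.shard import Shard


class LlamaCppDynamicShardInferenceEngine(InferenceEngine):
    """Hypothetical skeleton; method names mirror the other engines."""

    def __init__(self):
        self.shard: Optional[Shard] = None
        self.llm: Optional[Llama] = None

    async def ensure_shard(self, shard: Shard):
        # Lazily load the GGUF weights for the shard this node is responsible for.
        if self.shard == shard:
            return
        # Model-path resolution / downloading is left out of this sketch.
        self.llm = Llama(model_path=f"models/{shard.model_id}.gguf", n_ctx=4096)
        self.shard = shard

    async def infer_prompt(self, request_id: str, shard: Shard, prompt: str,
                           inference_state: Optional[str] = None) -> Tuple[np.ndarray, str, bool]:
        await self.ensure_shard(shard)
        # llama.cpp ties tokenization to the loaded model, so it happens here
        # rather than in a standalone Tokenizer object (see below).
        tokens = self.llm.tokenize(prompt.encode("utf-8"))
        # TODO: run only this shard's layers and hand the hidden state to the next node.
        raise NotImplementedError

    async def infer_tensor(self, request_id: str, shard: Shard, input_data: np.ndarray,
                           inference_state: Optional[str] = None) -> Tuple[np.ndarray, str, bool]:
        await self.ensure_shard(shard)
        raise NotImplementedError
```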

My plan is largely to follow the PyTorch and tinygrad inference engine implementations, with the one exception of skipping the tokenizer part. The llama.cpp API's tokenizer is tied to an instantiated Llama object. Also, the Tokenizer defined in the other implementations doesn't appear to tokenize inputs; it just applies a chat template in the handle_chat_completions function of the ChatGPT API. I will therefore handle tokenization manually later in the call stack.
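To illustrate why tokenization has to wait: in llama-cpp-python the tokenizer is a method on the loaded model, not a standalone object. The model path below is just a placeholder.

```python
from llama_cpp import Llama

# Tokenization hangs off the Llama instance, so it can only run once the
# model is loaded. vocab_only avoids loading the full weights when only the
# tokenizer is needed; the path is a placeholder.
llm = Llama(model_path="models/llama-3-8b-instruct.Q4_K_M.gguf", vocab_only=True)

tokens = llm.tokenize("Hello, exo!".encode("utf-8"))  # -> list[int]
text = llm.detokenize(tokens).decode("utf-8")         # bytes -> str
print(tokens, text)
```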

I will let you know if I have any further questions, and I'm aiming to have at least one working model running inference later today.