bayedieng opened 2 weeks ago
Hey @bayedieng just checking in. Anything I can help with to move this along?
Hey @AlexCheema, I was indeed having trouble understanding the codebase at first, but it's clearer now (inheritance can be confusing). I've written a basic sharded inference engine class and will proceed with the implementation.
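For reference, the skeleton I have in mind looks roughly like this. The class and method names here are my own placeholders rather than exo's actual interface, and the llama.cpp calls are only indicated in comments:

```python
from abc import ABC, abstractmethod


class ShardedInferenceEngine(ABC):
    """Placeholder base class mirroring the pytorch/tinygrad engines."""

    @abstractmethod
    def infer_prompt(self, shard, prompt: str):
        """Run the first stage of inference from a raw prompt string."""

    @abstractmethod
    def infer_tensor(self, shard, input_data):
        """Continue inference from an upstream node's hidden-state tensor."""


class LlamaCppShardedInferenceEngine(ShardedInferenceEngine):
    """Sketch only: the real methods would call into llama-cpp-python."""

    def __init__(self):
        # Lazily loaded llama_cpp.Llama instance for the current shard.
        self.model = None

    def infer_prompt(self, shard, prompt: str):
        # Real implementation: load this shard's layers, tokenize via
        # self.model.tokenize(prompt.encode()), evaluate, and return
        # the output tensor for the next node.
        raise NotImplementedError

    def infer_tensor(self, shard, input_data):
        # Real implementation: resume evaluation on this shard's layer
        # range using the tensor handed over by the previous node.
        raise NotImplementedError
```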
My plan is to largely follow the PyTorch and tinygrad inference engine implementations, with the one exception of skipping the tokenizer step for now. The llama.cpp API ties its tokenizer to an instantiated Llama
class. Also, the tokenizer defined in the other implementations doesn't actually tokenize inputs; rather, a chat template is applied in the handle_chat_completions
function of the ChatGPT API. I will implement tokenization manually later in the call stack.
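Concretely, applying the chat template manually before tokenization could look something like this. This is a minimal sketch assuming a Llama-2 style `[INST]` template; the real template varies by model, so production code should read it from the model's metadata rather than hard-coding one:

```python
def apply_chat_template(messages):
    """Format ChatGPT-style messages into a Llama-2 style prompt string.

    Illustration only: assumes the Llama-2 [INST]/<<SYS>> template.
    """
    system = ""
    parts = []
    for msg in messages:
        if msg["role"] == "system":
            # System prompt is folded into the next user turn.
            system = f"<<SYS>>\n{msg['content']}\n<</SYS>>\n\n"
        elif msg["role"] == "user":
            parts.append(f"[INST] {system}{msg['content']} [/INST]")
            system = ""
        elif msg["role"] == "assistant":
            parts.append(f" {msg['content']} ")
    return "".join(parts)


prompt = apply_chat_template([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello"},
])
```

The resulting string would then be passed to the llama.cpp tokenizer once the Llama instance exists.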
I'll let you know if I have any further questions; I'm aiming to have at least one model running inference later today.
This PR adds support for llama.cpp and closes #167.