exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚

[BOUNTY - $500] Llama.cpp inference engine #167

Open · AlexCheema opened this issue 3 weeks ago

AlexCheema commented 3 weeks ago
danny-avila commented 2 weeks ago

Thanks @AlexCheema

Many others and I likely only have Windows systems, and llama.cpp is practically the only option.

MLX is macOS-only, and tinygrad has dropped Windows support:

Windows support has been dropped to focus on Linux and Mac OS. Some functionality may work on Windows but no support will be provided, use WSL instead. source: https://github.com/tinygrad/tinygrad/releases/tag/v0.7.0

For the project's opening statement to be true, it would need to include Windows-based systems, especially old gaming rigs:

Forget expensive NVIDIA GPUs, unify your existing devices into one powerful GPU: iPhone, iPad, Android, Mac, Linux, pretty much any device!

At the very least, a thorough guide on setting up tinygrad via WSL/WSL2 would be appreciated, because this is your only documentation:

Example Usage on Multiple MacOS Devices

bayedieng commented 2 weeks ago

I'd like to look into this. Relatedly, llamafiles might be worth looking into, since they are single binaries that run on multiple desktop OSes without any configuration. Though I'm not sure about Android or iOS support.

AlexCheema commented 2 weeks ago

Go for it!

AlexCheema commented 2 weeks ago

@bayedieng I'd recommend looking at https://github.com/abetlen/llama-cpp-python -- it should hopefully be low level enough to do what we need to do. Also, I'd recommend looking at https://github.com/exo-explore/exo/pull/139 for a minimal implementation of an inference engine that doesn't require explicitly defining every model -- it's a general solution.
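
For reference, a minimal sketch of the kind of token-level control we'd need from llama-cpp-python, using its high-level Llama wrapper (the model path and prompt here are placeholders):

```python
from llama_cpp import Llama

# Placeholder path: any llama.cpp-compatible GGUF model file works here.
llm = Llama(model_path="./models/model.Q4_K_M.gguf", n_ctx=2048)

# Token-level loop: tokenize, evaluate, sample, feed back. This is the
# per-token control an inference engine needs, rather than a one-shot
# create_completion() call.
tokens = llm.tokenize(b"What is the capital of France?")
llm.eval(tokens)

generated = []
for _ in range(32):
    token = llm.sample()
    if token == llm.token_eos():
        break
    generated.append(token)
    llm.eval([token])  # advance the context with the sampled token

print(llm.detokenize(generated).decode("utf-8", errors="ignore"))
```

One caveat: sharding a model across devices means getting at activations around layer boundaries, which the Llama wrapper doesn't expose, so the raw llama_cpp bindings underneath it may be needed.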

bayedieng commented 2 weeks ago

Thanks for the suggestion. Yeah, I had already seen the Python bindings and went ahead and opened a draft PR.
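
Roughly, the shape I'm going for is a thin adapter that hides llama-cpp-python behind an inference-engine-style interface. A hypothetical sketch (class and method names are illustrative, not exo's actual InferenceEngine API or the PR's final code):

```python
from llama_cpp import Llama


class LlamaCppInferenceEngine:
    """Hypothetical adapter: llama-cpp-python behind a minimal
    inference-engine-style interface (illustrative, not exo's API)."""

    def __init__(self, model_path: str, n_ctx: int = 2048):
        self.llm = Llama(model_path=model_path, n_ctx=n_ctx)

    def infer_prompt(self, prompt: str, max_tokens: int = 64) -> str:
        # One-shot completion for simplicity; a real engine would stream
        # tokens and carry inference state between calls.
        out = self.llm(prompt, max_tokens=max_tokens)
        return out["choices"][0]["text"]


# Usage (path is a placeholder):
# engine = LlamaCppInferenceEngine("./models/model.Q4_K_M.gguf")
# print(engine.infer_prompt("Q: What is 2 + 2? A:"))
```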

thegodone commented 10 hours ago

I wonder if WebGPU could be plugged in on top of llama.cpp via the https://github.com/AnswerDotAI/gpu.cpp wrapper?