containers / ramalama

The goal of RamaLama is to make working with AI boring.

Switch to https://github.com/abetlen/llama-cpp-python #9

Open ericcurtin opened 3 months ago

ericcurtin commented 3 months ago

Right now we call llama.cpp directly. Long-term we should go with either llama.cpp directly or llama-cpp-python, because maintaining two different llama.cpp backends isn't ideal: they will never be in sync from a version perspective, and it means more maintenance.

The APIs of llama-cpp-python seem to be more stable; if we can get it to behave the same as the current implementation, we should consider switching.

Tagging @MichaelClifford as he suggested the idea and may be interested.

rhatdan commented 3 months ago

I would prefer to go with the python route.

ericcurtin commented 3 months ago

> I would prefer to go with the python route.

I agree. The main problem we have right now is that the --instruct option in direct llama.cpp was very useful for creating daemonless, interactive, terminal-based chatbots:

llama-main -m model --log-disable --instruct

They have actually removed this --instruct option from llama.cpp within the last month.

I briefly tried to do the same with llama-cpp-python, but I couldn't get something working that behaved well across a wide array of models the way --instruct did. I only tried for an hour or so, though; I'm sure someone could figure this out, since it's something several projects have already done in one form or another.
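
For illustration, something like the sketch below is roughly the shape of loop I mean; the model path, context size, and system prompt are just placeholders, not settled choices:

```python
# Minimal sketch of a daemonless, terminal-based chat loop with llama-cpp-python.
# The model path and parameters are placeholders and would need to match
# whatever model RamaLama has pulled.
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/model.gguf",  # placeholder path
    n_ctx=2048,
    verbose=False,
)

messages = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    try:
        prompt = input("> ")
    except EOFError:
        break
    messages.append({"role": "user", "content": prompt})
    # create_chat_completion() applies the model's chat template when the
    # GGUF metadata provides one, which is roughly what --instruct gave us.
    response = llm.create_chat_completion(messages=messages)
    reply = response["choices"][0]["message"]["content"]
    print(reply)
    messages.append({"role": "assistant", "content": reply})
```

The open question is whether a generic loop like this behaves as consistently across models as --instruct did.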

ericcurtin commented 3 months ago

Tagging @abetlen , we also sent an email with more details to abetlen@gmail.com

MichaelClifford commented 3 months ago

Hi @ericcurtin 👋 I agree, we should go with only one and not try to support both backends. That said, I don't have a very strong opinion as to which. We currently use llama-cpp-python in the recipes as well as the extensions' playground. So if we want to keep things consistent, it probably does make the most sense to stick with llama-cpp-python here too.

My only hesitation with llama-cpp-python is that it is another layer of abstraction between us and llama.cpp that we will need to rely on. There have been a few instances in the past (getting the Granite models working, for example) where llama-cpp-python lagged a bit behind llama.cpp.

So really, I'm open to either approach. Let's figure out what RamaLama's requirements are and pick the tool that works best for us 😄

Ben-Epstein commented 1 month ago

For what it's worth, running on macOS Sequoia (M3), llama-cpp-python consistently fails on my machine, but RamaLama in its current form works. It might be worth testing whether that holds true across more Apple silicon machines before switching.

ericcurtin commented 1 month ago

Yeah... to be honest, at this point, if we do add this, it will probably just be another --runtime option, like --runtime llama-cpp-python.
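
Purely as an illustration (the runtime value and model name here are hypothetical, nothing is decided), the invocation might look like:

ramalama --runtime llama-cpp-python run granite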

rhatdan commented 2 weeks ago

No one is working on this; is it something we should still consider?

ericcurtin commented 2 weeks ago

llama-cpp-python does appear to implement a more feature-complete OpenAI-compatible server than the direct llama.cpp one, but I don't know for sure:

https://llama-cpp-python.readthedocs.io/en/latest/server/

It also implements multi-model server support.

I'm unsure; maybe we should consider it. This is one of those Python things that we could probably run fine in a container, and we'd probably only want to give it read-only model access.
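
If we went the server route, something like the sketch below is roughly what I mean; the container image, port, and model path are placeholders, and the endpoint is the OpenAI-compatible one described in the docs linked above:

```python
# Minimal sketch of talking to a llama-cpp-python OpenAI-compatible server.
# Assumes the server was started separately, e.g. in a container with a
# read-only model mount (image name and paths are placeholders):
#   podman run --rm -p 8000:8000 -v ~/models:/models:ro <image-with-llama-cpp-python> \
#       python3 -m llama_cpp.server --model /models/model.gguf --host 0.0.0.0
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "Say hello in one sentence."}
        ],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```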

ericcurtin commented 2 weeks ago

I don't mind either way, leaving this open or closing it.