Thank you for submitting the feature request. This aligns well with the need for more precise control over how models are served. I recommend running the model serving backend independently and connecting Tabby to it through an HTTP backend; see our documentation at https://tabby.tabbyml.com/docs/administration/model/#llamacpp for a concise guide. For example, you can launch the backend with llama.cpp and control how many layers are offloaded to the GPU with the -ngl flag.
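As a rough sketch (assuming a llama.cpp server started separately, e.g. with -ngl 0 so the chat model stays entirely on the CPU, and listening on llama.cpp's default port 8080), the Tabby side could look something like the config below. The kind value mirrors the openai/chat examples later in this thread; the model name and host/port are placeholders, so please check the linked docs for the exact fields in your version:

# Hypothetical sketch, not the documented recipe: point Tabby's chat model
# at a llama.cpp server you started yourself (llama.cpp exposes an
# OpenAI-compatible API under /v1).
[model.chat.http]
kind = "openai/chat"
model_name = "my-chat-model"               # placeholder; match your llama.cpp setup
api_endpoint = "http://localhost:8080/v1"  # placeholder host/port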
Should you face any challenges during your experimentation, please don't hesitate to share them here.
Thanks for your response. Am I correct in understanding that your proposal is to run llama.cpp outside of Tabby's container and point Tabby to that server for chat completions? Or is this something that Tabby can/will do within the Docker container?
Either way is possible - though you need to handle the orchestration (at the process level within a container, or at the container level) carefully.
I found that the tabby serve command has a --chat-device option that seems to be exactly what I was looking for.
However, it doesn't seem to be working for me in 0.12.0: if --device is cuda, I still see both models placed into GPU memory even when --chat-device cpu is set.
(I also tried running a separate Tabby Docker instance for the chat model (in CPU mode), while pointing the main Docker instance to it with [model.chat.http]. However, I am currently blocked from testing this workaround by #2422.)
You can consider the ollama backend (https://tabby.tabbyml.com/docs/administration/model/#ollama) for this purpose; I am using it partly for that. It has an environment setting that controls how many models can be loaded at a time, which defaults to 1. This works well if you are okay with only one model (completion or chat) being active at a time: ollama will automatically unload the completion model and load the chat model to fulfill a chat request, then switch back to the completion model when a new completion request comes in. I find this more convenient for models with ~10+ B parameters, because temporarily switching between them is much faster than running such a model on the CPU only.
This might not be the best place to ask, but since I got the idea from this thread, I'll ask here. I tried using ollama for the "load only one model" approach, but with the config from the documentation, chat completions do not work: Tabby does a POST to /chat/completions, which returns a 404.
Config:
[model.completion.http]
model_id = "Code"
kind = "ollama/completion"
model_name = "codellama:7b"
api_endpoint = "http://127.0.0.1:11434"
prompt_template = "<PRE> {prefix} <SUF>{suffix} <MID>"
[model.chat.http]
model_id = "Chat"
kind = "openai/chat"
model_name = "mistral:7b"
api_endpoint = "http://localhost:11434"
Actual code completion in the IDE does work. I'd appreciate any help on this, as I assume it should not be too complicated to set up. However, if this is too much to discuss in this issue, I'll gladly move it somewhere else.
According to https://ollama.com/blog/openai-compatibility, is it possible you need to append /v1 to the api_endpoint in your configuration? e.g.
[model.chat.http]
model_id = "Chat"
kind = "openai/chat"
model_name = "mistral:7b"
api_endpoint = "http://localhost:11434/v1"
Ah, yes. That was it; I didn't dig enough, sorry for the noise, and thanks, it works fine now!
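Putting the two pieces above together, the working configuration from this thread ends up as follows (only the chat endpoint needs the OpenAI-compatible /v1 suffix; the completion endpoint continues to use ollama's native API):

# Both models served by the same local ollama instance.
[model.completion.http]
model_id = "Code"
kind = "ollama/completion"
model_name = "codellama:7b"
api_endpoint = "http://127.0.0.1:11434"
prompt_template = "<PRE> {prefix} <SUF>{suffix} <MID>"

[model.chat.http]
model_id = "Chat"
kind = "openai/chat"
model_name = "mistral:7b"
api_endpoint = "http://localhost:11434/v1"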
In case you wanna share your setup - feel free to start a discussion thread in https://github.com/TabbyML/tabby/discussions/categories/show-and-tell, thank you!
Please describe the feature you want
I've been using a large completion model with my GPU. I'd like to add a chat model as well, but there's not enough GPU memory for the large completion model plus a reasonably sized chat model. Since the latter is less latency-sensitive, it would seem to make sense to put it on the CPU; that way I don't have to sacrifice completion speed or quality. But I don't see a way (at least with Docker) to put the models on different devices. Am I missing something? This would seem to be a useful feature for many.
Please reply with a 👍 if you want this feature.