TabbyML / tabby

Self-hosted AI coding assistant
https://tabbyml.com

bug: `--chat-device` option broken (Mixed GPU + CPU for completion + chat models) #2527

Open · jtbr opened this issue 4 months ago

jtbr commented 4 months ago

Please describe the feature you want

I've been using a large completion model with my GPU. I'd like to add a chat model as well, but there isn't enough GPU memory for the large completion model plus a reasonably sized chat model. Since the chat model is less latency-sensitive, it would make sense to put it on the CPU; that way I don't have to sacrifice completion speed or performance. But I don't see a way (at least with Docker) to put the models on different devices. Am I missing something? This would seem to be a useful feature for many.


Please reply with a 👍 if you want this feature.

wsxiaoys commented 4 months ago

Thank you for submitting the feature request. This aligns well with the need for more precise control over how the model is served. I recommend initiating the model serving backend independently and connecting Tabby to it through an HTTP backend. For a concise guide, please visit our documentation at https://tabby.tabbyml.com/docs/administration/model/#llamacpp. For example, you can launch the model serving backend using llama.cpp and manage the number of layers processed on the GPU with the -ngl flag.

Should you face any challenges during your experimentation, please don't hesitate to share them here.
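
To make this concrete, here is a rough sketch of the kind of setup I mean, not an exact recipe: the binary name, model file, port, and the llama.cpp/chat kind should be checked against the linked documentation and your llama.cpp build.

# Serve the chat model on CPU only by offloading zero layers to the GPU (-ngl 0)
./llama-server -m ./models/chat-model.Q4_K_M.gguf -ngl 0 --port 8012

# config.toml: keep the completion model on Tabby's own GPU runtime,
# and point the chat model at the external llama.cpp server
[model.chat.http]
kind = "llama.cpp/chat"
model_name = "chat-model"
api_endpoint = "http://localhost:8012"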

jtbr commented 4 months ago

Thanks for your response. Am I correct in understanding that your proposal is to run llama.cpp outside of Tabby's container and point Tabby to that server for chat completion? Or is this something that Tabby can/will do within the Docker container?

wsxiaoys commented 4 months ago

Either way is possible, though you need to handle the orchestration carefully (either at the process level within the container, or at the container level).

jtbr commented 4 months ago

I found that the tabby serve command has a --chat-device option that seems to be exactly what I was looking for.

However, it doesn't seem to work for me in 0.12.0: if --device is cuda, both models are still placed in GPU memory even when --chat-device cpu is set.

(I also tried running a separate Tabby Docker instance for the chat model (in CPU mode), while pointing the main Docker instance to it with [model.chat.http]. However, I am currently blocked from testing this workaround by #2422.)
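
For reference, the kind of invocation I'm testing looks roughly like the following (model names are placeholders for my actual models):

tabby serve --device cuda --chat-device cpu \
    --model <completion-model> --chat-model <chat-model>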

SpeedCrash100 commented 4 months ago

You can consider the Ollama backend (https://tabby.tabbyml.com/docs/administration/model/#ollama) for this purpose; I am using it partly for that. It has an environment variable that controls how many models can be loaded at a time, which defaults to 1. This works well if you are OK with only one model (completion or chat) being active at a time: Ollama automatically unloads the completion model and loads the chat model to fulfill a chat request, then switches back to the completion model when a new completion request comes in. I find this more convenient for models of ~10+ B parameters, because temporarily switching models is much, much faster than running the model on CPU only.
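
If it helps, this is roughly how I limit Ollama to one resident model; verify the variable name and its default against the Ollama documentation for your version:

# Keep at most one model loaded, so Ollama swaps between the
# completion and chat models on demand
OLLAMA_MAX_LOADED_MODELS=1 ollama serve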

CleyFaye commented 2 months ago

This might not be the best place, but since I got the idea here, I'll ask. I tried using Ollama to do the "load only one model" thing, but with the config from the documentation, chat completions do not work: Tabby does a POST to "/chat/completions", which returns a 404.

Config:

[model.completion.http]
model_id = "Code"
kind = "ollama/completion"
model_name = "codellama:7b"
api_endpoint = "http://127.0.0.1:11434"
prompt_template = "<PRE> {prefix} <SUF>{suffix} <MID>"

[model.chat.http]
model_id = "Chat"
kind = "openai/chat"
model_name = "mistral:7b"
api_endpoint = "http://localhost:11434"

Actual code completion in the IDE does work. I'd appreciate any help on this, as I assume it should not be too complicated to set up. However, if this is too much to discuss in this issue, I'll gladly move it somewhere else.

wsxiaoys commented 2 months ago

According to https://ollama.com/blog/openai-compatibility, is it possible that you need to append /v1 to your api_endpoint? e.g.

[model.chat.http]
model_id = "Chat"
kind = "openai/chat"
model_name = "mistral:7b"
api_endpoint = "http://localhost:11434/v1"

CleyFaye commented 2 months ago

Ah, yes, that was it; I didn't dig enough. Sorry for the noise, and thanks, it works fine now!

wsxiaoys commented 2 months ago

In case you wanna share your setup - feel free to start a discussion thread in https://github.com/TabbyML/tabby/discussions/categories/show-and-tell, thank you!