TabbyML / tabby

Self-hosted AI coding assistant
https://tabby.tabbyml.com/

bug: `--chat-device` option broken (Mixed GPU + CPU for completion + chat models) #2527

Open · jtbr opened 5 days ago

jtbr commented 5 days ago

Please describe the feature you want

I've been using a large completion model with my GPU. I'd like to add a chat model as well, but there isn't enough GPU memory for the large completion model plus a reasonably sized chat model. Since the chat model is less latency-sensitive, it would seem to make sense to put it on the CPU; that way I don't have to sacrifice completion speed or quality. But I don't see a way (at least with Docker) to place the models on different devices. Am I missing something? This would seem to be a useful feature for many.


Please reply with a 👍 if you want this feature.

wsxiaoys commented 5 days ago

Thank you for submitting the feature request. This aligns well with the need for more precise control over how the model is served. I recommend starting the model serving backend independently and connecting Tabby to it through an HTTP backend. For a concise guide, please see our documentation at https://tabby.tabbyml.com/docs/administration/model/#llamacpp. For example, you can launch the model serving backend with llama.cpp and control how many layers are offloaded to the GPU with the -ngl flag.
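As a rough sketch of that setup (binary name, port, model path, and the exact `kind` value are illustrative; follow the linked docs page for the authoritative fields): start a CPU-only llama.cpp server for the chat model, then point Tabby's chat model at it over HTTP while completion keeps running locally on the GPU.

```bash
# CPU-only chat backend: -ngl 0 keeps all layers off the GPU.
# Model path and port are placeholders.
./llama-server -m ./models/chat-model.gguf -ngl 0 --host 0.0.0.0 --port 8012
```

```toml
# ~/.tabby/config.toml: completion stays local (GPU), chat goes to the server above.
[model.chat.http]
kind = "llama.cpp/chat"                 # exact kind value per the linked docs
api_endpoint = "http://localhost:8012"
```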

Should you face any challenges during your experimentation, please don't hesitate to share them here.

jtbr commented 5 days ago

Thanks for your response. Am I correct in understanding that your proposal is to run llama.cpp outside of Tabby's container and point Tabby at that server for chat completion? Or is this something that Tabby can/will do within the Docker container?

wsxiaoys commented 5 days ago

Either way is possible - though you need to handle the orchestration (process level within the container, or container level) carefully.
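For the container-level route, a compose file along these lines is one way to wire it up (image tags, model names, and paths below are illustrative and untested):

```yaml
# docker-compose.yml sketch: GPU-backed Tabby for completion,
# CPU-only llama.cpp container for chat.
services:
  tabby:
    image: tabbyml/tabby
    command: serve --model MyLargeCompletionModel --device cuda
    volumes:
      - ./tabby-data:/data          # contains config.toml with [model.chat.http]
    ports:
      - "8080:8080"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  chat-llm:
    image: ghcr.io/ggerganov/llama.cpp:server   # image/tag is an assumption
    command: -m /models/chat-model.gguf -ngl 0 --host 0.0.0.0 --port 8012
    volumes:
      - ./models:/models
    ports:
      - "8012:8012"
```

The Tabby instance's config.toml would then point its chat backend at `http://chat-llm:8012`, as in the earlier snippet.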

jtbr commented 5 days ago

I found that the tabby serve command has a --chat-device option that seems to be exactly what I was looking for.

However, it doesn't seem to work for me in 0.12.0: if --device is cuda, both models are still placed in GPU memory even when --chat-device cpu is set.
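For reference, this is roughly the invocation (model names are placeholders for the ones I actually use):

```bash
# Expectation: completion model on GPU, chat model on CPU.
# Observed in 0.12.0: both end up in GPU memory.
docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby serve \
  --model MyLargeCompletionModel \
  --chat-model MyChatModel \
  --device cuda \
  --chat-device cpu
```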

(I also tried running a separate Tabby Docker instance for the chat model (in CPU mode) and pointing the main instance at it with [model.chat.http]. However, I'm currently blocked from testing this workaround by #2422.)
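(For completeness, the workaround config on the main GPU instance looks roughly like this; the kind and endpoint path are guesses I haven't been able to verify because of the issue above:)

```toml
# Main (GPU) instance's config.toml: forward chat requests to the
# CPU-only Tabby instance. kind / api_endpoint are unverified guesses.
[model.chat.http]
kind = "openai/chat"
api_endpoint = "http://cpu-tabby:8080/v1/chat/completions"
model_name = "MyChatModel"
```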

SpeedCrash100 commented 5 days ago

You can consider the ollama backend (https://tabby.tabbyml.com/docs/administration/model/#ollama) for this purpose; I'm using it partly for that. It has an environment variable that controls how many models can be loaded at a time, which defaults to 1. This works well if you're OK with only one model (completion or chat) being active at a time: ollama automatically unloads the completion model and loads the chat model to fulfill a chat request, then switches back when a new completion request arrives. I find it more convenient for models of ~10B+ parameters, because temporarily switching is much, much faster than running the model on CPU only.
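A rough sketch of that setup (the environment variable I mean is presumably OLLAMA_MAX_LOADED_MODELS; the config fields and model names below are examples, check the linked docs for the exact schema):

```toml
# ~/.tabby/config.toml: serve both models through a local ollama instance,
# which loads/unloads them on demand.
[model.completion.http]
kind = "ollama/completion"
api_endpoint = "http://localhost:11434"
model_name = "codellama:13b"                           # example model
prompt_template = "<PRE> {prefix} <SUF>{suffix} <MID>" # FIM template for that model

[model.chat.http]
kind = "ollama/chat"
api_endpoint = "http://localhost:11434"
model_name = "llama3:8b"                               # example model
```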