Lightning-AI / LitServe

Lightning-fast serving engine for AI models. Flexible. Easy. Enterprise-scale.
https://lightning.ai/docs/litserve
Apache License 2.0

Add Keep-Alive Functionality for GPU Resource Optimization in LitServe #304

Open skyking363 opened 4 days ago

skyking363 commented 4 days ago

🚀 Feature

I would like to propose adding a feature to LitServe that enables models to be deployed with a keep-alive functionality, similar to what Ollama provides. This feature would allow the model to be unloaded from GPU memory when not in use and automatically loaded back when required.

Motivation

This feature would be helpful for users working with limited GPU resources. Currently, the GPU can become a bottleneck when multiple models are deployed. By releasing the GPU resources when a model is idle and reloading them on demand, we could improve efficiency and free up resources for other tasks.

Pitch

The main objective is to add a mechanism, perhaps through environment variables, that allows the system to automatically unload models when idle and reload them when needed, similar to Ollama's keep-alive functionality.

Alternatives

An alternative solution could involve manually managing GPU resources at the deployment level, but this can be cumbersome and error-prone. Automation via LitServe would streamline this process.

Additional context

This idea is inspired by a similar feature discussed in the Ollama repository: Ollama keep-alive environment variables. It could significantly optimize resource usage in environments where GPUs are scarce.

aniketmaurya commented 4 days ago

hi @skyking363, thank you for your interest in LitServe and for suggesting a new feature. LitServe is designed for serving high-throughput servers at scale, while Ollama is intended to run LLMs on personal devices.

Tagging @lantiga @williamFalcon to hear their thoughts.

aceliuchanghong commented 4 days ago

I am looking for this feature too.

I was trying to delete the model or empty the GPU, but it doesn't work.

Maybe it could be an option. That would be great.

Thank you for reading and replying.

aniketmaurya commented 3 days ago

hi @aceliuchanghong, thank you for adding to the discussion. A few questions:

williamFalcon commented 3 days ago

@skyking363 @aceliuchanghong thanks for your requests!

can you explain the motivation a bit more clearly with a concrete example?

  • are you running multiple servers on the same GPU?
  • what problem is this solving for you? keeping GPU RAM free? if so, why? are you running other servers on the machine?
  • are you expecting the processes to disconnect also?
  • remember ollama is very different from lightning. if all you are doing is serving an llm (like ollama) then turning a server on/off might be okay. But what if you are serving something more complex like RAG with multiple models, DB connections, vector caches, etc… which model do you offload? when? why?

etc….

basically i have about a million questions here haha. So, it would be better to understand concretely based on a real-world example that shows what problem you want to solve and how this would solve it (a lifecycle diagram might help too).

aceliuchanghong commented 3 days ago

It happens when I use a visual model to OCR some complicated images, so I use LitServe as the API for it.

But I only use it occasionally, so I want the GPU to be free most of the time when I'm not using it. I want the API service to load the model weights when I send a request, and then, after maybe 5 or 10 minutes with no requests, release the GPU memory.

Thank you for replying~

To add: I only have one machine with 4 L20s, so there are many services running on it all the time... xd

grumpyp commented 3 days ago

@aniketmaurya Would it make sense to introduce some kind of model unloading when the server has gone a certain amount of time without any requests, and then lazy-load the model back into memory, similar to an idle state?

I haven't thought of an implementation scenario yet, but as @aceliuchanghong mentions, some people run many services on one machine.

aniketmaurya commented 1 day ago

> I haven't thought of an implementation scenario yet, but as @aceliuchanghong mentions, some people run many services on one machine.

I think the main question here is: would you do this in a production environment?

aceliuchanghong commented 1 day ago

> > I haven't thought of an implementation scenario yet, but as @aceliuchanghong mentions, some people run many services on one machine.
>
> I think the main question here is: would you do this in a production environment?

Yeah, we use LitServe in our production environment. That's actually why I don't use FastAPI or something else: LitServe supports LLMs (etc.) very well.

skyking363 commented 18 hours ago

> hi @skyking363, thank you for your interest in LitServe and for suggesting a new feature. LitServe is designed for serving high-throughput servers at scale, while Ollama is intended to run LLMs on personal devices.
>
>   • You can also use Ollama along with LitServe by loading the Ollama client in the LitAPI.setup method.
>   • Incorporating this feature directly into LitServe would take it in a different direction than our target.
>
> Tagging @lantiga @williamFalcon to hear their thoughts.

Thank you for your reply. I currently choose to use LitServe instead of Ollama for two main reasons:

  1. LitServe offers more flexibility than Ollama, such as the ability to return both sparse and dense embedding vectors during the embedding process, which Ollama cannot do.
  2. LitServe demonstrates superior GPU efficiency and throughput, which is crucial for my application needs.

This is why I prefer to move away from Ollama, and therefore I won't be adopting the suggestion to load the Ollama client.

Thank you again for your suggestions and support!

skyking363 commented 18 hours ago

> @skyking363 @aceliuchanghong thanks for your requests!
>
> can you explain the motivation a bit more clearly with a concrete example?
>
>   • are you running multiple servers on the same GPU?
>   • what problem is this solving for you? keeping GPU RAM free? if so, why? are you running other servers on the machine?
>   • are you expecting the processes to disconnect also?
>   • remember ollama is very different from lightning. if all you are doing is serving an llm (like ollama) then turning a server on/off might be okay. But what if you are serving something more complex like RAG with multiple models, DB connections, vector caches, etc… which model do you offload? when? why?
>
> etc….
>
> basically i have about a million questions here haha. So, it would be better to understand concretely based on a real-world example that shows what problem you want to solve and how this would solve it (a lifecycle diagram might help too).

Thank you for your reply, @williamFalcon !

To provide a more concrete example of my use case:

I am running multiple services on a machine with 8 A100 GPUs. These services involve running multiple LLMs simultaneously (e.g., Llama 3.1 405B, Llama 3.2 90B, etc.), which are either used for user chat interactions or periodic tasks (such as ingesting data into a database). Additionally, I have some long-running API services that utilize multiple models, including visual models for tasks like optical character recognition (OCR). However, these models are not always in use—there are often long idle periods between requests.

My goal is to release GPU memory during these idle periods so that other services can utilize the resources without shutting down the API service itself. Ideally, the models would automatically load when a request comes in and unload after a prolonged period of inactivity. This way, we can more efficiently utilize GPU resources without manual management or service restarts.
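The load-on-request / unload-after-idle lifecycle described above can be sketched as a small state machine (the class and method names are illustrative only, not a LitServe API; a real server would poll `maybe_unload` from a background thread):

```python
import time

class IdleTracker:
    """Minimal sketch of the requested lifecycle: record the time of the
    last request and unload once the idle window (e.g. the 5-10 minutes
    mentioned earlier in this thread) has elapsed with no traffic."""

    def __init__(self, idle_seconds):
        self.idle_seconds = idle_seconds
        self.last_request = time.monotonic()
        self.loaded = False

    def on_request(self):
        # Every incoming request resets the idle clock and ensures the
        # model is resident; a real server would load weights here.
        self.last_request = time.monotonic()
        self.loaded = True
        return self.loaded

    def maybe_unload(self):
        # Called periodically; a real server would free GPU memory here.
        if self.loaded and time.monotonic() - self.last_request > self.idle_seconds:
            self.loaded = False
        return self.loaded
```

The open design questions from earlier in the thread (which model to offload in a multi-model RAG service, and when) would sit on top of this basic mechanism.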

This mechanism would allow us to manage limited GPU resources more flexibly and efficiently, especially when running services involving RAG or multi-model combinations. Of course, I understand that in more complex production environments, automatic unloading may not always be appropriate, but in scenarios where models are only used at specific times, this feature could be extremely beneficial.

Thank you again for your detailed response and suggestions! I will consider using a lifecycle diagram to further clarify how this functionality could be implemented.