acon96 / home-llm

A Home Assistant integration & Model to control your smart home using a Local LLM

Add "keep_alive" parameter for the Ollama API #76

Closed: tannisroot closed this issue 4 months ago

tannisroot commented 4 months ago

Please describe what you are trying to do with the component
Ollama has a parameter called "keep_alive" that controls how long it keeps the model loaded in memory. However, according to https://github.com/ollama/ollama/pull/2146#issuecomment-1911274831, even if you set the parameter to your desired default value in the web UI, any subsequent /api/generate call that does not include keep_alive will reset it back to 5m (i.e. the model will be unloaded from memory after 5 minutes), which is undesirable when the model is used as a conversation agent.

Describe the solution you'd like
I think it would be nice to have a keep_alive option in the configuration menu. Personally I would like to set it to -1m so that the model I use is never unloaded, reducing the latency of the responses I get from the integration.
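
For reference, a minimal sketch (not this integration's actual code) of passing keep_alive on an /api/generate call, using Python's requests library; the host, model, and prompt are placeholders:

```python
# Sketch only: send keep_alive with every /api/generate request so Ollama
# does not fall back to its 5 minute default. Host, model, and prompt are
# placeholders, not values used by the integration.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Turn off the kitchen lights.",
        "stream": False,
        # "-1m" (or -1) keeps the model loaded indefinitely; omitting the
        # field reverts Ollama to its default of 5 minutes.
        "keep_alive": "-1m",
    },
    timeout=120,
)
print(response.json()["response"])
```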

acon96 commented 4 months ago

I'm partial to just hard-coding keep_alive to -1m for all requests to Ollama, because if the model is unloaded, the integration will just fail until Home Assistant is restarted.

acon96 commented 4 months ago

I just pushed a fix for this to the develop branch; it will be in the next release.

fixtse commented 4 months ago

I think this should be added as a configurable option rather than just hard-coded. When the model gets unloaded, the next time you ask the Assistant something, it simply loads it again; it doesn't fail (yes, it takes a moment, but at least on my hardware it's just a couple of seconds [3080 Ti]). I share the GPU between multiple containers, so having the model loaded all the time is not ideal.

acon96 commented 4 months ago

> I think this should be added as a configurable option rather than just hard-coded. When the model gets unloaded, the next time you ask the Assistant something, it simply loads it again; it doesn't fail (yes, it takes a moment, but at least on my hardware it's just a couple of seconds [3080 Ti]). I share the GPU between multiple containers, so having the model loaded all the time is not ideal.

Well, that's interesting behavior. I just wish it didn't default to 5 minutes. I'll hook it up to an OptionsFlow variable before the next release.
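
For illustration, a rough sketch of what wiring keep_alive into a Home Assistant options flow could look like; the option name, default, and class layout are assumptions, not the integration's actual implementation:

```python
# Illustrative sketch only; the option key, default value, and handler shape
# are assumptions, not the integration's actual code.
import voluptuous as vol
from homeassistant import config_entries

CONF_KEEP_ALIVE = "keep_alive"
DEFAULT_KEEP_ALIVE = "5m"  # Ollama's default when the parameter is omitted


class OptionsFlowHandler(config_entries.OptionsFlow):
    def __init__(self, config_entry):
        self.config_entry = config_entry

    async def async_step_init(self, user_input=None):
        if user_input is not None:
            # Persist the chosen keep_alive value in the entry's options
            return self.async_create_entry(title="", data=user_input)

        return self.async_show_form(
            step_id="init",
            data_schema=vol.Schema(
                {
                    vol.Optional(
                        CONF_KEEP_ALIVE,
                        default=self.config_entry.options.get(
                            CONF_KEEP_ALIVE, DEFAULT_KEEP_ALIVE
                        ),
                    ): str,
                }
            ),
        )
```

The value stored here would then be forwarded on every /api/generate request, as in the earlier example.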

acon96 commented 4 months ago

OK, the functionality is in v0.2.7. I just need to document it properly.

acon96 commented 4 months ago

closing