Open · Bram-diederik opened this issue 8 months ago
Would also like this. I've got Ollama using a Tesla M60, but I access it from different endpoints (HA automations, LibreChat GUI, Ollama CLI) using different models, so it would be handy to be able to unload models faster!
@dansharpy the latest Ollama version helps you out here: you can set the OLLAMA_KEEP_ALIVE environment variable.
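For anyone else hunting for it: the variable has to be set on the process that actually runs ollama serve. A rough sketch, assuming a plain shell or Docker setup (adapt to however you run Ollama):

```sh
# Keep loaded models in memory indefinitely (use e.g. "24h" or "5m" for a
# timeout instead, or 0 to unload immediately after each request).
export OLLAMA_KEEP_ALIVE=-1
ollama serve

# Or, when running the official Docker image:
docker run -d -e OLLAMA_KEEP_ALIVE=-1 -p 11434:11434 ollama/ollama
```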
For me, I need a single model set to -1, so it is not perfect for my case.
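If it helps as a workaround: the API also accepts keep_alive per request, so you can pin just the one model while everything else follows the global setting. Untested sketch, with llama3 only as an example model name:

```sh
# An empty generate request loads the model without producing output;
# keep_alive: -1 asks Ollama to keep this one model resident indefinitely.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "keep_alive": -1
}'
```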
Thanks for this, I was looking for an environment variable I could set but couldn't find it. Is there a list in the docs somewhere I've missed? I set it to 0 in Ollama, which seems to work fine in the LibreChat GUI (i.e. it unloads the model as soon as it has completed a request), but when a request is sent from this integration in HA it keeps it loaded. I can only assume this integration sends a keep_alive parameter in its request, which overrides the environment variable.

Edit: Just been looking at the logs, and it seems the LibreChat GUI sends API calls to the /v1/chat endpoint while this integration sends them to /api/chat. I wonder if that has something to do with it not respecting the environment variable?
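Worth knowing when debugging this: a keep_alive sent in the request body takes precedence over OLLAMA_KEEP_ALIVE, so if the integration sets one, the environment variable is ignored for its calls. Something like this reproduces that from the command line (sketch only; the model name is an example):

```sh
# Even with OLLAMA_KEEP_ALIVE=0 on the server, this request keeps the
# model loaded for 10 minutes, because the per-request value wins.
curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [{"role": "user", "content": "hello"}],
  "keep_alive": "10m"
}'
```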
Checklist
Is your feature request related to a problem? Please describe.
The first time I run a prompt it takes long, 30 or 40 seconds. Every subsequent one runs in about 10 seconds.
I tried some curl commands provided in the Ollama FAQ to preload the model (like the sketch below), but with no luck.
Perhaps it has to do with the prompt or the session or something :/
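For reference, the FAQ's preload trick is a generate request with no prompt, roughly like this (model name is just an example):

```sh
# Loads the model into memory without generating a response.
curl http://localhost:11434/api/generate -d '{"model": "mistral"}'
```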
But could you add the keep_alive parameter as an option? I have a CPU-only system but plenty of RAM.
Describe the solution you'd like
An option to set keep_alive as described in the Ollama FAQ.
Describe alternatives you've considered
The curl commands described in the FAQ.
Additional context
I think it is complete.