Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, and more.
https://anythingllm.com
MIT License

[FEAT]: Ollama `n_ctx` for VRAM allocation and performance on responses #1991

Closed: coniuc2d closed this issue 1 month ago

coniuc2d commented 1 month ago

How are you running AnythingLLM?

Docker (remote machine)

What happened?

Installed Ollama using the provided script (Linux version, Ubuntu 22.04) and installed AnythingLLM using the provided easy script and Docker. Everything runs great, however I noticed that out of the 8 GB of VRAM on my 5700 XT only 74% is reserved, no matter what I set in AnythingLLM. Before you shout at me: I'm a retired plumber, and it took me two days to check this out. Give me a break if I made a mistake with the config :)

Are there known steps to reproduce?

In ollama serve, using /set parameter num_ctx 128000, Ollama takes all my VRAM and close to 22 GB of RAM. Using /set parameter num_ctx 11200, Ollama takes 99% of VRAM and the responses are much, much better. Using the default settings (for newbies like me :)), only 74% of VRAM is reserved, and the responses are worse than above.

It looks like AnythingLLM is not forwarding the context-size change to Ollama. Whatever you set, the default llama3.1:latest stays at 1024.
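
(For anyone trying to reproduce the numbers above: the /set commands are typed at the interactive prompt that ollama run opens, and the values below are the ones from this report, so treat it as a rough sketch rather than a recipe.)

```
# Rough repro of the experiment described above; /set parameter is entered
# at the prompt that `ollama run` opens.
ollama run llama3.1:latest
>>> /set parameter num_ctx 128000   # filled all 8 GB of VRAM plus ~22 GB of system RAM
>>> /set parameter num_ctx 11200    # ~99% of VRAM, noticeably better answers
>>> /show parameters                # confirm what is currently set
```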

timothycarambat commented 1 month ago

Settings made in ollama serve do not override the defaults used for API requests, and the API is how AnythingLLM communicates with Ollama.

As you can see here, we do not set numCTX https://github.com/Mintplex-Labs/anything-llm/blob/ae58a2cb0db7c74d9c43c3f8ac748725e28ab31c/server/utils/AiProviders/ollama/index.js#L31
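
For illustration only (a hand-written request against Ollama's HTTP API, not AnythingLLM's code path): Ollama only uses a non-default context size when the request itself carries it in options, so a client that never sends num_ctx gets the model's default.

```
# Hand-written example against Ollama's HTTP API; the num_ctx value is arbitrary.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:latest",
  "prompt": "Hello",
  "options": { "num_ctx": 10240 }
}'
```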

The comment linked below is probably more helpful, though: editing the model's Modelfile in Ollama makes the setting persistent, and this is probably better handled there. https://github.com/ollama/ollama/issues/5965#issuecomment-2252354726
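
Roughly, the approach from that linked comment looks like this (the model name, tag, and num_ctx value below are placeholders, not a recommendation):

```
# Bake num_ctx into a derived model so it persists across API calls.
cat > Modelfile <<'EOF'
FROM llama3.1:latest
PARAMETER num_ctx 10240
EOF
ollama create llama3.1-10k -f Modelfile
# Then select "llama3.1-10k" as the model inside AnythingLLM.
```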

The main reason I hesitate to auto-set n_ctx to the model's max token size is that reserving 100% of VRAM by default may not be what people want from AnythingLLM. Intuitively, I have to assume the Ollama team set that limit for some reason, although it is not outright stated why.

Either way, this definitely falls under "feature/improvement" as opposed to a bug (like https://github.com/Mintplex-Labs/anything-llm/pull/1920), but the comment above should help you keep your settings persistent 👍

coniuc2d commented 1 month ago

Thanks for your patience! For those who (like me) are learning: llama3.1 q8_0 with num_gpu 10 and num_ctx 10240 uses 95% of my 8 GB of VRAM and about 4.5 GB of RAM, and gives 2.5 t/s. Much better responses. I will test embedding tomorrow, but I expect much better results there than on stock settings as well. All this on a 12-core/24-thread Xeon and a Radeon 5700 XT, using Ubuntu 22.04 and the Docker version of the fantastic AnythingLLM.
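
(For anyone copying these settings: they can be entered the same way as the earlier /set commands, and the resulting GPU/CPU split can be checked from another shell. The model tag below is a guess at the q8_0 build; use whichever tag was actually pulled.)

```
ollama run llama3.1:8b-instruct-q8_0   # tag is an assumption; use your local q8_0 build
>>> /set parameter num_gpu 10          # number of layers offloaded to the GPU
>>> /set parameter num_ctx 10240       # context window
# In a second shell: the PROCESSOR column of `ollama ps` shows the CPU/GPU
# split, and rocm-smi shows VRAM usage on an AMD card.
ollama ps
```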

timothycarambat commented 1 month ago

https://github.com/ollama/ollama/issues/1005#issue-1977548240