QIN2DIM opened 1 month ago
Ollama service config (systemd `[Service]` section):

```ini
# ... ↓ keypoint
[Service]
Environment="OLLAMA_KEEP_ALIVE=3h"
Environment="OLLAMA_NUM_PARALLEL=10"
Environment="OLLAMA_MAX_LOADED_MODELS=6"
Environment="OLLAMA_MAX_QUEUE=128"
Environment="OLLAMA_DEBUG=1"
# Environment="OLLAMA_FLASH_ATTENTION=1"
# Environment="OLLAMA_NOHISTORY=1"
```
```
(base) root@prd-gpu-1-180:/eam/aiops/nodes/dify_plugins/search_toolkit# ollama ps
NAME               ID              SIZE      PROCESSOR    UNTIL
yi:34b-v1.5        ff94bc7c1b7a    27 GB     100% GPU     3 hours from now
starcoder2:3b      f67ae0f64584    4.1 GB    100% GPU     3 hours from now

(base) root@prd-gpu-1-180:/eam/aiops/nodes/dify_plugins/search_toolkit# ollama ps
NAME               ID              SIZE      PROCESSOR    UNTIL
codestral:latest   726512da210d    417 GB    100% CPU     29 minutes from now
yi:34b-v1.5        ff94bc7c1b7a    27 GB     100% GPU     3 hours from now
```
Interestingly, I found that this only happens with some models:
```
(base) root@prd-gpu-1-180:~# ollama ps
NAME                  ID              SIZE      PROCESSOR    UNTIL
deepseek-coder:6.7b   ce298d984115    102 GB    100% GPU     29 minutes from now
codeqwen:7b           a6f7662764bd    490 GB    100% CPU     22 minutes from now
starcoder2:3b         f67ae0f64584    6.0 GB    100% GPU     29 minutes from now
yi:34b-v1.5           ff94bc7c1b7a    27 GB     100% GPU     3 hours from now
```
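The pattern above can be spotted automatically. Here is a minimal sketch that parses `ollama ps` output and flags models whose layers ended up on the CPU; the `parse_ollama_ps` helper and the embedded sample are my own illustration (not an Ollama API), and it assumes the simple `100% GPU` / `100% CPU` column format shown above rather than mixed splits like `50%/50% CPU/GPU`:

```python
# Parse `ollama ps` output and flag models not fully offloaded to the GPU.
# parse_ollama_ps and the sample text are illustrative, not part of Ollama.

def parse_ollama_ps(text: str) -> list[dict]:
    """Return one dict per model row of `ollama ps` output."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header line
        parts = line.split()
        # Columns: NAME ID <size> <unit> <pct> <GPU|CPU> UNTIL...
        rows.append({
            "name": parts[0],
            "id": parts[1],
            "size": f"{parts[2]} {parts[3]}",
            "processor": f"{parts[4]} {parts[5]}",
        })
    return rows

def cpu_bound(rows: list[dict]) -> list[str]:
    """Names of models whose processor column mentions the CPU."""
    return [r["name"] for r in rows if "CPU" in r["processor"]]

sample = """\
NAME ID SIZE PROCESSOR UNTIL
deepseek-coder:6.7b ce298d984115 102 GB 100% GPU 29 minutes from now
codeqwen:7b a6f7662764bd 490 GB 100% CPU 22 minutes from now
yi:34b-v1.5 ff94bc7c1b7a 27 GB 100% GPU 3 hours from now
"""

print(cpu_bound(parse_ollama_ps(sample)))  # → ['codeqwen:7b']
```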
That's odd. We don't set the `num_gpu` parameter in our request, but you could do this with `requestOptions.extraBodyParameters` in config: https://docs.continue.dev/reference/config
These are what we send by default: https://github.com/continuedev/continue/blob/main/core/llm/llms/Ollama.ts#L130-L139
Does anything here stand out as a potential solution?
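If forcing GPU offload helps, a config along these lines might be worth trying. This is only a sketch: the model title and the `num_gpu` value are placeholders, and whether `extraBodyParameters` merges cleanly with the `options` that `Ollama.ts` already sends is an assumption worth verifying:

```json
{
  "models": [
    {
      "title": "Yi 34B (Ollama)",
      "provider": "ollama",
      "model": "yi:34b-v1.5",
      "requestOptions": {
        "extraBodyParameters": {
          "options": { "num_gpu": 60 }
        }
      }
    }
  ]
}
```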
Relevant environment info
Description
When the configuration of a model object is written like this, what is the value of `num_gpu` in the request parameters? Is it still unset? I found that each request makes Ollama reload the model and then run inference entirely on the CPU (PROCESSOR shows 100% CPU), i.e., none of the network layers are offloaded to the GPU.
In other words, even though the model is already loaded in Ollama with a 100% GPU processor, this request still causes Ollama to reload it.
As a result, inference is very, very slow.
To reproduce
No response
Log output
No response