Enchante503 opened 2 months ago
Please provide logs.
In the current version, Ollama uses only RAM when the available VRAM is less than a minimum size, and then computes everything exclusively on the CPU (the GPU is not used at all).
Ollama is not being used; I'm talking about llama-cpp-python. The VRAM is 24 GB, and the logs are shown in the report.
Maybe this behavior is implemented in llama.cpp (which is also used by Ollama).
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Prioritize the use of VRAM, start using shared memory only when VRAM is exceeded, and keep inference fast.
Current Behavior
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1

When this option is set, RAM is used first instead of VRAM, and the specified GPU is not prioritized.
llama_print_timings: total time = 56361.73 ms / 45 tokens
Unsetting the option makes it much faster:
llama_print_timings: total time = 40.95 ms / 143 tokens
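For reference, here is a minimal sketch of how the two cases can be compared from Python. It is not the exact script from the report; the model path, context size, and prompt are placeholders, and it assumes a CUDA build of llama-cpp-python. The only difference between the slow and fast runs is whether GGML_CUDA_ENABLE_UNIFIED_MEMORY is present in the environment before the library is loaded.

```python
# Sketch only: placeholder model path and parameters, not taken from the report.
import os

# Unified memory must be decided before the CUDA backend initializes,
# i.e. before importing llama_cpp. Removing the variable keeps weights
# in VRAM first; setting it to "1" reproduces the slow, RAM-first behavior.
os.environ.pop("GGML_CUDA_ENABLE_UNIFIED_MEMORY", None)
# os.environ["GGML_CUDA_ENABLE_UNIFIED_MEMORY"] = "1"  # slow case

from llama_cpp import Llama

llm = Llama(
    model_path="./model.gguf",  # hypothetical path
    n_gpu_layers=-1,            # offload all layers to the GPU
    n_ctx=4096,
)

out = llm("Hello", max_tokens=64)
print(out["choices"][0]["text"])
```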
Environment and Context