abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License
8.13k stars 967 forks

GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 behavior is strange. #1720

Open Enchante503 opened 2 months ago

Enchante503 commented 2 months ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

VRAM should be used first, with shared memory used only once VRAM is exhausted, and inference should remain fast.

Current Behavior

When export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 is set, system RAM is used first instead of VRAM, and the specified GPU is not prioritized.

llama_print_timings: total time = 56361.73 ms / 45 tokens

Removing the option makes it much faster:

llama_print_timings: total time = 40.95 ms / 143 tokens
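
A minimal sketch of the comparison being described, assuming the standard llama_cpp.Llama API; the model path and prompt are placeholders, and the timings above came from the reporter's own model:

```python
# Sketch of the comparison: the variable is set before the model is loaded
# so the CUDA backend can see it when it allocates buffers. The fast case
# above corresponds to the variable not being set at all, so comment the
# os.environ line out to reproduce it. Exporting the variable in the shell
# before launching Python is the safer test if there is any doubt about
# when the backend reads the environment.
import os
import time

os.environ["GGML_CUDA_ENABLE_UNIFIED_MEMORY"] = "1"

from llama_cpp import Llama  # import/load after the variable is set

llm = Llama(
    model_path="models/placeholder.gguf",  # placeholder path
    n_gpu_layers=-1,                       # offload all layers to the GPU
    verbose=True,                          # prints llama_print_timings
)

start = time.perf_counter()
out = llm("Hello", max_tokens=64)
print(out["choices"][0]["text"])
print(f"wall time: {time.perf_counter() - start:.2f} s")
```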

Environment and Context

Windows 11, WSL2, Ubuntu 22.04.4 LTS
CUDA 12.1

Python 3.10.11
GNU Make 4.3 (x86_64-pc-linux-gnu)
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

kripper commented 2 days ago

Please provide logs.

In the current version, Ollama uses only RAM when VRAM is less than a minimum size, and then computes everything exclusively on the CPU (the GPU is not used at all).

Enchante503 commented 1 day ago

Ollama is not being used; I'm talking about llama-cpp-python. VRAM is 24 GB, and the logs are as shown in the report.

kripper commented 1 day ago

Maybe this behavior is implemented in llama.cpp (which is also used by Ollama).
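
If that is the case, one way to narrow it down would be to run the same prompt through a plain llama.cpp CLI build with and without the variable and compare the timings. This is only a sketch; the binary name and model path are placeholders for a local build:

```python
# Hypothetical A/B check against a llama.cpp CLI build (binary and model
# paths are placeholders). The variable is removed entirely for the
# "disabled" run, matching the report's "hiding the option" case.
import os
import subprocess
import time

CMD = ["./llama-cli", "-m", "models/placeholder.gguf",
       "-p", "Hello", "-n", "64", "-ngl", "99"]

for unified in (False, True):
    env = {k: v for k, v in os.environ.items()
           if k != "GGML_CUDA_ENABLE_UNIFIED_MEMORY"}
    if unified:
        env["GGML_CUDA_ENABLE_UNIFIED_MEMORY"] = "1"
    start = time.perf_counter()
    subprocess.run(CMD, env=env, check=True, capture_output=True)
    print(f"unified memory {'on' if unified else 'off'}: "
          f"{time.perf_counter() - start:.1f} s wall time")
```

If the slowdown reproduces there, the issue would belong upstream in llama.cpp rather than in the bindings.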