abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License
8.13k stars 967 forks

GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 behavior is strange. #1720

Open Enchante503 opened 2 months ago

Enchante503 commented 2 months ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

VRAM should be used first, with shared memory used only once VRAM is exhausted, and inference should remain fast.

Current Behavior

When export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 is set, system RAM is used first instead of VRAM, and the specified GPU is not prioritized.

llama_print_timings: total time = 56361.73 ms / 45 tokens

Removing the option makes it much faster:

llama_print_timings: total time = 40.95 ms / 143 tokens
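
A minimal sketch of the comparison being described, assuming the standard llama_cpp.Llama API; the model path and prompt are placeholders, and the timings above came from the reporter's own model:

```python
# Sketch of the comparison: the variable is set before the model is loaded
# so the CUDA backend can see it when it allocates buffers. The fast case
# above corresponds to the variable not being set at all, so comment the
# os.environ line out to reproduce it. Exporting the variable in the shell
# before launching Python is the safer test if there is any doubt about
# when the backend reads the environment.
import os
import time

os.environ["GGML_CUDA_ENABLE_UNIFIED_MEMORY"] = "1"

from llama_cpp import Llama  # import/load after the variable is set

llm = Llama(
    model_path="models/placeholder.gguf",  # placeholder path
    n_gpu_layers=-1,                       # offload all layers to the GPU
    verbose=True,                          # prints llama_print_timings
)

start = time.perf_counter()
out = llm("Hello", max_tokens=64)
print(out["choices"][0]["text"])
print(f"wall time: {time.perf_counter() - start:.2f} s")
```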

Environment and Context

Windows 11, WSL2, Ubuntu 22.04.4 LTS
CUDA 12.1

Python 3.10.11
GNU Make 4.3 (x86_64-pc-linux-gnu)
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

kripper commented 2 days ago

Please provide logs.

In the current version, Ollama uses only RAM when VRAM is less than a minimum size, and then computes everything exclusively on the CPU (the GPU is not used at all).

Enchante503 commented 1 day ago

Ollama is not being used; I'm talking about llama-cpp-python. VRAM is 24 GB, and the logs are as shown in the report.

kripper commented 1 day ago

Maybe this behavior is implemented in llama.cpp (which is also used by Ollama).
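
If that is the case, one way to narrow it down would be to run the same prompt through a plain llama.cpp CLI build with and without the variable and compare the timings. This is only a sketch; the binary name and model path are placeholders for a local build:

```python
# Hypothetical A/B check against a llama.cpp CLI build (binary and model
# paths are placeholders). The variable is removed entirely for the
# "disabled" run, matching the report's "hiding the option" case.
import os
import subprocess
import time

CMD = ["./llama-cli", "-m", "models/placeholder.gguf",
       "-p", "Hello", "-n", "64", "-ngl", "99"]

for unified in (False, True):
    env = {k: v for k, v in os.environ.items()
           if k != "GGML_CUDA_ENABLE_UNIFIED_MEMORY"}
    if unified:
        env["GGML_CUDA_ENABLE_UNIFIED_MEMORY"] = "1"
    start = time.perf_counter()
    subprocess.run(CMD, env=env, check=True, capture_output=True)
    print(f"unified memory {'on' if unified else 'off'}: "
          f"{time.perf_counter() - start:.1f} s wall time")
```

If the slowdown reproduces there, the issue would belong upstream in llama.cpp rather than in the bindings.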