geerlingguy / ollama-benchmark

Simple ollama benchmarking tool.
MIT License

Benchmark AMD GPUs on Raspberry Pi 5 #1

Open geerlingguy opened 3 days ago

geerlingguy commented 3 days ago

To get this to work, first you have to get an external AMD GPU working on Pi OS. The most up-to-date instructions are currently on my website: Get an AMD Radeon 6000/7000-series GPU running on Pi 5.

Once your AMD graphics card is working (and can output video), install dependencies and compile llama.cpp with the Vulkan backend:

# Install Vulkan SDK, glslc, and cmake
sudo apt install -y libvulkan-dev glslc cmake

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build with Vulkan
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release
# Test the output binary (with "-ngl 33" to offload all layers to GPU)
./build/bin/llama-cli -m "PATH_TO_MODEL" -p "Hi you how are you" -n 50 -e -ngl 33 -t 4

# In the output you should see that ggml_vulkan detected your GPU. For example:
# ggml_vulkan: Found 1 Vulkan devices:
# ggml_vulkan: 0 = AMD Radeon RX 6700 XT (RADV NAVI22) (radv) | uma: 0 | fp16: 1 | warp size: 64
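
Before (or instead of) a full llama-cli run, it can help to confirm the Vulkan driver sees the card at all. A quick sanity check, assuming the vulkan-tools package (not part of the steps above):

# Optional sanity check: list the Vulkan devices the driver exposes
sudo apt install -y vulkan-tools
vulkaninfo --summary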

Then you can download a model (e.g. off HuggingFace) and run it:

# Download llama3.2:3b
cd models && wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf

# Run it.
cd ../
./build/bin/llama-cli -m "models/Llama-3.2-3B-Instruct-Q4_K_M.gguf" -p "Why is the blue sky blue?" -n 50 -e -ngl 33 -t 4

Models I want to test, with results so far:

| Device | CPU/GPU | Model | Speed | Power (Peak) |
| --- | --- | --- | --- | --- |
| Pi 5 - 8GB | CPU | llama3.2:3b | 4.61 Tokens/s | 13.9 W |
| Pi 5 - 8GB | CPU | llama3.1:8b | 1.99 Tokens/s | 13.2 W |
| Pi 5 - 8GB | CPU | llama2:13b | DNF | DNF |
| Pi 5 - 8GB / AMD RX 6500 XT 8GB | GPU | llama3.2:3b | 39.82 Tokens/s | 88 W |
| Pi 5 - 8GB / AMD RX 6500 XT 8GB | GPU | llama3.1:8b | 22.42 Tokens/s | 95.7 W |
| Pi 5 - 8GB / AMD RX 6500 XT 8GB | GPU | llama2:13b | 2.03 Tokens/s | 48.3 W |
| Pi 5 - 8GB / AMD RX 6700 XT 12GB | GPU | llama3.2:3b | 49.01 Tokens/s | 94 W |
| Pi 5 - 8GB / AMD RX 6700 XT 12GB | GPU | llama3.1:8b | 39.70 Tokens/s | 135 W |
| Pi 5 - 8GB / AMD RX 6700 XT 12GB | GPU | llama2:13b | 3.98 Tokens/s | 95 W |
| Pi 5 - 8GB / AMD RX 7600 8GB | GPU | llama3.2:3b | 48.47 Tokens/s | 156 W |
| Pi 5 - 8GB / AMD RX 7600 8GB | GPU | llama3.1:8b | 32.60 Tokens/s | 174 W |
| Pi 5 - 8GB / AMD RX 7600 8GB | GPU | llama2:13b | 2.42 Tokens/s | 106 W |
| Pi 5 - 8GB / AMD Radeon Pro W7700 16GB | GPU | llama3.2:3b | 56.14 Tokens/s | 145 W |
| Pi 5 - 8GB / AMD Radeon Pro W7700 16GB | GPU | llama3.1:8b | 39.87 Tokens/s | 52 W |
| Pi 5 - 8GB / AMD Radeon Pro W7700 16GB | GPU | llama2:13b | 4.38 Tokens/s | 108 W |

Note: Ollama currently doesn't support Vulkan, and some parts of llama.cpp still assume x86 rather than Arm or RISC-V.

Note 2: With larger models, you may run into an error like vk::Device::allocateMemory: ErrorOutOfDeviceMemory (see the llama.cpp bug "Vulkan Device memory allocation failed"). If so, try scaling the buffer back to 1 or 2 GB:

export GGML_VK_FORCE_MAX_ALLOCATION_SIZE=2147483647  # 2GB buffer
export GGML_VK_FORCE_MAX_ALLOCATION_SIZE=1073741824  # 1GB buffer
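
For a single run, the cap can also be set inline rather than exported, reusing the llama-cli invocation from above:

# Same llama-cli command as earlier, with a 1GB allocation cap for this run only
GGML_VK_FORCE_MAX_ALLOCATION_SIZE=1073741824 ./build/bin/llama-cli -m "models/Llama-3.2-3B-Instruct-Q4_K_M.gguf" -p "Why is the blue sky blue?" -n 50 -e -ngl 33 -t 4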

Note 3: Power consumption measured at the wall (total system power draw) using a ThirdReality Zigbee Smart Outlet through Home Assistant. I don't have a way of measuring total energy consumed per test (e.g. Joules) but that would be nice at some point.
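
Until per-test energy logging exists, a back-of-the-envelope estimate is average wall power multiplied by run time. For example (illustrative numbers only, not a measurement):

# ~94 W average over a 60-second run:
echo "94 * 60" | bc   # 5640 J, roughly 1.57 Wh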

geerlingguy commented 3 days ago

To test on the Pi's CPU instead of GPU, I ran the same commands as above without -ngl 33.

geerlingguy commented 3 days ago

Also, over on Reddit, user u/kryptkpr suggested using llama-bench (the benchmarking tool that ships with llama.cpp).

RX 6500 XT

Idle power of Pi 5 + RX 6500 XT (monitor turned off, USB SSD plugged in): 11W

pi@pi5-pcie:~/Downloads/llama.cpp $ ./build/bin/llama-bench -m models/llama-3.2-1b-instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | pp512 | 1079.41 ± 1.12 |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | pp4096 | 879.92 ± 0.66 |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | tg128 | 70.75 ± 0.14 |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | pp4096+tg128 | 548.34 ± 6.59 |

Consumed around 105W on average during the tests.

RX 6700 XT

Idle power of Pi 5 + RX 6700 XT (monitor turned off, USB SSD plugged in): 11.7W

pi@pi5-pcie:~/Downloads/llama.cpp $ ./build/bin/llama-bench -m models/llama-3.2-1b-instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | pp512 | 20.05 ± 0.05 |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | pp4096 | 1356.21 ± 59.36 |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | tg128 | 77.21 ± 0.19 |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | pp4096+tg128 | 824.60 ± 1.40 |

Consumed around 172W on average during the tests.

pi@pi5-pcie:~/Downloads/llama.cpp $ ./build/bin/llama-bench -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | Vulkan | 99 | pp512 | 779.05 ± 68.94 |
| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | Vulkan | 99 | pp4096 | 650.14 ± 1.00 |
| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | Vulkan | 99 | tg128 | 51.71 ± 3.02 |
| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | Vulkan | 99 | pp4096+tg128 | 406.28 ± 2.45 |

Consumed around 195W on average during the tests.

Radeon Pro W7700

Idle power of Pi 5 + W7700 (monitor turned off, USB SSD plugged in): 19.1W

pi@pi5-pcie:~/Downloads/llama.cpp $ ./build/bin/llama-bench -m models/llama-3.2-1b-instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | pp512 | 33.27 ± 0.01 |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | pp4096 | 1758.46 ± 3.79 |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | tg128 | 84.54 ± 0.15 |
| llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | pp4096+tg128 | 970.73 ± 7.66 |

Consumed around 95W on average during the tests. (Jumping to 147W for the last bit.)

pi@pi5-pcie:~/Downloads/llama.cpp $ ./build/bin/llama-bench -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | Vulkan | 99 | pp512 | 882.01 ± 20.38 |
| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | Vulkan | 99 | pp4096 | 744.28 ± 3.14 |
| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | Vulkan | 99 | tg128 | 57.54 ± 1.03 |
| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | Vulkan | 99 | pp4096+tg128 | 460.41 ± 3.69 |

Consumed around 174W on average during the tests.

geerlingguy commented 3 days ago

Also tested, as suggested on Reddit, a larger 9GB model, Qwen2.5-14B-Instruct-Q4_K_M.gguf. This was on the RX 6700 XT.

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 ?B Q4_K - Medium | 8.37 GiB | 14.77 B | Vulkan | 99 | pp512 | 188.75 ± 0.08 |
| qwen2 ?B Q4_K - Medium | 8.37 GiB | 14.77 B | Vulkan | 99 | pp4096 | 166.16 ± 0.04 |
| qwen2 ?B Q4_K - Medium | 8.37 GiB | 14.77 B | Vulkan | 99 | tg128 | 21.27 ± 0.27 |
| qwen2 ?B Q4_K - Medium | 8.37 GiB | 14.77 B | Vulkan | 99 | pp4096+tg128 | 124.52 ± 0.56 |

Consumed around 205W on average during the tests.

geerlingguy commented 3 days ago

And again, tested Mistral-Small-Instruct-2409-Q4_K_M.gguf:

Consumed around 90W on average during the tests.

But after running a number of times, I would get a GPU reset halfway through:

[  396.254957] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.1.0 timeout, signaled seq=164, emitted seq=167
[  396.255330] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process llama-bench pid 3805 thread llama-bench pid 3805
[  396.255661] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[  396.566445] amdgpu 0000:03:00.0: amdgpu: MODE1 reset
[  396.566451] amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
[  396.566523] amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
[  397.087064] amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
[  397.087374] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[  397.087536] [drm] VRAM is lost due to GPU reset!
[  397.087539] [drm] PSP is resuming...
[  397.183383] [drm] reserve 0xa700000 from 0x83a0000000 for PSP TMR
...
geerlingguy commented 3 days ago

Did some more power measurements and also jotted down the idle power consumption for various cards. For example, the RX 6700 XT (total system power at the wall: Pi 5 8GB + Pi 5 PSU + SFX 750W PSU + AMD RX 6700 XT, idle, booted into Linux with no monitor attached) draws 11.4W.

(Screenshot attached: power measurements, 2024-11-18.)
pepijndevos commented 2 days ago

There is an open PR to add Vulkan support to Ollama, but seemingly no interest from the Ollama team in merging it. Might be worth trying out for Home Assistant and VS Code:

https://github.com/ollama/ollama/pull/5059

geerlingguy commented 1 day ago

Testing that PR:

# Get the code
git clone https://github.com/ollama/ollama.git
cd ollama
git fetch origin pull/5059/head:vulkan-5059
git checkout vulkan-5059

# Install Go
cd ..
wget https://go.dev/dl/go1.23.3.linux-arm64.tar.gz
sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.23.3.linux-arm64.tar.gz
export PATH=$PATH:/usr/local/go/bin  # make the new Go toolchain visible in this shell

# Build Ollama
cd ollama
sudo apt install -y libcap-dev
go generate ./...
go build .

Note: I had to run go generate ./... before the go build . to prep the repo with llama.cpp, but that does not use my custom llama.cpp build with Vulkan support from earlier. I'll see how it runs, then swap it out if necessary. First time building Ollama from source :)
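
If that swap turns out to be necessary, the vendored copy lives in the llm/llama.cpp submodule. A rough, untested sketch of what it might look like (Ollama's generate scripts apply their own patches, so a different checkout may not build cleanly):

# Point the vendored llama.cpp submodule at a different checkout, then regenerate
cd llm/llama.cpp
git fetch origin
git checkout <commit-or-branch-with-vulkan-fixes>   # placeholder, not a specific commit
cd ../..
go generate ./...
go build .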

geerlingguy commented 1 day ago

I'm getting:

[100%] Built target ollama_llama_server
+ rm -f ../build/linux/arm64/vulkan/bin/ggml-common.h ../build/linux/arm64/vulkan/bin/ggml-metal.metal
++ ldd ../build/linux/arm64/vulkan/bin/ollama_llama_server
++ grep '=>'
++ cut -f2 -d=
++ cut -f2 '-d '
++ grep -e vulkan -e cap
+ cp '/usr/lib//libvulkan.so*' ../build/linux/arm64/vulkan/bin/
cp: cannot stat '/usr/lib//libvulkan.so*': No such file or directory
llm/generate/generate_linux.go:3: running "bash": exit status 1

Looks like the code expects x86_64 and doesn't account for Arm (aarch64) in one of the file paths. Mentioned in https://github.com/ollama/ollama/pull/5059#discussion_r1851037921

But even with the following patch:

diff --git a/gpu/gpu_linux.go b/gpu/gpu_linux.go
index 76df6326..d6f882ef 100644
--- a/gpu/gpu_linux.go
+++ b/gpu/gpu_linux.go
@@ -53,12 +53,12 @@ var (
 )

 var VulkanGlobs = []string{
-       "/usr/lib/x86_64-linux-gnu/libvulkan.so*",
+       "/usr/lib/aarch64-linux-gnu/libvulkan.so*",
        "/usr/lib*/libvulkan.so*",
 }

 var capLinuxGlobs = []string{
-       "/usr/lib/x86_64-linux-gnu/libcap.so*",
+       "/usr/lib/aarch64-linux-gnu/libcap.so*",
        "/usr/lib*/libcap.so*",
 }

diff --git a/llm/llama.cpp b/llm/llama.cpp
index 8962422b..b46a372e 160000
--- a/llm/llama.cpp
+++ b/llm/llama.cpp

I can't get it to compile. It still gets stuck as above; I'm guessing something is being cached, or I'm missing where exactly it's calling for the file copy.

pepijndevos commented 1 day ago

git clean -fx is my go-to solution for build cache problems.
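
Since llm/llama.cpp is vendored as a git submodule here, cleaning that tree as well might help (a guess, not something verified in this thread):

# Clean the main repo and the vendored llama.cpp submodule
git clean -fdx
git submodule foreach --recursive git clean -fdx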

I'm kind of in a similar boat trying to get Home Assistant and VS Code talking to my Intel Arc, just minus the Arm complication.

The Continue VS Code extension can work with OpenAI-compatible APIs like llama.cpp and vLLM.

Home Assistant repeatedly closes PRs that would let you set the base URL of the OpenAI integration, but there is a fork on HACS that supports this.

I haven't tested this, but if I get anything Vulkan-driven going I'll let you know.

geerlingguy commented 1 day ago

Found VULKAN_ROOT in the PR, trying with:

VULKAN_ROOT=/usr/lib/aarch64-linux-gnu go generate ./...
geerlingguy commented 1 day ago

Very weird.

...
[100%] Linking CXX executable ../bin/ollama_llama_server
[100%] Built target ollama_llama_server
+ rm -f ../build/linux/arm64/vulkan/bin/ggml-common.h ../build/linux/arm64/vulkan/bin/ggml-metal.metal
++ ldd ../build/linux/arm64/vulkan/bin/ollama_llama_server
++ grep '=>'
++ cut -f2 -d=
++ grep -e vulkan -e cap
++ cut -f2 '-d '
+ cp '/usr/lib/aarch64-linux-gnu/libvulkan.so*' ../build/linux/arm64/vulkan/bin/
cp: cannot stat '/usr/lib/aarch64-linux-gnu/libvulkan.so*': No such file or directory
llm/generate/generate_linux.go:3: running "bash": exit status 1

$ ls /usr/lib/aarch64-linux-gnu/libvulkan.so*
/usr/lib/aarch64-linux-gnu/libvulkan.so    /usr/lib/aarch64-linux-gnu/libvulkan.so.1.3.239
/usr/lib/aarch64-linux-gnu/libvulkan.so.1
geerlingguy commented 1 day ago

Heh, I removed the double quotes around the cp source glob in the PR's build script, and now I'm getting:

+ cp /usr/lib/aarch64-linux-gnu/libvulkan.so /usr/lib/aarch64-linux-gnu/libvulkan.so.1 /usr/lib/aarch64-linux-gnu/libvulkan.so.1.3.239 ../build/linux/arm64/vulkan/bin/
+ cp '/usr/lib//libcap.so*' ../build/linux/arm64/vulkan/bin/
cp: cannot stat '/usr/lib//libcap.so*': No such file or directory
llm/generate/generate_linux.go:3: running "bash": exit status 1

I've removed the quotes around the libcap.so copy task as well, and am trying:

CAP_ROOT=/usr/lib/aarch64-linux-gnu VULKAN_ROOT=/usr/lib/aarch64-linux-gnu go generate ./...
pepijndevos commented 1 day ago

Trying this comment rn https://github.com/ollama/ollama/pull/5059#issuecomment-2377129985

geerlingguy commented 1 day ago

I was also glancing at LM Studio; right now they don't build for arm64 Linux: https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/198

Seems like it wouldn't be too difficult to add on, but they supposedly have Vulkan support baked in already.

geerlingguy commented 1 day ago

Ah, if I run ollama serve:

time=2024-11-21T09:59:19.314-06:00 level=INFO source=gpu.go:233 msg="looking for compatible GPUs"
time=2024-11-21T09:59:19.332-06:00 level=WARN source=amd_linux.go:60 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-11-21T09:59:19.332-06:00 level=INFO source=amd_linux.go:361 msg="no compatible amdgpu devices detected"
time=2024-11-21T09:59:19.333-06:00 level=ERROR source=amd_linux.go:364 msg="amdgpu devices detected but permission problems block access" error="kfd driver not loaded.  If running in a container, remember to include '--device /dev/kfd --device /dev/dri'"
time=2024-11-21T09:59:19.333-06:00 level=INFO source=gpu.go:414 msg="no compatible GPUs were discovered"

And if I run with sudo:

sudo env OLLAMA_LLM_LIBRARY=vulkan /usr/bin/ollama serve
...
time=2024-11-21T10:00:15.723-06:00 level=INFO source=gpu.go:233 msg="looking for compatible GPUs"
time=2024-11-21T10:00:16.291-06:00 level=INFO source=gpu.go:391 msg="error looking up vulkan GPU memory" error="device is a CPU"
time=2024-11-21T10:00:16.291-06:00 level=WARN source=amd_linux.go:60 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-11-21T10:00:16.291-06:00 level=INFO source=amd_linux.go:361 msg="no compatible amdgpu devices detected"
time=2024-11-21T10:00:16.291-06:00 level=ERROR source=amd_linux.go:364 msg="amdgpu devices detected but permission problems block access" error="kfd driver not loaded.  If running in a container, remember to include '--device /dev/kfd --device /dev/dri'"
time=2024-11-21T10:00:16.291-06:00 level=INFO source=gpu.go:414 msg="no compatible GPUs were discovered"
time=2024-11-21T10:00:16.293-06:00 level=INFO source=types.go:114 msg="inference compute" id=0 library=vulkan variant="" compute=1.3 driver=1.3 name="AMD Radeon RX 6700 XT (RADV NAVI22)" total="12.0 GiB" available="11.9 GiB"
time=2024-11-21T10:00:16.293-06:00 level=INFO source=types.go:114 msg="inference compute" id=1 library=vulkan variant="" compute=1.2 driver=1.2 name="V3D 7.1.7" total="4.0 GiB" available="4.0 GiB"

And now, testing llama3.1:8b, it seems to bail out (I don't see any activity in nvtop either):

time=2024-11-21T10:08:10.951-06:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=0 parallel=4 available=12118605414 required="5.8 GiB"
time=2024-11-21T10:08:10.951-06:00 level=INFO source=server.go:103 msg="system memory" total="7.8 GiB" free="7.4 GiB" free_swap="9.5 MiB"
time=2024-11-21T10:08:10.953-06:00 level=INFO source=memory.go:326 msg="offload to vulkan" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[11.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.8 GiB" memory.required.partial="5.8 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[5.8 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-11-21T10:08:10.953-06:00 level=INFO source=server.go:169 msg="Invalid OLLAMA_LLM_LIBRARY vulkan - not found"
time=2024-11-21T10:08:10.954-06:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama943665785/runners/cpu/ollama_llama_server --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 45499"
time=2024-11-21T10:08:10.954-06:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-11-21T10:08:10.954-06:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
/tmp/ollama943665785/runners/cpu/ollama_llama_server: error while loading shared libraries: libllama.so: cannot open shared object file: No such file or directory
time=2024-11-21T10:08:10.955-06:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
time=2024-11-21T10:08:11.205-06:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: exit status 127"
[GIN] 2024/11/21 - 10:08:11 | 500 |   387.89312ms |       127.0.0.1 | POST     "/api/generate"
time=2024-11-21T10:08:16.238-06:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.032373357 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-11-21T10:08:16.487-06:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.282000478 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-11-21T10:08:16.738-06:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.532274324 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
pepijndevos commented 1 day ago

The "Invalid OLLAMA_LLM_LIBRARY vulkan - not found" message means Ollama did not actually build the Vulkan runner, which I was also facing, but with the patch from the comment above it actually does build. During the build process, look for output that lists runners like [dummy avx avx2 vulkan].

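Another quick check, going by the paths in the build output above (so treat the exact directory as an assumption): the generated runners land under llm/build/linux/arm64/, and a vulkan directory should appear next to the CPU variants.

# If the Vulkan runner built, a vulkan/ directory should be listed here
ls llm/build/linux/arm64/
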
geerlingguy commented 1 day ago

@pepijndevos - Ah. I'll try the other patch you mentioned and re-test.

Git said the patch was corrupt after I hand-copied it and tried git apply -v, so I hand-edited the file with the changes.

I still had to set CAP_ROOT and VULKAN_ROOT to get it to generate:

CAP_ROOT=/usr/lib/aarch64-linux-gnu VULKAN_ROOT=/usr/lib/aarch64-linux-gnu go generate ./...

Then go build . to build ollama. Now:

$ sudo env OLLAMA_LLM_LIBRARY=vulkan /usr/bin/ollama serve
...
time=2024-11-21T10:46:17.823-06:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu vulkan]"
time=2024-11-21T10:46:17.823-06:00 level=INFO source=gpu.go:233 msg="looking for compatible GPUs"
time=2024-11-21T10:46:17.841-06:00 level=WARN source=amd_linux.go:60 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-11-21T10:46:17.841-06:00 level=INFO source=amd_linux.go:361 msg="no compatible amdgpu devices detected"
time=2024-11-21T10:46:17.841-06:00 level=ERROR source=amd_linux.go:364 msg="amdgpu devices detected but permission problems block access" error="kfd driver not loaded.  If running in a container, remember to include '--device /dev/kfd --device /dev/dri'"
time=2024-11-21T10:46:17.841-06:00 level=INFO source=gpu.go:414 msg="no compatible GPUs were discovered"
time=2024-11-21T10:46:17.841-06:00 level=INFO source=types.go:114 msg="inference compute" id=0 library=cpu variant="no vector extensions" compute="" driver=0.0 name="" total="7.8 GiB" available="7.4 GiB"

And after I try ollama run,

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 6700 XT (RADV NAVI22) (radv) | uma: 0 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size =    0.14 MiB
ggml_vulkan: Failed to allocate pinned memory.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  4437.93 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Vulkan_Host KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     2.02 MiB
llama_new_context_with_model: AMD Radeon RX 6700 XT (RADV NAVI22) compute buffer size =   669.48 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 420
INFO [main] model loaded | tid="548126799808" timestamp=1732207462
time=2024-11-21T10:44:22.858-06:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
time=2024-11-21T10:44:27.619-06:00 level=INFO source=server.go:626 msg="llama runner started in 42.97 seconds"

It seems to try loading some data into VRAM, but fails and then it all ends up back on the CPU.

pepijndevos commented 1 day ago

Could this be the same error as in Note 2 of the OP?

export GGML_VK_FORCE_MAX_ALLOCATION_SIZE=2147483647  # 2GB buffer
export GGML_VK_FORCE_MAX_ALLOCATION_SIZE=1073741824  # 1GB buffer
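
Since ollama serve is being launched with sudo env above, an exported variable in the user shell won't reach it; one way (untested for this setup) is to pass it on that same command line:

sudo env OLLAMA_LLM_LIBRARY=vulkan GGML_VK_FORCE_MAX_ALLOCATION_SIZE=2147483647 /usr/bin/ollama serve
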
geerlingguy commented 23 hours ago

Ran on the AMD RX 7600 8GB GPU:

| Device | CPU/GPU | Model | Speed | Power (Peak) |
| --- | --- | --- | --- | --- |
| Pi 5 - 8GB / AMD RX 7600 8GB | GPU | llama3.2:3b | 48.47 Tokens/s | 156 W |
| Pi 5 - 8GB / AMD RX 7600 8GB | GPU | llama3.1:8b | 32.60 Tokens/s | 174 W |
| Pi 5 - 8GB / AMD RX 7600 8GB | GPU | llama2:13b | 2.42 Tokens/s | 106 W |

Idle power consumption for the entire setup with the 7600 is about 13.8W.

geerlingguy commented 23 hours ago

@pepijndevos - Ah, good catch, I've set it to 2147483647 and am testing again...

It looks like the load was more successful, but not completely.

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 7600 (RADV GFX1102) (radv) | uma: 0 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size =    0.12 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors: Vulkan_Host buffer size =  2226.70 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Vulkan_Host KV buffer size =   896.00 MiB
llama_new_context_with_model: KV self size  =  896.00 MiB, K (f16):  448.00 MiB, V (f16):  448.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     2.00 MiB
llama_new_context_with_model: AMD Radeon RX 7600 (RADV GFX1102) compute buffer size =   570.73 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    26.01 MiB
llama_new_context_with_model: graph nodes  = 902
llama_new_context_with_model: graph splits = 368
INFO [main] model loaded | tid="548182755264" timestamp=1732210292
time=2024-11-21T11:31:32.857-06:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
time=2024-11-21T11:31:36.798-06:00 level=INFO source=server.go:626 msg="llama runner started in 84.69 seconds"
[GIN] 2024/11/21 - 11:31:36 | 200 |         1m24s |       127.0.0.1 | POST     "/api/generate"

But it's just hanging when I try sending a message :/

pepijndevos commented 23 hours ago

"offloaded 0/29 layers to GPU" is peculiar, since it's effectively still running on the CPU.

Maybe try export OLLAMA_NUM_GPU=999 or something along those lines to force it to actually use the GPU.
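
If the environment variable doesn't take, the layer count can also be set per request through the num_gpu option (a sketch; llama3.1:8b is just the model from the earlier tests):

# Per-request: ask Ollama to offload all layers
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "options": { "num_gpu": 999 }
}'

# Or, inside an interactive `ollama run llama3.1:8b` session:
# /set parameter num_gpu 999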

geerlingguy commented 19 hours ago

@pepijndevos - Still no dice; same result. It never seems to load more than maybe 200-300 MB of data into VRAM before giving up.