To test on the Pi's CPU instead of the GPU, I ran the same commands as above without `-ngl 33`.
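For reference, a CPU-only llama-bench run can also be forced explicitly by setting `-ngl 0` so no layers are offloaded (a sketch using the same model file as the GPU runs below, not the exact command from the original post):

# CPU-only benchmark: -ngl 0 keeps all layers off the GPU
./build/bin/llama-bench -m models/llama-3.2-1b-instruct-Q4_K_M.gguf -n 128 -p 512 -ngl 0 -r 2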
Also, over on Reddit, user u/kryptkpr suggested using ollama-bench.
Idle power of Pi 5 + RX 6500 XT (monitor turned off, USB SSD plugged in): 11W
pi@pi5-pcie:~/Downloads/llama.cpp $ ./build/bin/llama-bench -m models/llama-3.2-1b-instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | pp512 | 1079.41 ± 1.12 |
llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | pp4096 | 879.92 ± 0.66 |
llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | tg128 | 70.75 ± 0.14 |
llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | pp4096+tg128 | 548.34 ± 6.59 |
Consumed around 105W on average during the tests.
Idle power of Pi 5 + RX 6700 XT (monitor turned off, USB SSD plugged in): 11.7W
pi@pi5-pcie:~/Downloads/llama.cpp $ ./build/bin/llama-bench -m models/llama-3.2-1b-instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | pp512 | 20.05 ± 0.05 |
llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | pp4096 | 1356.21 ± 59.36 |
llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | tg128 | 77.21 ± 0.19 |
llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | pp4096+tg128 | 824.60 ± 1.40 |
Consumed around 172W on average during the tests.
pi@pi5-pcie:~/Downloads/llama.cpp $ ./build/bin/llama-bench -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | Vulkan | 99 | pp512 | 779.05 ± 68.94 |
llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | Vulkan | 99 | pp4096 | 650.14 ± 1.00 |
llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | Vulkan | 99 | tg128 | 51.71 ± 3.02 |
llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | Vulkan | 99 | pp4096+tg128 | 406.28 ± 2.45 |
Consumed around 195W on average during the tests.
Idle power of Pi 5 + W7700 (monitor turned off, USB SSD plugged in): 19.1W
pi@pi5-pcie:~/Downloads/llama.cpp $ ./build/bin/llama-bench -m models/llama-3.2-1b-instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | pp512 | 33.27 ± 0.01 |
llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | pp4096 | 1758.46 ± 3.79 |
llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | tg128 | 84.54 ± 0.15 |
llama 1B Q4_K - Medium | 762.81 MiB | 1.24 B | Vulkan | 99 | pp4096+tg128 | 970.73 ± 7.66 |
Consumed around 95W on average during the tests. (Jumping to 147W for the last bit.)
pi@pi5-pcie:~/Downloads/llama.cpp $ ./build/bin/llama-bench -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | Vulkan | 99 | pp512 | 882.01 ± 20.38 |
llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | Vulkan | 99 | pp4096 | 744.28 ± 3.14 |
llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | Vulkan | 99 | tg128 | 57.54 ± 1.03 |
llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | Vulkan | 99 | pp4096+tg128 | 460.41 ± 3.69 |
Consumed around 174W on average during the tests.
Also tested, as suggested on Reddit, a larger 9GB model, Qwen2.5-14B-Instruct-Q4_K_M.gguf. This was on the RX 6700 XT.
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
qwen2 ?B Q4_K - Medium | 8.37 GiB | 14.77 B | Vulkan | 99 | pp512 | 188.75 ± 0.08 |
qwen2 ?B Q4_K - Medium | 8.37 GiB | 14.77 B | Vulkan | 99 | pp4096 | 166.16 ± 0.04 |
qwen2 ?B Q4_K - Medium | 8.37 GiB | 14.77 B | Vulkan | 99 | tg128 | 21.27 ± 0.27 |
qwen2 ?B Q4_K - Medium | 8.37 GiB | 14.77 B | Vulkan | 99 | pp4096+tg128 | 124.52 ± 0.56 |
Consumed around 205W on average during the tests.
I also tested Mistral-Small-Instruct-2409-Q4_K_M.gguf:
Consumed around 90W on average during the tests.
But after running a number of times, I would get a GPU reset halfway through:
[ 396.254957] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.1.0 timeout, signaled seq=164, emitted seq=167
[ 396.255330] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process llama-bench pid 3805 thread llama-bench pid 3805
[ 396.255661] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[ 396.566445] amdgpu 0000:03:00.0: amdgpu: MODE1 reset
[ 396.566451] amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
[ 396.566523] amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
[ 397.087064] amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 397.087374] [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
[ 397.087536] [drm] VRAM is lost due to GPU reset!
[ 397.087539] [drm] PSP is resuming...
[ 397.183383] [drm] reserve 0xa700000 from 0x83a0000000 for PSP TMR
...
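For anyone reproducing this, the reset shows up in the kernel log as above; a simple way to watch for it while a benchmark is running (my own addition, not from the original report):

# Follow the kernel log in a second terminal and filter for amdgpu messages
sudo dmesg -w | grep -i amdgpu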
Did some more power measurements and also jotted down the idle power consumption for various cards. For example, the RX 6700 XT (total system power at the wall: Pi 5 8GB + Pi 5 PSU + SFX 750W PSU + AMD RX 6700 XT, idle, booted into Linux with no monitor attached): 11.4W
There is an open PR to add Vulkan support to Ollama, but seemingly no interest from the Ollama maintainers in merging it. Might be worth trying out for Home Assistant and VS Code.
Testing that PR:
# Get the code
git clone https://github.com/ollama/ollama.git
cd ollama
git fetch origin pull/5059/head:vulkan-5059
git checkout vulkan-5059
# Install Go
cd ..
wget https://go.dev/dl/go1.23.3.linux-arm64.tar.gz
sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.23.3.linux-arm64.tar.gz
# Build Ollama
sudo apt install -y libcap-dev
go generate ./...
go build .
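One gap in the snippet above: the tarball install doesn't put Go on the PATH, so `go generate` / `go build` won't be found in a fresh shell. The standard fix (not part of the original steps):

# Make the freshly extracted Go toolchain visible in this shell
export PATH=$PATH:/usr/local/go/bin
go version   # should report go1.23.3 linux/arm64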
Note: I had to run `go generate ./...` before the `go build .` to prep the repo with llama.cpp... but that is not using my custom llama.cpp build with Vulkan support from earlier. I'll see how it runs, then swap them out if necessary. First time building Ollama from source :)
I'm getting:
[100%] Built target ollama_llama_server
+ rm -f ../build/linux/arm64/vulkan/bin/ggml-common.h ../build/linux/arm64/vulkan/bin/ggml-metal.metal
++ ldd ../build/linux/arm64/vulkan/bin/ollama_llama_server
++ grep '=>'
++ cut -f2 -d=
++ cut -f2 '-d '
++ grep -e vulkan -e cap
+ cp '/usr/lib//libvulkan.so*' ../build/linux/arm64/vulkan/bin/
cp: cannot stat '/usr/lib//libvulkan.so*': No such file or directory
llm/generate/generate_linux.go:3: running "bash": exit status 1
Looks like the code expects x86_64 and isn't accounting for Arm (aarch64) in one of the file paths. Mentioned in https://github.com/ollama/ollama/pull/5059#discussion_r1851037921
But even with the following patch:
diff --git a/gpu/gpu_linux.go b/gpu/gpu_linux.go
index 76df6326..d6f882ef 100644
--- a/gpu/gpu_linux.go
+++ b/gpu/gpu_linux.go
@@ -53,12 +53,12 @@ var (
)
var VulkanGlobs = []string{
- "/usr/lib/x86_64-linux-gnu/libvulkan.so*",
+ "/usr/lib/aarch64-linux-gnu/libvulkan.so*",
"/usr/lib*/libvulkan.so*",
}
var capLinuxGlobs = []string{
- "/usr/lib/x86_64-linux-gnu/libcap.so*",
+ "/usr/lib/aarch64-linux-gnu/libcap.so*",
"/usr/lib*/libcap.so*",
}
diff --git a/llm/llama.cpp b/llm/llama.cpp
index 8962422b..b46a372e 160000
--- a/llm/llama.cpp
+++ b/llm/llama.cpp
I can't get it to compile. It still gets stuck as above; I'm guessing something is being cached, or I'm missing where exactly it's calling for the file copy.
`git clean -fx` is my go-to solution for build cache problems.
I'm kind of in a similar boat trying to get Home Assistant and VS Code talking to my Intel Arc, just minus the Arm complication.
The Continue VS Code extension can work with OpenAI-compatible APIs like llama.cpp and vLLM.
Home Assistant repeatedly closes PRs that let you set the base URL of the OpenAI integration, but there is a fork on HACS that supports this.
I haven't tested this, but if I get anything Vulkan-driven going I'll let you know.
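For anyone wanting to try that route, a minimal sketch of pointing an OpenAI-compatible client at llama.cpp (this assumes the `llama-server` binary from the same Vulkan build and its default port 8080; none of this is from the comment above):

# Serve the model over llama.cpp's OpenAI-compatible HTTP API
./build/bin/llama-server -m models/llama-3.2-1b-instruct-Q4_K_M.gguf -ngl 99 --port 8080

# Then any OpenAI-compatible client (Continue, etc.) can talk to it, e.g.:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'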
Found `VULKAN_ROOT` in the PR, trying with:
VULKAN_ROOT=/usr/lib/aarch64-linux-gnu go generate ./...
Very weird.
...
[100%] Linking CXX executable ../bin/ollama_llama_server
[100%] Built target ollama_llama_server
+ rm -f ../build/linux/arm64/vulkan/bin/ggml-common.h ../build/linux/arm64/vulkan/bin/ggml-metal.metal
++ ldd ../build/linux/arm64/vulkan/bin/ollama_llama_server
++ grep '=>'
++ cut -f2 -d=
++ grep -e vulkan -e cap
++ cut -f2 '-d '
+ cp '/usr/lib/aarch64-linux-gnu/libvulkan.so*' ../build/linux/arm64/vulkan/bin/
cp: cannot stat '/usr/lib/aarch64-linux-gnu/libvulkan.so*': No such file or directory
llm/generate/generate_linux.go:3: running "bash": exit status 1
$ ls /usr/lib/aarch64-linux-gnu/libvulkan.so*
/usr/lib/aarch64-linux-gnu/libvulkan.so /usr/lib/aarch64-linux-gnu/libvulkan.so.1.3.239
/usr/lib/aarch64-linux-gnu/libvulkan.so.1
Heh, I removed the double quotes around the `cp` in the PR, and now getting:
+ cp /usr/lib/aarch64-linux-gnu/libvulkan.so /usr/lib/aarch64-linux-gnu/libvulkan.so.1 /usr/lib/aarch64-linux-gnu/libvulkan.so.1.3.239 ../build/linux/arm64/vulkan/bin/
+ cp '/usr/lib//libcap.so*' ../build/linux/arm64/vulkan/bin/
cp: cannot stat '/usr/lib//libcap.so*': No such file or directory
llm/generate/generate_linux.go:3: running "bash": exit status 1
I've removed the quotes around the libcap.so copy task as well, and am trying:
CAP_ROOT=/usr/lib/aarch64-linux-gnu VULKAN_ROOT=/usr/lib/aarch64-linux-gnu go generate ./...
Trying this comment right now: https://github.com/ollama/ollama/pull/5059#issuecomment-2377129985
I was also glancing at LM Studio; right now they don't build for arm64 Linux: https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/198
Seems like it wouldn't be too difficult to add, but they supposedly have Vulkan support baked in already.
Ah, if I run `ollama serve`:
time=2024-11-21T09:59:19.314-06:00 level=INFO source=gpu.go:233 msg="looking for compatible GPUs"
time=2024-11-21T09:59:19.332-06:00 level=WARN source=amd_linux.go:60 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-11-21T09:59:19.332-06:00 level=INFO source=amd_linux.go:361 msg="no compatible amdgpu devices detected"
time=2024-11-21T09:59:19.333-06:00 level=ERROR source=amd_linux.go:364 msg="amdgpu devices detected but permission problems block access" error="kfd driver not loaded. If running in a container, remember to include '--device /dev/kfd --device /dev/dri'"
time=2024-11-21T09:59:19.333-06:00 level=INFO source=gpu.go:414 msg="no compatible GPUs were discovered"
And if I run with `sudo`:
sudo env OLLAMA_LLM_LIBRARY=vulkan /usr/bin/ollama serve
...
time=2024-11-21T10:00:15.723-06:00 level=INFO source=gpu.go:233 msg="looking for compatible GPUs"
time=2024-11-21T10:00:16.291-06:00 level=INFO source=gpu.go:391 msg="error looking up vulkan GPU memory" error="device is a CPU"
time=2024-11-21T10:00:16.291-06:00 level=WARN source=amd_linux.go:60 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-11-21T10:00:16.291-06:00 level=INFO source=amd_linux.go:361 msg="no compatible amdgpu devices detected"
time=2024-11-21T10:00:16.291-06:00 level=ERROR source=amd_linux.go:364 msg="amdgpu devices detected but permission problems block access" error="kfd driver not loaded. If running in a container, remember to include '--device /dev/kfd --device /dev/dri'"
time=2024-11-21T10:00:16.291-06:00 level=INFO source=gpu.go:414 msg="no compatible GPUs were discovered"
time=2024-11-21T10:00:16.293-06:00 level=INFO source=types.go:114 msg="inference compute" id=0 library=vulkan variant="" compute=1.3 driver=1.3 name="AMD Radeon RX 6700 XT (RADV NAVI22)" total="12.0 GiB" available="11.9 GiB"
time=2024-11-21T10:00:16.293-06:00 level=INFO source=types.go:114 msg="inference compute" id=1 library=vulkan variant="" compute=1.2 driver=1.2 name="V3D 7.1.7" total="4.0 GiB" available="4.0 GiB"
And now, testing `llama3.1:8b`, it seems to bail out (I don't see any activity in `nvtop` either):
time=2024-11-21T10:08:10.951-06:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe gpu=0 parallel=4 available=12118605414 required="5.8 GiB"
time=2024-11-21T10:08:10.951-06:00 level=INFO source=server.go:103 msg="system memory" total="7.8 GiB" free="7.4 GiB" free_swap="9.5 MiB"
time=2024-11-21T10:08:10.953-06:00 level=INFO source=memory.go:326 msg="offload to vulkan" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[11.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.8 GiB" memory.required.partial="5.8 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[5.8 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-11-21T10:08:10.953-06:00 level=INFO source=server.go:169 msg="Invalid OLLAMA_LLM_LIBRARY vulkan - not found"
time=2024-11-21T10:08:10.954-06:00 level=INFO source=server.go:388 msg="starting llama server" cmd="/tmp/ollama943665785/runners/cpu/ollama_llama_server --model /root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --parallel 4 --port 45499"
time=2024-11-21T10:08:10.954-06:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-11-21T10:08:10.954-06:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
/tmp/ollama943665785/runners/cpu/ollama_llama_server: error while loading shared libraries: libllama.so: cannot open shared object file: No such file or directory
time=2024-11-21T10:08:10.955-06:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
time=2024-11-21T10:08:11.205-06:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: exit status 127"
[GIN] 2024/11/21 - 10:08:11 | 500 | 387.89312ms | 127.0.0.1 | POST "/api/generate"
time=2024-11-21T10:08:16.238-06:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.032373357 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-11-21T10:08:16.487-06:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.282000478 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
time=2024-11-21T10:08:16.738-06:00 level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.532274324 model=/root/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe
"Invalid OLLAMA_LLM_LIBRARY vulkan - not found"
so it seems like Ollama did not actually build the Vulkan runner, which I was also facing, but with the patch from the comment above it actually does build. During the build process, look for output that lists runners like `[dummy avx avx2 vulkan]`.
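One quick way to double-check which runners were actually produced, based on the `../build/linux/arm64/...` paths in the build output above (the exact layout is my assumption):

# From the repo root: list the runner variants generated by `go generate ./...`
ls llm/build/linux/arm64/
# A `vulkan` directory should appear alongside the CPU variants if the patch took effect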
@pepijndevos - Ah. I'll try the other patch you mentioned and re-test.
Git said the patch was corrupt after I hand-copied it and tried `git apply -v`, so I hand-edited the file with the changes. I still had to set `CAP_ROOT` and `VULKAN_ROOT` to get it to generate:
CAP_ROOT=/usr/lib/aarch64-linux-gnu VULKAN_ROOT=/usr/lib/aarch64-linux-gnu go generate ./...
Then `go build .` to build `ollama`. Now:
$ sudo env OLLAMA_LLM_LIBRARY=vulkan /usr/bin/ollama serve
...
time=2024-11-21T10:46:17.823-06:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu vulkan]"
time=2024-11-21T10:46:17.823-06:00 level=INFO source=gpu.go:233 msg="looking for compatible GPUs"
time=2024-11-21T10:46:17.841-06:00 level=WARN source=amd_linux.go:60 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-11-21T10:46:17.841-06:00 level=INFO source=amd_linux.go:361 msg="no compatible amdgpu devices detected"
time=2024-11-21T10:46:17.841-06:00 level=ERROR source=amd_linux.go:364 msg="amdgpu devices detected but permission problems block access" error="kfd driver not loaded. If running in a container, remember to include '--device /dev/kfd --device /dev/dri'"
time=2024-11-21T10:46:17.841-06:00 level=INFO source=gpu.go:414 msg="no compatible GPUs were discovered"
time=2024-11-21T10:46:17.841-06:00 level=INFO source=types.go:114 msg="inference compute" id=0 library=cpu variant="no vector extensions" compute="" driver=0.0 name="" total="7.8 GiB" available="7.4 GiB"
And after I try `ollama run`:
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 6700 XT (RADV NAVI22) (radv) | uma: 0 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size = 0.14 MiB
ggml_vulkan: Failed to allocate pinned memory.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 4437.93 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Vulkan_Host KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 2.02 MiB
llama_new_context_with_model: AMD Radeon RX 6700 XT (RADV NAVI22) compute buffer size = 669.48 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 24.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 420
INFO [main] model loaded | tid="548126799808" timestamp=1732207462
time=2024-11-21T10:44:22.858-06:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
time=2024-11-21T10:44:27.619-06:00 level=INFO source=server.go:626 msg="llama runner started in 42.97 seconds"
It seems to try loading some data into VRAM, but fails and then it all ends up back on the CPU.
Could this be the same error as in Note 2 of the OP?
export GGML_VK_FORCE_MAX_ALLOCATION_SIZE=2147483647 # 2GB buffer
export GGML_VK_FORCE_MAX_ALLOCATION_SIZE=1073741824 # 1GB buffer
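Note that since `ollama serve` is being launched through `sudo env` above, an `export` in the regular user shell won't reach it; the variable needs to go on the `sudo env` line itself. A sketch combining the earlier invocation with this setting (assuming Ollama forwards its environment to the runner process):

# Cap Vulkan allocations at 2 GB for the Ollama-spawned runner
sudo env GGML_VK_FORCE_MAX_ALLOCATION_SIZE=2147483647 OLLAMA_LLM_LIBRARY=vulkan /usr/bin/ollama serve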
Ran on the AMD RX 7600 8GB GPU:
Device | CPU/GPU | Model | Speed | Power (Peak) |
---|---|---|---|---|
Pi 5 - 8GB / AMD RX 7600 8GB | GPU | llama3.2:3b | 48.47 Tokens/s | 156 W |
Pi 5 - 8GB / AMD RX 7600 8GB | GPU | llama3.1:8b | 32.60 Tokens/s | 174 W |
Pi 5 - 8GB / AMD RX 7600 8GB | GPU | llama2:13b | 2.42 Tokens/s | 106 W |
Idle power consumption for the entire setup with the 7600 is about 13.8W.
@pepijndevos - Ah, good catch, I've set it to `2147483647` and am testing again...
It looks like the load was more successful, but not completely.
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 7600 (RADV GFX1102) (radv) | uma: 0 | fp16: 1 | warp size: 64
llm_load_tensors: ggml ctx size = 0.12 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/29 layers to GPU
llm_load_tensors: Vulkan_Host buffer size = 2226.70 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Vulkan_Host KV buffer size = 896.00 MiB
llama_new_context_with_model: KV self size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 2.00 MiB
llama_new_context_with_model: AMD Radeon RX 7600 (RADV GFX1102) compute buffer size = 570.73 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 26.01 MiB
llama_new_context_with_model: graph nodes = 902
llama_new_context_with_model: graph splits = 368
INFO [main] model loaded | tid="548182755264" timestamp=1732210292
time=2024-11-21T11:31:32.857-06:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
time=2024-11-21T11:31:36.798-06:00 level=INFO source=server.go:626 msg="llama runner started in 84.69 seconds"
[GIN] 2024/11/21 - 11:31:36 | 200 | 1m24s | 127.0.0.1 | POST "/api/generate"
But it's just hanging when I try sending a message :/
offloaded 0/29 layers to GPU

That's peculiar, since it's effectively still running on CPU. Maybe try `export OLLAMA_NUM_GPU=999` or something along those lines to force it to actually use the GPU (a combined invocation is sketched below).
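A combined invocation might look like this (an untested sketch, reusing the `sudo env` command from earlier in the thread):

# Ask Ollama to offload as many layers as possible to the GPU
sudo env OLLAMA_LLM_LIBRARY=vulkan OLLAMA_NUM_GPU=999 /usr/bin/ollama serve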
@pepijndevos - Still no dice; same result. It never seems to load more than maybe 200-300 MB of data into VRAM before giving up.
To get this to work, first you have to get an external AMD GPU working on Pi OS. The most up-to-date instructions are currently on my website: Get an AMD Radeon 6000/7000-series GPU running on Pi 5.
Once your AMD graphics card is working (and can output video), install dependencies and compile llama.cpp with the Vulkan backend. Then you can download a model (e.g. off HuggingFace) and run it. Sketches of both steps follow below.
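A minimal sketch of the llama.cpp Vulkan build (the exact commands are on the linked page; the dependency package names and CMake flags here are my assumptions, following the standard llama.cpp CMake flow):

# Install build dependencies (package names assumed for Pi OS / Debian)
sudo apt install -y git build-essential cmake libvulkan-dev glslc

# Fetch and build llama.cpp with the Vulkan backend enabled
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j4

And a sketch of downloading a model and running it (the file name matches the model benchmarked earlier in this thread; `-ngl 33` is the layer count referenced at the top):

# Grab a quantized GGUF model into models/ (download URL omitted here; the
# benchmarks above used llama-3.2-1b-instruct-Q4_K_M.gguf)
mkdir -p models

# Run it with layers offloaded to the GPU
./build/bin/llama-cli -m models/llama-3.2-1b-instruct-Q4_K_M.gguf -p "Hello!" -ngl 33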
I want to test with models:
Note: Ollama currently doesn't support Vulkan, and some parts of llama.cpp still assume x86, not Arm or RISC-V.

Note 2: With larger models, you may run into an error like `vk::Device::allocateMemory: ErrorOutOfDeviceMemory`; see bug Vulkan Device memory allocation failed. If so, try scaling the buffer back to 1 or 2 GB of RAM (the exports are quoted after these notes).

Note 3: Power consumption was measured at the wall (total system power draw) using a ThirdReality Zigbee Smart Outlet through Home Assistant. I don't have a way of measuring total energy consumed per test (e.g. Joules), but that would be nice at some point.
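For reference, the buffer-size exports Note 2 refers to are the same ones quoted earlier in the thread:

export GGML_VK_FORCE_MAX_ALLOCATION_SIZE=2147483647 # 2GB buffer
export GGML_VK_FORCE_MAX_ALLOCATION_SIZE=1073741824 # 1GB buffer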