flatsiedatsie opened this issue 10 months ago
Ubuntu doesn't even support the Vulkan Mesa driver you linked yet, so I doubt Tencent and beatmup are using the GPU on RPI5. Vulkan Mesa is for graphics processing. You can't use it with OpenCL to multiply matrices. Even if we rewrote GGML in a shader language, libraries like OpenGL, GLFW, GLEW, etc. all depend on X Windows and can't run headlessly for general computation tasks like linear algebra. Broadcom claims their GPU is capable of general-purpose computation:
Although they are physically located within, and closely coupled to the 3D system, the QPUs are also capable of providing a general-purpose computation resource for non-3D software, such as video codecs and ISP tasks. https://docs.broadcom.com/doc/12358545
The community project that lets Linux users write programs for Broadcom's GPU was abandoned three years ago and no longer builds. https://github.com/wimrijnders/V3DLib If you can show me how to multiply a matrix on this GPU without depending on frameworks, then I'll reopen this issue and strongly consider supporting it.
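For reference, one quick headless sanity check (an editorial suggestion, not from the thread; it assumes the vulkan-tools package is installed) is whether the V3DV driver even advertises a compute-capable queue family, since that is what a headless matrix-multiply shader would be dispatched to:
# Assumption: vulkan-tools provides vulkaninfo; it can run over SSH without X/Wayland
vulkaninfo | grep -iE 'deviceName|compute'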
Thanks for the enlightening explanation. That is both good and bad news. Great that you're also enthusiastic about Raspberry Pi optimization, but sad to hear (and read) that there is so little support for the VideoCore hardware.
Looks like someone actually did rewrite GGML in a shader language. Yesterday ggerganov/llama.cpp#2059 got merged into llama.cpp, which adds Vulkan support and a whole bunch of shaders. This gives me new hope that Raspberry Pi 5 GPU support will be possible. Unfortunately, it doesn't appear possible today. If I build llama.cpp at head with make LLAMA_VULKAN=1 and run TinyLlama Q4_0, then I get this:
jart@pi5:~/llama.cpp$ ./main -e -m ~/TinyLlama-1.1B-Chat-v1.0.Q4_0.gguf -p '# Famous Speech\nFour score and' -n 50
Log start
main: build = 2008 (ceebbb5b)
main: built with cc (Ubuntu 13.2.0-4ubuntu3) 13.2.0 for aarch64-linux-gnu
main: seed = 1706575520
TU: error: ../src/freedreno/vulkan/tu_knl.cc:251: device /dev/dri/renderD128 (v3d) is not compatible with turnip (VK_ERROR_INCOMPATIBLE_DRIVER)
ggml_vulkan: Using V3D 7.1.7 | fp16: 0 | warp size: 16
I'm going to leave this open until we can circle back (possibly in several months to a year), the distro driver situation improves, or someone else leaves a comment here helping us figure out how to do this. In the meantime, please do try this yourself. It's possible I broke my Ubuntu install by using a PPA earlier.
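For anyone hitting the same VK_ERROR_INCOMPATIBLE_DRIVER from turnip, a possible (untested) workaround sketch is to list the installed Vulkan ICD manifests and point the loader at the Broadcom one explicitly. The path and filename below are typical for Mesa on Debian-based distros but are assumptions; substitute whatever the ls shows:
# List installed Vulkan ICD manifests (usual Mesa location; may differ per distro)
ls /usr/share/vulkan/icd.d/
# Force the Broadcom V3DV ICD if a conflicting driver (e.g. turnip from a PPA) is also present
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/broadcom_icd.aarch64.json ./main -m ~/TinyLlama-1.1B-Chat-v1.0.Q4_0.gguf -p '# Famous Speech\nFour score and' -n 50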
Awesome! It seems someone else in that thread also ran into an issue.
I'll attempt building Llamafile from source on the Pi 5 and let you know how it goes.
It compiles and runs.
# Famous Speech\nFour score and seven years ago our
etc
This is on a Pi 5 (8 GB) with the latest Raspberry Pi OS Lite, fully updated/upgraded, and the Mesa Vulkan drivers installed.
sudo apt-get update -y && sudo apt-get upgrade -y
sudo apt-get install libvulkan1 mesa-vulkan-drivers
git clone https://github.com/Mozilla-Ocho/llamafile.git
cd llamafile
make LLAMA_VULKAN=1
./o/llama.cpp/main/main -m YOUR_MODEL_PATH_HERE.gguf -p '# Famous Speech\nFour score and' -n 50
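Before running, it may also be worth confirming that the V3D device is visible to Vulkan at all (assumption: the vulkan-tools package provides vulkaninfo; this step isn't part of the original instructions):
sudo apt-get install -y vulkan-tools
vulkaninfo --summary | grep -i deviceName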
I'm not sure whether it's actually GPU-accelerated though; I noticed this in the output:
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
The full log is below:
./o/llama.cpp/main/main -m /home/pi/.webthings/data/voco/llm/assistant/phi-2.Q4_K_S.gguf -p '# Famous Speech\nFour score and' -n 50
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
Log start
main: llamafile version 0.6.2
main: seed = 1706622373
llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from /home/pi/.webthings/data/voco/llm/assistant/phi-2.Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi2
llama_model_loader: - kv 1: general.name str = Phi2
llama_model_loader: - kv 2: phi2.context_length u32 = 2048
llama_model_loader: - kv 3: phi2.embedding_length u32 = 2560
llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 10240
llama_model_loader: - kv 5: phi2.block_count u32 = 32
llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32
llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32
llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32
llama_model_loader: - kv 10: general.file_type u32 = 14
llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,51200] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 50256
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 50256
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 50256
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 195 tensors
llama_model_loader: - type q4_K: 125 tensors
llama_model_loader: - type q5_K: 4 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 51200
llm_load_print_meta: n_merges = 50000
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2560
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_embd_head_k = 80
llm_load_print_meta: n_embd_head_v = 80
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2560
llm_load_print_meta: n_embd_v_gqa = 2560
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 10240
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 2.78 B
llm_load_print_meta: model size = 1.50 GiB (4.64 BPW)
llm_load_print_meta: general.name = Phi2
llm_load_print_meta: BOS token = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token = 50256 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_tensors: ggml ctx size = 0.12 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 1539.00 MiB
...........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 160.00 MiB, K (f16): 80.00 MiB, V (f16): 80.00 MiB
llama_new_context_with_model: CPU input buffer size = 6.01 MiB
llama_new_context_with_model: CPU compute buffer size = 115.50 MiB
llama_new_context_with_model: graph splits (measure): 1
system_info: n_threads = 4 / 4 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 50, n_keep = 0
# Famous Speech\nFour score and seven years ago our
llama_print_timings: load time = 15421.26 ms
llama_print_timings: sample time = 3.15 ms / 4 runs ( 0.79 ms per token, 1269.84 tokens per second)
llama_print_timings: prompt eval time = 18739.26 ms / 8 tokens ( 2342.41 ms per token, 0.43 tokens per second)
llama_print_timings: eval time = 43841.51 ms / 3 runs (14613.84 ms per token, 0.07 tokens per second)
llama_print_timings: total time = 73117.41 ms / 11 tokens
It seems they are speedily fixing bugs in llama.cpp:
issue: interactive mode is broken on Vulkan https://github.com/ggerganov/llama.cpp/issues/5217
Pull request https://github.com/ggerganov/llama.cpp/pull/5223
Regarding the "offloaded 0/33 layers to GPU" lines: you can offload layers to the GPU with the -ngl argument, which should give a much bigger speed improvement. Try -ngl 33, and if it crashes due to lack of GPU memory, keep reducing the number until it works (a sketch of that step-down search follows below).
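A minimal, untested sketch of that step-down search (the model path, prompt, and layer counts are placeholders, not from the thread):
# Try decreasing layer counts until one fits in GPU memory
for n in 33 24 16 8 4; do
  ./o/llama.cpp/main/main -m YOUR_MODEL_PATH_HERE.gguf -p 'test' -n 8 -ngl $n && break
done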
Thanks @Mar2ck !
It worked fine on the first try with -ngl 33.
The speed difference doesn't seem noticeable. Oddly, the base version itself seems to run much faster today compared to the last time I tried; back then it generated one word per second. Not sure why it's different now.
./o/llama.cpp/main/main -m /home/pi/.webthings/data/voco/llm/assistant/phi-2.Q4_K_S.gguf -p '# Famous Speech\nFour score and' -n 50
llama_print_timings: load time = 431.10 ms
llama_print_timings: sample time = 22.42 ms / 50 runs ( 0.45 ms per token, 2229.85 tokens per second)
llama_print_timings: prompt eval time = 886.69 ms / 8 tokens ( 110.84 ms per token, 9.02 tokens per second)
llama_print_timings: eval time = 8704.99 ms / 49 runs ( 177.65 ms per token, 5.63 tokens per second)
llama_print_timings: total time = 9637.94 ms / 57 tokens
./o/llama.cpp/main/main -m /home/pi/.webthings/data/voco/llm/assistant/phi-2.Q4_K_S.gguf -p '# Famous Speech\nFour score and' -n 50 -ngl 33
llama_print_timings: load time = 433.53 ms
llama_print_timings: sample time = 23.30 ms / 50 runs ( 0.47 ms per token, 2145.46 tokens per second)
llama_print_timings: prompt eval time = 896.34 ms / 8 tokens ( 112.04 ms per token, 8.93 tokens per second)
llama_print_timings: eval time = 8670.61 ms / 49 runs ( 176.95 ms per token, 5.65 tokens per second)
llama_print_timings: total time = 9615.66 ms / 57 tokens
I added the logs above. Technically speaking, the GPU version is actually a little slower, which is strange.
I tried again, with full system reboots in between.
Non-GPU version:
# Famous Speech\nFour score and seven years ago our fathers brought forth on this continent, a new nation...')
output = speech.replace('Nation', 'Nation-State')
print(output)
Output:
'Four score and seven years ago
llama_print_timings: load time = 19469.40 ms
llama_print_timings: sample time = 22.69 ms / 50 runs ( 0.45 ms per token, 2203.61 tokens per second)
llama_print_timings: prompt eval time = 855.38 ms / 8 tokens ( 106.92 ms per token, 9.35 tokens per second)
llama_print_timings: eval time = 8079.14 ms / 49 runs ( 164.88 ms per token, 6.07 tokens per second)
llama_print_timings: total time = 8980.39 ms / 57 tokens
GPU version:
# Famous Speech\nFour score and seven years ago our fathers brought forth on this continent, a new nation...',
'The United States of America is the world\'s oldest surviving federation.\n...'],
['I have a dream that my four little children
llama_print_timings: load time = 25512.36 ms
llama_print_timings: sample time = 23.41 ms / 50 runs ( 0.47 ms per token, 2135.57 tokens per second)
llama_print_timings: prompt eval time = 876.47 ms / 8 tokens ( 109.56 ms per token, 9.13 tokens per second)
llama_print_timings: eval time = 8216.39 ms / 49 runs ( 167.68 ms per token, 5.96 tokens per second)
llama_print_timings: total time = 9142.51 ms / 57 tokens
Funny how both runs decided the prompt was programming-related.
Wait a tick:
warning: --n-gpu-layers 33 was passed but no GPUs were found; falling back to CPU inference
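When that warning appears, one thing worth checking (my assumption: the v3d kernel driver exposes a DRM render node and membership in the render group gates access to it) is whether the node exists and the current user can open it:
# Does a DRM render node exist, and is the current user in the render group?
ls -l /dev/dri/
id -nG | grep -qw render && echo "in render group" || echo "not in render group"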
https://www.phoronix.com/news/Raspberry-Pi-OS-Default-V3DV With the Vulkan driver now installed by default in this OS, will that help you move forward? This is how I tested local LLMs on a Raspberry Pi 5; it's around 1 token/sec, which is very slow. https://aidatatools.com/2024/01/ollama-benchmark-on-raspberry-pi-5-ram-8gb/
@chuangtc That's great news, thanks for sharing.
Which model are you running though?
I got a lot more tokens per second than that running small models (tinyllama-1.1b-1t-openorca.Q4_K_M.gguf) on the CPU. On that topic, I look forward to seeing what the new mathematical functions created by @jart will do to improve running on the Pi further, as those are said to speed up context ingestion.
Here is where I am asking for help on Reddit: https://www.reddit.com/r/raspberry_pi/comments/1c24vga/how_to_make_llamafile_get_accelerated_during/ I also noticed what could be a bug in the vulkaninfo --summary output:
jason@raspberrypi5:~ $ vulkaninfo --summary
WARNING: [Loader Message] Code 0 : terminator_CreateInstance: Failed to CreateInstance in ICD 0. Skipping ICD.
==========
VULKANINFO
==========
Vulkan Instance Version: 1.3.239
Instance Extensions: count = 22
-------------------------------
VK_EXT_acquire_drm_display : extension revision 1
VK_EXT_acquire_xlib_display : extension revision 1
VK_EXT_debug_report : extension revision 10
VK_EXT_debug_utils : extension revision 2
VK_EXT_direct_mode_display : extension revision 1
VK_EXT_display_surface_counter : extension revision 1
VK_EXT_surface_maintenance1 : extension revision 1
VK_EXT_swapchain_colorspace : extension revision 4
VK_KHR_device_group_creation : extension revision 1
VK_KHR_display : extension revision 23
VK_KHR_external_fence_capabilities : extension revision 1
VK_KHR_external_memory_capabilities : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2 : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2 : extension revision 1
VK_KHR_portability_enumeration : extension revision 1
VK_KHR_surface : extension revision 25
VK_KHR_surface_protected_capabilities : extension revision 1
VK_KHR_wayland_surface : extension revision 6
VK_KHR_xcb_surface : extension revision 6
VK_KHR_xlib_surface : extension revision 6
Instance Layers: count = 2
--------------------------
VK_LAYER_MESA_device_select Linux device selection layer 1.3.211 version 1
VK_LAYER_MESA_overlay Mesa Overlay layer 1.3.211 version 1
Devices:
========
GPU0:
apiVersion = 1.2.255
driverVersion = 23.2.1
vendorID = 0x14e4
deviceID = 0x55701c33
deviceType = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
deviceName = V3D 7.1.7
driverID = DRIVER_ID_MESA_V3DV
driverName = V3DV Mesa
driverInfo = Mesa 23.2.1-1~bpo12+rpt3
conformanceVersion = 1.3.6.1
deviceUUID = 5fd8106e-741a-cafa-e080-fdb16cf11a80
driverUUID = 1698c6ef-161f-3213-5159-557202953ee9
GPU1:
apiVersion = 1.3.255
driverVersion = 0.0.1
vendorID = 0x10005
deviceID = 0x0000
deviceType = PHYSICAL_DEVICE_TYPE_CPU
deviceName = llvmpipe (LLVM 15.0.6, 128 bits)
driverID = DRIVER_ID_MESA_LLVMPIPE
driverName = llvmpipe
driverInfo = Mesa 23.2.1-1~bpo12+rpt3 (LLVM 15.0.6)
conformanceVersion = 1.3.1.1
deviceUUID = 6d657361-3233-2e32-2e31-2d317e627000
driverUUID = 6c6c766d-7069-7065-5555-494400000000
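Since the VK_LAYER_MESA_device_select layer is listed above, it may be possible to pin Vulkan to GPU0 (V3D) rather than the llvmpipe fallback. The vendor and device IDs below are copied from the output above, but this is an untested suggestion:
# Pin the Mesa device-select layer to the V3D GPU
export MESA_VK_DEVICE_SELECT=14e4:55701c33
vulkaninfo --summary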
Raspberry Pi 5 doesn't have all the Vulkan 1.3 capabilities: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10896
Update: the missing extensions are now marked as supported.
I've been watching this issue for a while and finally picked up another Pi 5 to play with last week. Is there any work to be done, or can we just use the latest release and enjoy all of the benefits that @jart has been working on?
llamafile has outstanding performance on RPI5 using the CPU alone. This applies to most quants, e.g. F16, K quants, etc. Last time I checked, GPU support wasn't possible. Even if it worked with llama.cpp upstream, we'd need to incorporate Vulkan binaries into our releases and maintain a fourth implementation of GGML.
I currently have model weights running on a Raspberry Pi 5 with the CPU, but I want to run the same model weights on the Raspberry Pi 5 with the GPU. Can someone help me out?
I'm experimenting with Llamafile on a Raspberry Pi 5 with 8 GB of RAM, in order to integrate it with an existing privacy-protecting smart home voice control system. This is working great so far, as long as very small models are used.
I was wondering: would it be possible to speed up inference on the Raspberry Pi 5 by using the GPU?
Through this Stack Overflow post I've found some frameworks that already do this, such as:
The Raspberry Pi 5's VideoCore GPU has vulkan drivers: https://www.phoronix.com/news/Mesa-RPi-5-VideoCore-7.1.x
Curious to hear your thoughts.
Related: https://github.com/Mozilla-Ocho/llamafile/issues/40