flatsiedatsie opened this issue 10 months ago
Ubuntu doesn't even support the Vulkan Mesa driver you linked yet, so I doubt Tencent and beatmup are using the GPU on RPI5. Vulkan Mesa is for graphics processing. You can't use it with OpenCL to multiply matrices. Even if we rewrote GGML in a shader language, libraries like OpenGL, GLFW, GLEW, etc. all depend on X Windows and can't run headlessly for general computation tasks like linear algebra. Broadcom claims their GPU is capable of general-purpose computation:
Although they are physically located within, and closely coupled to the 3D system, the QPUs are also capable of providing a general-purpose computation resource for non-3D software, such as video codecs and ISP tasks. https://docs.broadcom.com/doc/12358545
The community project that lets Linux users write programs for Broadcom's GPU was abandoned three years ago and no longer builds. https://github.com/wimrijnders/V3DLib If you can show me how to multiply a matrix on this GPU without depending on frameworks, then I'll reopen this issue and strongly consider supporting it.
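For reference, one quick headless sanity check (an editorial suggestion, not from the thread; it assumes the vulkan-tools package is installed) is whether the V3DV driver even advertises a compute-capable queue family, since that is what a headless matrix-multiply shader would be dispatched to:
# Assumption: vulkan-tools provides vulkaninfo; it can run over SSH without X/Wayland
vulkaninfo | grep -iE 'deviceName|compute'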
Thanks for the enlightening explanation. That is both good and bad news. Great that you're also enthusiastic about Raspberry Pi optimization, but sad to hear (and read) that there is so little support for the VideoCore hardware.
Looks like someone actually did rewrite GGML in a shader language. Yesterday ggerganov/llama.cpp#2059 got merged into llama.cpp, which adds Vulkan support and a whole bunch of shaders. This gives me new hope that Raspberry Pi 5 GPU support will be possible. Unfortunately, it doesn't appear possible today. If I build llama.cpp at head with make LLAMA_VULKAN=1 and run TinyLlama Q4_0, then I get this:
jart@pi5:~/llama.cpp$ ./main -e -m ~/TinyLlama-1.1B-Chat-v1.0.Q4_0.gguf -p '# Famous Speech\nFour score and' -n 50
Log start
main: build = 2008 (ceebbb5b)
main: built with cc (Ubuntu 13.2.0-4ubuntu3) 13.2.0 for aarch64-linux-gnu
main: seed = 1706575520
TU: error: ../src/freedreno/vulkan/tu_knl.cc:251: device /dev/dri/renderD128 (v3d) is not compatible with turnip (VK_ERROR_INCOMPATIBLE_DRIVER)
ggml_vulkan: Using V3D 7.1.7 | fp16: 0 | warp size: 16
I'm going to leave this open until we can circle back (possibly in several months to a year), the distro driver situation improves, or someone else leaves a comment here helping us figure out how to do this. In the meantime, please do try this yourself. It's possible I broke my Ubuntu install by using a PPA earlier.
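For anyone hitting the same VK_ERROR_INCOMPATIBLE_DRIVER from turnip, a possible (untested) workaround sketch is to list the installed Vulkan ICD manifests and point the loader at the Broadcom one explicitly. The path and filename below are typical for Mesa on Debian-based distros but are assumptions; substitute whatever the ls shows:
# List installed Vulkan ICD manifests (usual Mesa location; may differ per distro)
ls /usr/share/vulkan/icd.d/
# Force the Broadcom V3DV ICD if a conflicting driver (e.g. turnip from a PPA) is also present
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/broadcom_icd.aarch64.json ./main -m ~/TinyLlama-1.1B-Chat-v1.0.Q4_0.gguf -p '# Famous Speech\nFour score and' -n 50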
Awesome! It seems someone else in that thread also ran into an issue.
I'll attempt building Llamafile from source on the Pi 5 and let you know how it goes.
It compiles and runs.
# Famous Speech\nFour score and seven years ago our
etc
This is on a Pi 5 (8 GB) with the latest Raspberry Pi OS Lite, fully updated/upgraded, and the Mesa Vulkan drivers installed.
sudo apt-get update -y && sudo apt-get upgrade -y
sudo apt-get install libvulkan1 mesa-vulkan-drivers
git clone https://github.com/Mozilla-Ocho/llamafile.git
cd llamafile
make LLAMA_VULKAN=1
./o/llama.cpp/main/main -m YOUR_MODEL_PATH_HERE.gguf -p '# Famous Speech\nFour score and' -n 50
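Before running, it may also be worth confirming that the V3D device is visible to Vulkan at all (assumption: the vulkan-tools package provides vulkaninfo; this step isn't part of the original instructions):
sudo apt-get install -y vulkan-tools
vulkaninfo --summary | grep -i deviceName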
I'm not sure whether it's actually GPU-accelerated though; I noticed this in the output:
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
The full log is below:
./o/llama.cpp/main/main -m /home/pi/.webthings/data/voco/llm/assistant/phi-2.Q4_K_S.gguf -p '# Famous Speech\nFour score and' -n 50
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
Log start
main: llamafile version 0.6.2
main: seed = 1706622373
llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from /home/pi/.webthings/data/voco/llm/assistant/phi-2.Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi2
llama_model_loader: - kv 1: general.name str = Phi2
llama_model_loader: - kv 2: phi2.context_length u32 = 2048
llama_model_loader: - kv 3: phi2.embedding_length u32 = 2560
llama_model_loader: - kv 4: phi2.feed_forward_length u32 = 10240
llama_model_loader: - kv 5: phi2.block_count u32 = 32
llama_model_loader: - kv 6: phi2.attention.head_count u32 = 32
llama_model_loader: - kv 7: phi2.attention.head_count_kv u32 = 32
llama_model_loader: - kv 8: phi2.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 9: phi2.rope.dimension_count u32 = 32
llama_model_loader: - kv 10: general.file_type u32 = 14
llama_model_loader: - kv 11: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 12: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,51200] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 50256
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 50256
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 50256
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 195 tensors
llama_model_loader: - type q4_K: 125 tensors
llama_model_loader: - type q5_K: 4 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 51200
llm_load_print_meta: n_merges = 50000
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2560
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_embd_head_k = 80
llm_load_print_meta: n_embd_head_v = 80
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2560
llm_load_print_meta: n_embd_v_gqa = 2560
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 10240
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Small
llm_load_print_meta: model params = 2.78 B
llm_load_print_meta: model size = 1.50 GiB (4.64 BPW)
llm_load_print_meta: general.name = Phi2
llm_load_print_meta: BOS token = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token = 50256 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_tensors: ggml ctx size = 0.12 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 1539.00 MiB
...........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 160.00 MiB, K (f16): 80.00 MiB, V (f16): 80.00 MiB
llama_new_context_with_model: CPU input buffer size = 6.01 MiB
llama_new_context_with_model: CPU compute buffer size = 115.50 MiB
llama_new_context_with_model: graph splits (measure): 1
system_info: n_threads = 4 / 4 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 50, n_keep = 0
# Famous Speech\nFour score and seven years ago our
llama_print_timings: load time = 15421.26 ms
llama_print_timings: sample time = 3.15 ms / 4 runs ( 0.79 ms per token, 1269.84 tokens per second)
llama_print_timings: prompt eval time = 18739.26 ms / 8 tokens ( 2342.41 ms per token, 0.43 tokens per second)
llama_print_timings: eval time = 43841.51 ms / 3 runs (14613.84 ms per token, 0.07 tokens per second)
llama_print_timings: total time = 73117.41 ms / 11 tokens
It seems they are speedily fixing bugs in llama.cpp:
issue: interactive mode is broken on Vulkan https://github.com/ggerganov/llama.cpp/issues/5217
Pull request https://github.com/ggerganov/llama.cpp/pull/5223
Regarding the "offloaded 0/33 layers to GPU" lines: you can offload layers to the GPU with the -ngl argument, which should give a much bigger speed improvement. Try -ngl 33, and if it crashes due to lack of GPU memory, keep reducing the number until it works (a sketch of that step-down search follows below).
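A minimal, untested sketch of that step-down search (the model path, prompt, and layer counts are placeholders, not from the thread):
# Try decreasing layer counts until one fits in GPU memory
for n in 33 24 16 8 4; do
  ./o/llama.cpp/main/main -m YOUR_MODEL_PATH_HERE.gguf -p 'test' -n 8 -ngl $n && break
done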
Thanks @Mar2ck !
It worked fine on the first try with -ngl 33.
The speed difference doesn't seem noticeable. Oddly, the base version itself seems to run much faster today compared to the last time I tried; back then it generated one word per second. Not sure why it's different now.
./o/llama.cpp/main/main -m /home/pi/.webthings/data/voco/llm/assistant/phi-2.Q4_K_S.gguf -p '# Famous Speech\nFour score and' -n 50
llama_print_timings: load time = 431.10 ms
llama_print_timings: sample time = 22.42 ms / 50 runs ( 0.45 ms per token, 2229.85 tokens per second)
llama_print_timings: prompt eval time = 886.69 ms / 8 tokens ( 110.84 ms per token, 9.02 tokens per second)
llama_print_timings: eval time = 8704.99 ms / 49 runs ( 177.65 ms per token, 5.63 tokens per second)
llama_print_timings: total time = 9637.94 ms / 57 tokens
./o/llama.cpp/main/main -m /home/pi/.webthings/data/voco/llm/assistant/phi-2.Q4_K_S.gguf -p '# Famous Speech\nFour score and' -n 50 -ngl 33
llama_print_timings: load time = 433.53 ms
llama_print_timings: sample time = 23.30 ms / 50 runs ( 0.47 ms per token, 2145.46 tokens per second)
llama_print_timings: prompt eval time = 896.34 ms / 8 tokens ( 112.04 ms per token, 8.93 tokens per second)
llama_print_timings: eval time = 8670.61 ms / 49 runs ( 176.95 ms per token, 5.65 tokens per second)
llama_print_timings: total time = 9615.66 ms / 57 tokens
I added the logs above. Technically speaking, the GPU version is actually a little slower, which is strange.
I tried again, with full system reboots in between.
Non-GPU version:
# Famous Speech\nFour score and seven years ago our fathers brought forth on this continent, a new nation...')
output = speech.replace('Nation', 'Nation-State')
print(output)
Output:
'Four score and seven years ago
llama_print_timings: load time = 19469.40 ms
llama_print_timings: sample time = 22.69 ms / 50 runs ( 0.45 ms per token, 2203.61 tokens per second)
llama_print_timings: prompt eval time = 855.38 ms / 8 tokens ( 106.92 ms per token, 9.35 tokens per second)
llama_print_timings: eval time = 8079.14 ms / 49 runs ( 164.88 ms per token, 6.07 tokens per second)
llama_print_timings: total time = 8980.39 ms / 57 tokens
GPU version:
# Famous Speech\nFour score and seven years ago our fathers brought forth on this continent, a new nation...',
'The United States of America is the world\'s oldest surviving federation.\n...'],
['I have a dream that my four little children
llama_print_timings: load time = 25512.36 ms
llama_print_timings: sample time = 23.41 ms / 50 runs ( 0.47 ms per token, 2135.57 tokens per second)
llama_print_timings: prompt eval time = 876.47 ms / 8 tokens ( 109.56 ms per token, 9.13 tokens per second)
llama_print_timings: eval time = 8216.39 ms / 49 runs ( 167.68 ms per token, 5.96 tokens per second)
llama_print_timings: total time = 9142.51 ms / 57 tokens
Funny how both runs decided the prompt was programming-related.
Wait a tick:
warning: --n-gpu-layers 33 was passed but no GPUs were found; falling back to CPU inference
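When that warning appears, one thing worth checking (my assumption: the v3d kernel driver exposes a DRM render node and membership in the render group gates access to it) is whether the node exists and the current user can open it:
# Does a DRM render node exist, and is the current user in the render group?
ls -l /dev/dri/
id -nG | grep -qw render && echo "in render group" || echo "not in render group"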
https://www.phoronix.com/news/Raspberry-Pi-OS-Default-V3DV With the Vulkan driver now installed by default in this OS, will that help you move forward? This is how I tested local LLMs on a Raspberry Pi 5; it's around 1 token/sec, which is very slow. https://aidatatools.com/2024/01/ollama-benchmark-on-raspberry-pi-5-ram-8gb/
@chuangtc That's great news, thanks for sharing.
Which model are you running though?
I got a lot more tokens per second than that running small models (tinyllama-1.1b-1t-openorca.Q4_K_M.gguf) on the CPU. On that topic, I look forward to seeing what the new mathematical functions created by @jart will do to improve running on the Pi further, as those are said to speed up context ingestion.
Here is where I am asking for help on Reddit: https://www.reddit.com/r/raspberry_pi/comments/1c24vga/how_to_make_llamafile_get_accelerated_during/ I also noticed what could be a bug in the vulkaninfo --summary output:
jason@raspberrypi5:~ $ vulkaninfo --summary
WARNING: [Loader Message] Code 0 : terminator_CreateInstance: Failed to CreateInstance in ICD 0. Skipping ICD.
==========
VULKANINFO
==========
Vulkan Instance Version: 1.3.239
Instance Extensions: count = 22
-------------------------------
VK_EXT_acquire_drm_display : extension revision 1
VK_EXT_acquire_xlib_display : extension revision 1
VK_EXT_debug_report : extension revision 10
VK_EXT_debug_utils : extension revision 2
VK_EXT_direct_mode_display : extension revision 1
VK_EXT_display_surface_counter : extension revision 1
VK_EXT_surface_maintenance1 : extension revision 1
VK_EXT_swapchain_colorspace : extension revision 4
VK_KHR_device_group_creation : extension revision 1
VK_KHR_display : extension revision 23
VK_KHR_external_fence_capabilities : extension revision 1
VK_KHR_external_memory_capabilities : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2 : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2 : extension revision 1
VK_KHR_portability_enumeration : extension revision 1
VK_KHR_surface : extension revision 25
VK_KHR_surface_protected_capabilities : extension revision 1
VK_KHR_wayland_surface : extension revision 6
VK_KHR_xcb_surface : extension revision 6
VK_KHR_xlib_surface : extension revision 6
Instance Layers: count = 2
--------------------------
VK_LAYER_MESA_device_select Linux device selection layer 1.3.211 version 1
VK_LAYER_MESA_overlay Mesa Overlay layer 1.3.211 version 1
Devices:
========
GPU0:
apiVersion = 1.2.255
driverVersion = 23.2.1
vendorID = 0x14e4
deviceID = 0x55701c33
deviceType = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
deviceName = V3D 7.1.7
driverID = DRIVER_ID_MESA_V3DV
driverName = V3DV Mesa
driverInfo = Mesa 23.2.1-1~bpo12+rpt3
conformanceVersion = 1.3.6.1
deviceUUID = 5fd8106e-741a-cafa-e080-fdb16cf11a80
driverUUID = 1698c6ef-161f-3213-5159-557202953ee9
GPU1:
apiVersion = 1.3.255
driverVersion = 0.0.1
vendorID = 0x10005
deviceID = 0x0000
deviceType = PHYSICAL_DEVICE_TYPE_CPU
deviceName = llvmpipe (LLVM 15.0.6, 128 bits)
driverID = DRIVER_ID_MESA_LLVMPIPE
driverName = llvmpipe
driverInfo = Mesa 23.2.1-1~bpo12+rpt3 (LLVM 15.0.6)
conformanceVersion = 1.3.1.1
deviceUUID = 6d657361-3233-2e32-2e31-2d317e627000
driverUUID = 6c6c766d-7069-7065-5555-494400000000
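Since the VK_LAYER_MESA_device_select layer is listed above, it may be possible to pin Vulkan to GPU0 (V3D) rather than the llvmpipe fallback. The vendor and device IDs below are copied from the output above, but this is an untested suggestion:
# Pin the Mesa device-select layer to the V3D GPU
export MESA_VK_DEVICE_SELECT=14e4:55701c33
vulkaninfo --summary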
Raspberry Pi 5 doesn't have all the Vulkan 1.3 capabilities: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10896
Update: the missing extensions are now marked as supported.
I've been watching this issue for a while and finally picked up another Pi 5 to play with last week. Is there any work to be done, or can we just use the latest release and enjoy all of the benefits that @jart has been working on?
llamafile has outstanding performance on RPI5 using the CPU alone. This applies to most quants, e.g. F16, K quants, etc. Last time I checked, GPU support wasn't possible. Even if it worked with llama.cpp upstream, we'd need to incorporate Vulkan binaries into our releases and maintain a fourth implementation of GGML.
I currently have model weights running on a Raspberry Pi 5 with the CPU, but I want to run the same model weights on the Raspberry Pi 5 with the GPU. Can someone help me out?
I'm experimenting with Llamafile on a Raspberry Pi 5 with 8 GB of RAM, in order to integrate it with an existing privacy-protecting smart home voice control system. This is working great so far, as long as very small models are used.
I was wondering: would it be possible to speed up inference on the Raspberry Pi 5 by using the GPU?
Through this Stack Overflow post I've found some frameworks that already do this, such as:
The Raspberry Pi 5's VideoCore GPU has vulkan drivers: https://www.phoronix.com/news/Mesa-RPi-5-VideoCore-7.1.x
Curious to hear your thoughts.
Related: https://github.com/Mozilla-Ocho/llamafile/issues/40