Feature Request: Improve Gemma v2 model performance on Vulkan backend

Prerequisites

[X] I am running the latest code. Mention the version if possible as well.
[X] I carefully followed the README.md.
[X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
[X] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Hi team, first of all, thank you for continuing to improve this awesome project. I just discovered that with the Vulkan backend on Linux or FreeBSD, using the Mesa Vulkan driver (RADV), the Gemma-2-9B model is roughly 4x slower than the Llama-3-8B model (about 4.5x on pp512 and about 3.2x on tg64). Here are the results:

./llama-bench -m ~/Models/llama-3-8b-it_q6_k.gguf -n 64 -p 512 -ngl 99
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 7900 XT (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |         pp512 |   612.28 ± 85.52 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |          tg64 |     56.36 ± 0.92 |
build: 17eb6aa8 (3386)

./llama-bench -m ~/Models/gemma-2-9b-it_Q4_K_L.gguf -n 64 -p 512 -ngl 99
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 7900 XT (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64
| gemma2 9B Q4_K - Medium        |   6.47 GiB |    10.16 B | Vulkan     |  99 |         pp512 |    134.82 ± 0.10 |
| gemma2 9B Q4_K - Medium        |   6.47 GiB |    10.16 B | Vulkan     |  99 |          tg64 |     17.47 ± 1.65 |
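
For reference, the slowdown can be worked out from the t/s columns of the two runs above. This is only a minimal sketch: the numbers are copied by hand from the tables, and the names `llama3_8b`, `gemma2_9b`, and `slowdown` are made up for illustration. Note that the two models use different quantizations (Q6_K vs Q4_K), so the comparison is approximate.

```python
# Mean throughput in tokens/s, copied from the two llama-bench runs above.
llama3_8b = {"pp512": 612.28, "tg64": 56.36}
gemma2_9b = {"pp512": 134.82, "tg64": 17.47}


def slowdown(baseline, other):
    """Return the baseline/other throughput ratio for each test.

    A ratio above 1 means `other` is slower than `baseline`.
    """
    return {test: baseline[test] / other[test] for test in baseline}


ratios = slowdown(llama3_8b, gemma2_9b)
for test, ratio in ratios.items():
    print(f"{test}: Gemma-2-9B is {ratio:.2f}x slower than Llama-3-8B")
```

The ratio is taken over the mean values only; the pp512 result for Llama-3-8B has a large spread (± 85.52), so the prompt-processing ratio should be read as a rough figure.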

Here's my setup:

OS: FreeBSD-15-Current
GPU Driver: drm-6.1-lts and mesa radv driver
CPU: dual socket E5-2680v4
GPU: AMD 7900XT (20GB)

Motivation

Gemma-2 is a high-quality model for its size, and optimizing the Vulkan backend for it would be a very good addition.

Possible Implementation

No response
